Mirror of https://github.com/explosion/spaCy.git

Commit 69dbd59a13: Merge branch 'master' into spacy.io

.github/contributors/ameyuuno.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Alexey Kim           |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-07-09           |
| GitHub username                | ameyuuno             |
| Website (optional)             | https://ameyuuno.io  |

.github/contributors/askhogan.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(The agreement text in this file is identical to the one in `ameyuuno.md` above, except that **"us"** is defined as [ExplosionAI GmbH](https://explosion.ai/legal).)

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Patrick Hogan        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 7/7/2019             |
| GitHub username                | askhogan@gmail.com   |
| Website (optional)             |                      |

.github/contributors/cedar101.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(The agreement text in this file is identical to the one in `ameyuuno.md` above.)

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Kim, Baeg-il             |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2019-07-03               |
| GitHub username                | cedar101                 |
| Website (optional)             |                          |

.github/contributors/khellan.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(The agreement text in this file is identical to the one in `ameyuuno.md` above, except that **"us"** is defined as [ExplosionAI GmbH](https://explosion.ai/legal).)

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Knut O. Hellan       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 02.07.2019           |
| GitHub username                | khellan              |
| Website (optional)             | knuthellan.com       |

.github/contributors/kognate.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(The agreement text in this file is identical to the one in `ameyuuno.md` above, except that **"us"** is defined as [ExplosionAI GmbH](https://explosion.ai/legal).)

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Joshua B. Smith      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | July 7, 2019         |
| GitHub username                | kognate              |
| Website (optional)             |                      |

.github/contributors/rokasramas.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
(The agreement text in this file is identical to the one in `ameyuuno.md` above, except that **"us"** is defined as [ExplosionAI GmbH](https://explosion.ai/legal).)

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
|------------------------------- | ----------------------- |
| Name                           | Rokas Ramanauskas       |
| Company name (if applicable)   | TokenMill               |
| Title or role (if applicable)  | Software Engineer       |
| Date                           | 2019-07-02              |
| GitHub username                | rokasramas              |
| Website (optional)             | http://www.tokenmill.lt |

.github/contributors/yashpatadia.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
 | 
				
			||||||
 | 
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           |     Yash Patadia     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           |      11/07/2019      |
| GitHub username                |       yash1994       |
| Website (optional)             |                      |
2 .gitignore vendored
@@ -56,6 +56,8 @@ parts/
sdist/
var/
*.egg-info/
+pip-wheel-metadata/
+Pipfile.lock
.installed.cfg
*.egg
.eggs
0 bin/__init__.py Normal file

@@ -5,7 +5,6 @@ import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
-from preshed.counter import PreshCounter
import plac
import spacy
@@ -292,8 +292,8 @@ def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):

    def spans_score(gold_spans, system_spans):
        correct, gi, si = 0, 0, 0
-        undersegmented = list()
-        oversegmented = list()
+        undersegmented = []
+        oversegmented = []
        combo = 0
        previous_end_si_earlier = False
        previous_end_gi_earlier = False
0 bin/wiki_entity_linking/__init__.py Normal file

171 bin/wiki_entity_linking/kb_creator.py Normal file
@@ -0,0 +1,171 @@
# coding: utf-8
from __future__ import unicode_literals

from .train_descriptions import EntityEncoder
from . import wikidata_processor as wd, wikipedia_processor as wp
from spacy.kb import KnowledgeBase

import csv
import datetime


INPUT_DIM = 300  # dimension of pre-trained input vectors
DESC_WIDTH = 64  # dimension of output entity vectors


def create_kb(nlp, max_entities_per_alias, min_entity_freq, min_occ,
              entity_def_output, entity_descr_output,
              count_input, prior_prob_input, wikidata_input):
    # Create the knowledge base from Wikidata entries
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=DESC_WIDTH)

    # disable this part of the pipeline when rerunning the KB generation from preprocessed files
    read_raw_data = True

    if read_raw_data:
        print()
        print(" * _read_wikidata_entities", datetime.datetime.now())
        title_to_id, id_to_descr = wd.read_wikidata_entities_json(wikidata_input)

        # write the title-ID and ID-description mappings to file
        _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr)

    else:
        # read the mappings from file
        title_to_id = get_entity_to_id(entity_def_output)
        id_to_descr = get_id_to_description(entity_descr_output)

    print()
    print(" * _get_entity_frequencies", datetime.datetime.now())
    print()
    entity_frequencies = wp.get_all_frequencies(count_input=count_input)

    # filter the entities for the KB by frequency, because there's just too much data (8M entities) otherwise
    filtered_title_to_id = dict()
    entity_list = []
    description_list = []
    frequency_list = []
    for title, entity in title_to_id.items():
        freq = entity_frequencies.get(title, 0)
        desc = id_to_descr.get(entity, None)
        if desc and freq > min_entity_freq:
            entity_list.append(entity)
            description_list.append(desc)
            frequency_list.append(freq)
            filtered_title_to_id[title] = entity

    print("Kept", len(filtered_title_to_id.keys()), "out of", len(title_to_id.keys()),
          "titles with filter frequency", min_entity_freq)

    print()
    print(" * train entity encoder", datetime.datetime.now())
    print()
    encoder = EntityEncoder(nlp, INPUT_DIM, DESC_WIDTH)
    encoder.train(description_list=description_list, to_print=True)

    print()
    print(" * get entity embeddings", datetime.datetime.now())
    print()
    embeddings = encoder.apply_encoder(description_list)

    print()
    print(" * adding", len(entity_list), "entities", datetime.datetime.now())
    kb.set_entities(entity_list=entity_list, prob_list=frequency_list, vector_list=embeddings)

    print()
    print(" * adding aliases", datetime.datetime.now())
    print()
    _add_aliases(kb, title_to_id=filtered_title_to_id,
                 max_entities_per_alias=max_entities_per_alias, min_occ=min_occ,
                 prior_prob_input=prior_prob_input)

    print()
    print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases())

    print("done with kb", datetime.datetime.now())
    return kb


def _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr):
    with open(entity_def_output, mode='w', encoding='utf8') as id_file:
        id_file.write("WP_title" + "|" + "WD_id" + "\n")
        for title, qid in title_to_id.items():
            id_file.write(title + "|" + str(qid) + "\n")

    with open(entity_descr_output, mode='w', encoding='utf8') as descr_file:
        descr_file.write("WD_id" + "|" + "description" + "\n")
        for qid, descr in id_to_descr.items():
            descr_file.write(str(qid) + "|" + descr + "\n")


def get_entity_to_id(entity_def_output):
    entity_to_id = dict()
    with open(entity_def_output, 'r', encoding='utf8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_id[row[0]] = row[1]
    return entity_to_id


def get_id_to_description(entity_descr_output):
    id_to_desc = dict()
    with open(entity_descr_output, 'r', encoding='utf8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        # skip header
        next(csvreader)
        for row in csvreader:
            id_to_desc[row[0]] = row[1]
    return id_to_desc


def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input):
    wp_titles = title_to_id.keys()

    # adding aliases with prior probabilities
    # we can read this file sequentially, it's sorted by alias, and then by count
    with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()
        previous_alias = None
        total_count = 0
        counts = []
        entities = []
        while line:
            splits = line.replace('\n', "").split(sep='|')
            new_alias = splits[0]
            count = int(splits[1])
            entity = splits[2]

            if new_alias != previous_alias and previous_alias:
                # done reading the previous alias --> output
                if len(entities) > 0:
                    selected_entities = []
                    prior_probs = []
                    for ent_count, ent_string in zip(counts, entities):
                        if ent_string in wp_titles:
                            wd_id = title_to_id[ent_string]
                            p_entity_givenalias = ent_count / total_count
                            selected_entities.append(wd_id)
                            prior_probs.append(p_entity_givenalias)

                    if selected_entities:
                        try:
                            kb.add_alias(alias=previous_alias, entities=selected_entities, probabilities=prior_probs)
                        except ValueError as e:
                            print(e)
                total_count = 0
                counts = []
                entities = []

            total_count += count

            if len(entities) < max_entities_per_alias and count >= min_occ:
                counts.append(count)
                entities.append(entity)
            previous_alias = new_alias

            line = prior_file.readline()
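The module above strings together the Wikidata parsing, the frequency filter, the entity encoder and the alias priors. As a rough illustration only, a driver for it might look like the sketch below; the file paths, thresholds, output directory and model name are hypothetical placeholders, and the keyword arguments simply mirror the create_kb signature shown above.

# Hypothetical driver for kb_creator.create_kb -- paths and thresholds are
# illustrative only and are not taken from the diff above.
from pathlib import Path

import spacy
from bin.wiki_entity_linking import kb_creator

output_dir = Path("wikidata_kb_output")   # assumed working directory for the intermediate CSV files
output_dir.mkdir(exist_ok=True)

nlp = spacy.load("en_core_web_lg")        # a model with 300-dim vectors, matching INPUT_DIM

kb = kb_creator.create_kb(
    nlp,
    max_entities_per_alias=10,
    min_entity_freq=20,
    min_occ=5,
    entity_def_output=output_dir / "entity_defs.csv",            # written by _write_entity_files
    entity_descr_output=output_dir / "entity_descriptions.csv",  # written by _write_entity_files
    count_input=output_dir / "entity_counts.csv",                # assumed output of wikipedia_processor
    prior_prob_input=output_dir / "prior_probs.csv",             # assumed output of wikipedia_processor
    wikidata_input="latest-all.json.bz2",                        # Wikidata dump
)
print("entities:", kb.get_size_entities(), "aliases:", kb.get_size_aliases())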
							
								
								
									
121 bin/wiki_entity_linking/train_descriptions.py Normal file
@@ -0,0 +1,121 @@
# coding: utf-8
from random import shuffle

import numpy as np

from spacy._ml import zero_init, create_default_optimizer
from spacy.cli.pretrain import get_cossim_loss

from thinc.v2v import Model
from thinc.api import chain
from thinc.neural._classes.affine import Affine


class EntityEncoder:
    """
    Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
    This entity vector will be stored in the KB, for further downstream use in the entity model.
    """

    DROP = 0
    EPOCHS = 5
    STOP_THRESHOLD = 0.04

    BATCH_SIZE = 1000

    def __init__(self, nlp, input_dim, desc_width):
        self.nlp = nlp
        self.input_dim = input_dim
        self.desc_width = desc_width
        self.encoder = None  # built in _build_network when train() is called

    def apply_encoder(self, description_list):
        if self.encoder is None:
            raise ValueError("Can not apply encoder before training it")

        batch_size = 100000

        start = 0
        stop = min(batch_size, len(description_list))
        encodings = []

        while start < len(description_list):
            docs = list(self.nlp.pipe(description_list[start:stop]))
            doc_embeddings = [self._get_doc_embedding(doc) for doc in docs]
            enc = self.encoder(np.asarray(doc_embeddings))
            encodings.extend(enc.tolist())

            start = start + batch_size
            stop = min(stop + batch_size, len(description_list))

        return encodings

    def train(self, description_list, to_print=False):
        processed, loss = self._train_model(description_list)
        if to_print:
            print("Trained on", processed, "entities across", self.EPOCHS, "epochs")
            print("Final loss:", loss)

    def _train_model(self, description_list):
        # TODO: when loss gets too low, a 'mean of empty slice' warning is thrown by numpy

        self._build_network(self.input_dim, self.desc_width)

        processed = 0
        loss = 1
        descriptions = description_list.copy()  # copy this list so that shuffling does not affect other functions

        for i in range(self.EPOCHS):
            shuffle(descriptions)

            batch_nr = 0
            start = 0
            stop = min(self.BATCH_SIZE, len(descriptions))

            while loss > self.STOP_THRESHOLD and start < len(descriptions):
                batch = []
                for descr in descriptions[start:stop]:
                    doc = self.nlp(descr)
                    doc_vector = self._get_doc_embedding(doc)
                    batch.append(doc_vector)

                loss = self._update(batch)
                print(i, batch_nr, loss)
                processed += len(batch)

                batch_nr += 1
                start = start + self.BATCH_SIZE
                stop = min(stop + self.BATCH_SIZE, len(descriptions))

        return processed, loss

    @staticmethod
    def _get_doc_embedding(doc):
        indices = np.zeros((len(doc),), dtype="i")
        for i, word in enumerate(doc):
            if word.orth in doc.vocab.vectors.key2row:
                indices[i] = doc.vocab.vectors.key2row[word.orth]
            else:
                indices[i] = 0
        word_vectors = doc.vocab.vectors.data[indices]
        doc_vector = np.mean(word_vectors, axis=0)
        return doc_vector

    def _build_network(self, orig_width, hidden_with):
        with Model.define_operators({">>": chain}):
            # very simple encoder-decoder model
            self.encoder = Affine(hidden_with, orig_width)
            self.model = self.encoder >> zero_init(Affine(orig_width, hidden_with, drop_factor=0.0))
        self.sgd = create_default_optimizer(self.model.ops)

    def _update(self, vectors):
        predictions, bp_model = self.model.begin_update(np.asarray(vectors), drop=self.DROP)
        loss, d_scores = self._get_loss(scores=predictions, golds=np.asarray(vectors))
        bp_model(d_scores, sgd=self.sgd)
        return loss / len(vectors)

    @staticmethod
    def _get_loss(golds, scores):
        loss, gradients = get_cossim_loss(scores, golds)
        return loss, gradients
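For clarity, here is a small, self-contained sketch of how EntityEncoder might be exercised on its own, outside create_kb, assuming the spaCy 2.1-era internals imported above are available. The toy descriptions are invented; the 300/64 dimensions follow the INPUT_DIM and DESC_WIDTH constants used in kb_creator.py, and a real run would feed the filtered Wikidata descriptions instead.

# Illustrative, stand-alone use of EntityEncoder. Descriptions are made up.
import spacy
from bin.wiki_entity_linking.train_descriptions import EntityEncoder

nlp = spacy.load("en_core_web_lg")   # pre-trained 300-dim word vectors

descriptions = [
    "American singer-songwriter and actor",
    "capital and largest city of France",
    "programming language created by Guido van Rossum",
]

encoder = EntityEncoder(nlp, input_dim=300, desc_width=64)
encoder.train(description_list=descriptions, to_print=True)

# Each description is compressed into a 64-dim vector that can be stored in the KB.
vectors = encoder.apply_encoder(descriptions)
print(len(vectors), "vectors of width", len(vectors[0]))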
							
								
								
									
353 bin/wiki_entity_linking/training_set_creator.py Normal file
@@ -0,0 +1,353 @@
# coding: utf-8
from __future__ import unicode_literals

import os
import re
import bz2
import datetime

from spacy.gold import GoldParse
from bin.wiki_entity_linking import kb_creator

"""
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
Gold-standard entities are stored in one file in standoff format (by character offset).
"""

ENTITY_FILE = "gold_entities.csv"


def create_training(wikipedia_input, entity_def_input, training_output):
    wp_to_id = kb_creator.get_entity_to_id(entity_def_input)
    _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None)


def _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None):
    """
    Read the XML Wikipedia data to parse out training data:
    raw text data + positive instances
    """
    title_regex = re.compile(r'(?<=<title>).*(?=</title>)')
    id_regex = re.compile(r'(?<=<id>)\d*(?=</id>)')

    read_ids = set()
    entityfile_loc = training_output / ENTITY_FILE
    with open(entityfile_loc, mode="w", encoding='utf8') as entityfile:
        # write entity training header file
        _write_training_entity(outputfile=entityfile,
                               article_id="article_id",
                               alias="alias",
                               entity="WD_id",
                               start="start",
                               end="end")

        with bz2.open(wikipedia_input, mode='rb') as file:
            line = file.readline()
            cnt = 0
            article_text = ""
            article_title = None
            article_id = None
            reading_text = False
            reading_revision = False
            while line and (not limit or cnt < limit):
                if cnt % 1000000 == 0:
                    print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
                clean_line = line.strip().decode("utf-8")

                if clean_line == "<revision>":
                    reading_revision = True
                elif clean_line == "</revision>":
                    reading_revision = False

                # Start reading new page
                if clean_line == "<page>":
                    article_text = ""
                    article_title = None
                    article_id = None

                # finished reading this page
                elif clean_line == "</page>":
                    if article_id:
                        try:
                            _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text.strip(),
                                             training_output)
                        except Exception as e:
                            print("Error processing article", article_id, article_title, e)
                    else:
                        print("Done processing a page, but couldn't find an article_id ?", article_title)
                    article_text = ""
                    article_title = None
                    article_id = None
                    reading_text = False
                    reading_revision = False

                # start reading text within a page
                if "<text" in clean_line:
                    reading_text = True

                if reading_text:
                    article_text += " " + clean_line

                # stop reading text within a page (we assume a new page doesn't start on the same line)
                if "</text" in clean_line:
                    reading_text = False

                # read the ID of this article (outside the revision portion of the document)
                if not reading_revision:
                    ids = id_regex.search(clean_line)
                    if ids:
                        article_id = ids[0]
                        if article_id in read_ids:
                            print("Found duplicate article ID", article_id, clean_line)  # This should never happen ...
                        read_ids.add(article_id)

                # read the title of this article (outside the revision portion of the document)
                if not reading_revision:
                    titles = title_regex.search(clean_line)
                    if titles:
                        article_title = titles[0].strip()

                line = file.readline()
                cnt += 1


text_regex = re.compile(r'(?<=<text xml:space=\"preserve\">).*(?=</text)')


def _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text, training_output):
    found_entities = False

    # ignore meta Wikipedia pages
    if article_title.startswith("Wikipedia:"):
        return

    # remove the text tags
    text = text_regex.search(article_text).group(0)

    # stop processing if this is a redirect page
    if text.startswith("#REDIRECT"):
        return

    # get the raw text without markup etc, keeping only interwiki links
    clean_text = _get_clean_wp_text(text)

    # read the text char by char to get the right offsets for the interwiki links
    final_text = ""
    open_read = 0
    reading_text = True
    reading_entity = False
    reading_mention = False
    reading_special_case = False
    entity_buffer = ""
    mention_buffer = ""
    for index, letter in enumerate(clean_text):
        if letter == '[':
            open_read += 1
        elif letter == ']':
            open_read -= 1
        elif letter == '|':
            if reading_text:
                final_text += letter
            # switch from reading entity to mention in the [[entity|mention]] pattern
            elif reading_entity:
                reading_text = False
                reading_entity = False
                reading_mention = True
            else:
                reading_special_case = True
        else:
            if reading_entity:
                entity_buffer += letter
            elif reading_mention:
                mention_buffer += letter
            elif reading_text:
                final_text += letter
            else:
                raise ValueError("Not sure at point", clean_text[index-2:index+2])

        if open_read > 2:
            reading_special_case = True

        if open_read == 2 and reading_text:
            reading_text = False
            reading_entity = True
            reading_mention = False

        # we just finished reading an entity
        if open_read == 0 and not reading_text:
            if '#' in entity_buffer or entity_buffer.startswith(':'):
                reading_special_case = True
            # Ignore cases with nested structures like File: handles etc
            if not reading_special_case:
                if not mention_buffer:
                    mention_buffer = entity_buffer
                start = len(final_text)
                end = start + len(mention_buffer)
                qid = wp_to_id.get(entity_buffer, None)
                if qid:
                    _write_training_entity(outputfile=entityfile,
                                           article_id=article_id,
                                           alias=mention_buffer,
                                           entity=qid,
                                           start=start,
                                           end=end)
                found_entities = True
                final_text += mention_buffer

            entity_buffer = ""
            mention_buffer = ""

            reading_text = True
            reading_entity = False
            reading_mention = False
            reading_special_case = False

    if found_entities:
        _write_training_article(article_id=article_id, clean_text=final_text, training_output=training_output)


info_regex = re.compile(r'{[^{]*?}')
htlm_regex = re.compile(r'<!--[^-]*-->')
category_regex = re.compile(r'\[\[Category:[^\[]*]]')
file_regex = re.compile(r'\[\[File:[^[\]]+]]')
ref_regex = re.compile(r'<ref.*?>')     # non-greedy
ref_2_regex = re.compile(r'</ref.*?>')  # non-greedy


def _get_clean_wp_text(article_text):
    clean_text = article_text.strip()

    # remove bolding & italic markup
    clean_text = clean_text.replace('\'\'\'', '')
    clean_text = clean_text.replace('\'\'', '')

    # remove nested {{info}} statements by removing the inner/smallest ones first and iterating
    try_again = True
    previous_length = len(clean_text)
    while try_again:
        clean_text = info_regex.sub('', clean_text)  # non-greedy match excluding a nested {
        if len(clean_text) < previous_length:
            try_again = True
        else:
            try_again = False
        previous_length = len(clean_text)

    # remove HTML comments
    clean_text = htlm_regex.sub('', clean_text)

    # remove Category and File statements
    clean_text = category_regex.sub('', clean_text)
    clean_text = file_regex.sub('', clean_text)

    # remove multiple =
    while '==' in clean_text:
        clean_text = clean_text.replace("==", "=")

    clean_text = clean_text.replace(". =", ".")
    clean_text = clean_text.replace(" = ", ". ")
    clean_text = clean_text.replace("= ", ".")
    clean_text = clean_text.replace(" =", "")

    # remove refs (non-greedy match)
    clean_text = ref_regex.sub('', clean_text)
    clean_text = ref_2_regex.sub('', clean_text)

    # remove additional wikiformatting
    clean_text = re.sub(r'&lt;blockquote&gt;', '', clean_text)
    clean_text = re.sub(r'&lt;/blockquote&gt;', '', clean_text)

    # change special characters back to normal ones
    clean_text = clean_text.replace(r'&lt;', '<')
    clean_text = clean_text.replace(r'&gt;', '>')
    clean_text = clean_text.replace(r'&quot;', '"')
    clean_text = clean_text.replace(r'&amp;nbsp;', ' ')
    clean_text = clean_text.replace(r'&amp;', '&')

    # remove multiple spaces
    while '  ' in clean_text:
        clean_text = clean_text.replace('  ', ' ')

    return clean_text.strip()


def _write_training_article(article_id, clean_text, training_output):
    file_loc = training_output / (str(article_id) + ".txt")
    with open(file_loc, mode='w', encoding='utf8') as outputfile:
        outputfile.write(clean_text)


def _write_training_entity(outputfile, article_id, alias, entity, start, end):
    outputfile.write(article_id + "|" + alias + "|" + entity + "|" + str(start) + "|" + str(end) + "\n")


def is_dev(article_id):
    return article_id.endswith("3")


def read_training(nlp, training_dir, dev, limit):
    # This method provides training examples that correspond to the entity annotations found by the nlp object
    entityfile_loc = training_dir / ENTITY_FILE
    data = []

    # assume the data is written sequentially, so we can reuse the article docs
    current_article_id = None
    current_doc = None
    ents_by_offset = dict()
    skip_articles = set()
    total_entities = 0

    with open(entityfile_loc, mode='r', encoding='utf8') as file:
        for line in file:
            if not limit or len(data) < limit:
                fields = line.replace('\n', "").split(sep='|')
                article_id = fields[0]
                alias = fields[1]
                wp_title = fields[2]
                start = fields[3]
                end = fields[4]

                if dev == is_dev(article_id) and article_id != "article_id" and article_id not in skip_articles:
                    if not current_doc or (current_article_id != article_id):
                        # parse the new article text
                        file_name = article_id + ".txt"
                        try:
                            with open(os.path.join(training_dir, file_name), mode="r", encoding='utf8') as f:
                                text = f.read()
                                if len(text) < 30000:   # threshold for convenience / speed of processing
                                    current_doc = nlp(text)
                                    current_article_id = article_id
                                    ents_by_offset = dict()
                                    for ent in current_doc.ents:
                                        sent_length = len(ent.sent)
                                        # custom filtering to avoid too long or too short sentences
                                        if 5 < sent_length < 100:
                                            ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
                                else:
                                    skip_articles.add(article_id)
                                    current_doc = None
                        except Exception as e:
                            print("Problem parsing article", article_id, e)
                            skip_articles.add(article_id)
                            raise e

                    # repeat checking this condition in case an exception was thrown
                    if current_doc and (current_article_id == article_id):
                        found_ent = ents_by_offset.get(start + "_" + end, None)
                        if found_ent:
                            if found_ent.text != alias:
                                skip_articles.add(article_id)
                                current_doc = None
                            else:
                                sent = found_ent.sent.as_doc()
                                # currently feeding the gold data one entity per sentence at a time
                                gold_start = int(start) - found_ent.sent.start_char
                                gold_end = int(end) - found_ent.sent.start_char
                                gold_entities = [(gold_start, gold_end, wp_title)]
                                gold = GoldParse(doc=sent, links=gold_entities)
                                data.append((sent, gold))
                                total_entities += 1
                                if len(data) % 2500 == 0:
                                    print(" -read", total_entities, "entities")

    print(" -read", total_entities, "entities")
    return data
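Putting the two halves of this module together, a plausible workflow first writes the standoff annotation file plus one text file per article, and then reads everything back as (sentence, GoldParse) pairs. The sketch below is illustrative only; the dump filename, directory layout and limits are assumptions, not something prescribed by the code above.

# Hypothetical end-to-end use of training_set_creator; all paths are placeholders.
from pathlib import Path

import spacy
from bin.wiki_entity_linking import training_set_creator

training_dir = Path("el_training_data")   # will hold gold_entities.csv plus <article_id>.txt files
training_dir.mkdir(exist_ok=True)

# Step 1: parse the Wikipedia dump and write the standoff annotations.
training_set_creator.create_training(
    wikipedia_input="enwiki-latest-pages-articles.xml.bz2",  # Wikipedia XML dump
    entity_def_input="entity_defs.csv",                      # written earlier by kb_creator
    training_output=training_dir,
)

# Step 2: read the annotations back as (sentence Doc, GoldParse) pairs.
nlp = spacy.load("en_core_web_lg")
train_data = training_set_creator.read_training(nlp, training_dir, dev=False, limit=1000)
dev_data = training_set_creator.read_training(nlp, training_dir, dev=True, limit=200)
print(len(train_data), "training examples,", len(dev_data), "dev examples")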
							
								
								
									
119 bin/wiki_entity_linking/wikidata_processor.py Normal file
@@ -0,0 +1,119 @@
# coding: utf-8
from __future__ import unicode_literals

import bz2
import json
import datetime


def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False):
    # Read the JSON wiki data and parse out the entities. Takes about 7.5 hours to parse 55M lines.
    # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/

    lang = 'en'
    site_filter = 'enwiki'

    # properties filter (currently disabled to get ALL data)
    prop_filter = dict()
    # prop_filter = {'P31': {'Q5', 'Q15632617'}}     # currently defined as OR: one property suffices to be selected

    title_to_id = dict()
    id_to_descr = dict()

    # parse appropriate fields - depending on what we need in the KB
    parse_properties = False
    parse_sitelinks = True
    parse_labels = False
    parse_descriptions = True
    parse_aliases = False
    parse_claims = False

    with bz2.open(wikidata_file, mode='rb') as file:
        line = file.readline()
        cnt = 0
        while line and (not limit or cnt < limit):
            if cnt % 500000 == 0:
                print(datetime.datetime.now(), "processed", cnt, "lines of WikiData dump")
            clean_line = line.strip()
            if clean_line.endswith(b","):
                clean_line = clean_line[:-1]
            if len(clean_line) > 1:
                obj = json.loads(clean_line)
                entry_type = obj["type"]

                if entry_type == "item":
                    # filtering records on their properties (currently disabled to get ALL data)
                    # keep = False
                    keep = True

                    claims = obj["claims"]
                    if parse_claims:
                        for prop, value_set in prop_filter.items():
                            claim_property = claims.get(prop, None)
                            if claim_property:
                                for cp in claim_property:
                                    cp_id = cp['mainsnak'].get('datavalue', {}).get('value', {}).get('id')
                                    cp_rank = cp['rank']
                                    if cp_rank != "deprecated" and cp_id in value_set:
                                        keep = True

                    if keep:
                        unique_id = obj["id"]

                        if to_print:
                            print("ID:", unique_id)
                            print("type:", entry_type)

                        # parsing all properties that refer to other entities
                        if parse_properties:
                            for prop, claim_property in claims.items():
                                cp_dicts = [cp['mainsnak']['datavalue'].get('value') for cp in claim_property
                                            if cp['mainsnak'].get('datavalue')]
                                cp_values = [cp_dict.get('id') for cp_dict in cp_dicts if isinstance(cp_dict, dict)
                                             if cp_dict.get('id') is not None]
                                if cp_values:
                                    if to_print:
                                        print("prop:", prop, cp_values)

                        found_link = False
                        if parse_sitelinks:
                            site_value = obj["sitelinks"].get(site_filter, None)
                            if site_value:
                                site = site_value['title']
                                if to_print:
                                    print(site_filter, ":", site)
                                title_to_id[site] = unique_id
                                found_link = True

                        if parse_labels:
                            labels = obj["labels"]
                            if labels:
                                lang_label = labels.get(lang, None)
                                if lang_label:
                                    if to_print:
                                        print("label (" + lang + "):", lang_label["value"])

                        if found_link and parse_descriptions:
                            descriptions = obj["descriptions"]
                            if descriptions:
                                lang_descr = descriptions.get(lang, None)
                                if lang_descr:
                                    if to_print:
                                        print("description (" + lang + "):", lang_descr["value"])
                                    id_to_descr[unique_id] = lang_descr["value"]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                        if parse_aliases:
 | 
				
			||||||
 | 
					                            aliases = obj["aliases"]
 | 
				
			||||||
 | 
					                            if aliases:
 | 
				
			||||||
 | 
					                                lang_aliases = aliases.get(lang, None)
 | 
				
			||||||
 | 
					                                if lang_aliases:
 | 
				
			||||||
 | 
					                                    for item in lang_aliases:
 | 
				
			||||||
 | 
					                                        if to_print:
 | 
				
			||||||
 | 
					                                            print("alias (" + lang + "):", item["value"])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                        if to_print:
 | 
				
			||||||
 | 
					                            print()
 | 
				
			||||||
 | 
					            line = file.readline()
 | 
				
			||||||
 | 
					            cnt += 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    return title_to_id, id_to_descr
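A minimal smoke-test of this parser could look as follows; the sketch below is illustrative only (the dump file name follows the comment above, the path and the small limit are assumptions to keep the run short):

# Illustrative sketch only: parse the first 100,000 lines of the dump
# referenced in the comment above and report what was collected.
# The path is an assumption; point it at wherever the dump was downloaded.
if __name__ == "__main__":
    title_to_id, id_to_descr = read_wikidata_entities_json("latest-all.json.bz2", limit=100000)
    print("Mapped", len(title_to_id), "enwiki titles to Wikidata IDs")
    print("Collected", len(id_to_descr), "English entity descriptions")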
 | 
				
182  bin/wiki_entity_linking/wikipedia_processor.py  (Normal file)
						 | 
					@ -0,0 +1,182 @@
 | 
				
			||||||
 | 
					# coding: utf-8
 | 
				
			||||||
 | 
					from __future__ import unicode_literals
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					import re
 | 
				
			||||||
 | 
					import bz2
 | 
				
			||||||
 | 
					import csv
 | 
				
			||||||
 | 
					import datetime
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					"""
 | 
				
			||||||
 | 
					Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions.
 | 
				
			||||||
 | 
					Write these results to file for downstream KB and training data generation.
 | 
				
			||||||
 | 
					"""
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					map_alias_to_link = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# these will/should be matched ignoring case
 | 
				
			||||||
 | 
					wiki_namespaces = ["b", "betawikiversity", "Book", "c", "Category", "Commons",
 | 
				
			||||||
 | 
					                   "d", "dbdump", "download", "Draft", "Education", "Foundation",
 | 
				
			||||||
 | 
					                   "Gadget", "Gadget definition", "gerrit", "File", "Help", "Image", "Incubator",
 | 
				
			||||||
 | 
					                   "m", "mail", "mailarchive", "media", "MediaWiki", "MediaWiki talk", "Mediawikiwiki",
 | 
				
			||||||
 | 
					                   "MediaZilla", "Meta", "Metawikipedia", "Module",
 | 
				
			||||||
 | 
					                   "mw", "n", "nost", "oldwikisource", "outreach", "outreachwiki", "otrs", "OTRSwiki",
 | 
				
			||||||
 | 
					                   "Portal", "phab", "Phabricator", "Project", "q", "quality", "rev",
 | 
				
			||||||
 | 
					                   "s", "spcom", "Special", "species", "Strategy", "sulutil", "svn",
 | 
				
			||||||
 | 
					                   "Talk", "Template", "Template talk", "Testwiki", "ticket", "TimedText", "Toollabs", "tools",
 | 
				
			||||||
 | 
					                   "tswiki", "User", "User talk", "v", "voy",
 | 
				
			||||||
 | 
					                   "w", "Wikibooks", "Wikidata", "wikiHow", "Wikinvest", "wikilivres", "Wikimedia", "Wikinews",
 | 
				
			||||||
 | 
					                   "Wikipedia", "Wikipedia talk", "Wikiquote", "Wikisource", "Wikispecies", "Wikitech",
 | 
				
			||||||
 | 
					                   "Wikiversity", "Wikivoyage", "wikt", "wiktionary", "wmf", "wmania", "WP"]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# find the links
 | 
				
			||||||
 | 
					link_regex = re.compile(r'\[\[[^\[\]]*\]\]')
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# match on interwiki links, e.g. `en:` or `:fr:`
 | 
				
			||||||
 | 
					ns_regex = r":?" + "[a-z][a-z]" + ":"
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# match on Namespace: optionally preceded by a :
 | 
				
			||||||
 | 
					for ns in wiki_namespaces:
 | 
				
			||||||
 | 
					    ns_regex += "|" + ":?" + ns + ":"
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					ns_regex = re.compile(ns_regex, re.IGNORECASE)
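For illustration (these spot checks are not part of the original file), the compiled pattern catches interwiki and namespace prefixes while leaving ordinary article links alone:

# Hypothetical spot checks of the namespace filter:
# ns_regex.match("en:Main Page")      -> match  (two-letter interwiki prefix)
# ns_regex.match(":fr:Paris")         -> match  (optional leading colon)
# ns_regex.match("Category:Physics")  -> match  (listed namespace, case-insensitive)
# ns_regex.match("Douglas Adams")     -> None   (regular article link, kept)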
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def read_wikipedia_prior_probs(wikipedia_input, prior_prob_output):
 | 
				
			||||||
 | 
					    """
 | 
				
			||||||
 | 
					    Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
 | 
				
			||||||
 | 
					    The full file takes about 2 hours to parse 1100M lines.
					    It runs relatively fast because it processes the dump line by line, regardless of which article an intra-wiki link comes from.
 | 
				
			||||||
 | 
					    """
 | 
				
			||||||
 | 
					    with bz2.open(wikipedia_input, mode='rb') as file:
 | 
				
			||||||
 | 
					        line = file.readline()
 | 
				
			||||||
 | 
					        cnt = 0
 | 
				
			||||||
 | 
					        while line:
 | 
				
			||||||
 | 
					            if cnt % 5000000 == 0:
 | 
				
			||||||
 | 
					                print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
 | 
				
			||||||
 | 
					            clean_line = line.strip().decode("utf-8")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            aliases, entities, normalizations = get_wp_links(clean_line)
 | 
				
			||||||
 | 
					            for alias, entity, norm in zip(aliases, entities, normalizations):
 | 
				
			||||||
 | 
					                _store_alias(alias, entity, normalize_alias=norm, normalize_entity=True)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            line = file.readline()
 | 
				
			||||||
 | 
					            cnt += 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # write all aliases and their entities and count occurrences to file
 | 
				
			||||||
 | 
					    with open(prior_prob_output, mode='w', encoding='utf8') as outputfile:
 | 
				
			||||||
 | 
					        outputfile.write("alias" + "|" + "count" + "|" + "entity" + "\n")
 | 
				
			||||||
 | 
					        for alias, alias_dict in sorted(map_alias_to_link.items(), key=lambda x: x[0]):
 | 
				
			||||||
 | 
					            for entity, count in sorted(alias_dict.items(), key=lambda x: x[1], reverse=True):
 | 
				
			||||||
 | 
					                outputfile.write(alias + "|" + str(count) + "|" + entity + "\n")
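Each row written above holds the raw co-occurrence count of one (alias, entity) pair; a prior probability P(entity | alias) can be estimated downstream by normalising those counts per alias. A minimal sketch of that normalisation (illustrative only, reusing the same file layout; the helper name is hypothetical):

def _prior_probs_from_file(prior_prob_input):
    # Illustrative sketch: turn the alias|count|entity rows into P(entity | alias) estimates.
    counts = dict()
    with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
        prior_file.readline()  # skip the "alias|count|entity" header
        for line in prior_file:
            alias, count, entity = line.rstrip('\n').split(sep='|')
            alias_dict = counts.setdefault(alias, dict())
            alias_dict[entity] = alias_dict.get(entity, 0) + int(count)
    priors = dict()
    for alias, alias_dict in counts.items():
        total = float(sum(alias_dict.values()))
        priors[alias] = {entity: c / total for entity, c in alias_dict.items()}
    return priors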
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def _store_alias(alias, entity, normalize_alias=False, normalize_entity=True):
 | 
				
			||||||
 | 
					    alias = alias.strip()
 | 
				
			||||||
 | 
					    entity = entity.strip()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # remove everything after "#" as this is not part of the title but refers to a section anchor within the page
 | 
				
			||||||
 | 
					    if normalize_entity:
 | 
				
			||||||
 | 
					        # wikipedia titles are always capitalized
 | 
				
			||||||
 | 
					        entity = _capitalize_first(entity.split("#")[0])
 | 
				
			||||||
 | 
					    if normalize_alias:
 | 
				
			||||||
 | 
					        alias = alias.split("#")[0]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    if alias and entity:
 | 
				
			||||||
 | 
					        alias_dict = map_alias_to_link.get(alias, dict())
 | 
				
			||||||
 | 
					        entity_count = alias_dict.get(entity, 0)
 | 
				
			||||||
 | 
					        alias_dict[entity] = entity_count + 1
 | 
				
			||||||
 | 
					        map_alias_to_link[alias] = alias_dict
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def get_wp_links(text):
 | 
				
			||||||
 | 
					    aliases = []
 | 
				
			||||||
 | 
					    entities = []
 | 
				
			||||||
 | 
					    normalizations = []
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    matches = link_regex.findall(text)
 | 
				
			||||||
 | 
					    for match in matches:
 | 
				
			||||||
 | 
					        match = match[2:][:-2].replace("_", " ").strip()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        if ns_regex.match(match):
 | 
				
			||||||
 | 
					            pass  # ignore namespaces at the beginning of the string
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # this is a simple [[link]], with the alias the same as the mention
 | 
				
			||||||
 | 
					        elif "|" not in match:
 | 
				
			||||||
 | 
					            aliases.append(match)
 | 
				
			||||||
 | 
					            entities.append(match)
 | 
				
			||||||
 | 
					            normalizations.append(True)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # in wiki format, the link is written as [[entity|alias]]
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            splits = match.split("|")
 | 
				
			||||||
 | 
					            entity = splits[0].strip()
 | 
				
			||||||
 | 
					            alias = splits[1].strip()
 | 
				
			||||||
 | 
					            # specific wiki format  [[alias (specification)|]]
 | 
				
			||||||
 | 
					            if len(alias) == 0 and "(" in entity:
 | 
				
			||||||
 | 
					                alias = entity.split("(")[0]
 | 
				
			||||||
 | 
					                aliases.append(alias)
 | 
				
			||||||
 | 
					                entities.append(entity)
 | 
				
			||||||
 | 
					                normalizations.append(False)
 | 
				
			||||||
 | 
					            else:
 | 
				
			||||||
 | 
					                aliases.append(alias)
 | 
				
			||||||
 | 
					                entities.append(entity)
 | 
				
			||||||
 | 
					                normalizations.append(False)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    return aliases, entities, normalizations
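An illustrative call (not part of the original file) showing how the three parallel lists line up for the link styles handled above:

# Hypothetical example covering a plain [[link]], a piped [[entity|alias]] link
# and the [[Title (qualifier)|]] shorthand:
sample = "He wrote [[The Hitchhiker's Guide to the Galaxy]] as [[Douglas Adams|Adams]] in [[Cambridge (city)|]]."
aliases, entities, normalizations = get_wp_links(sample)
# aliases        -> ["The Hitchhiker's Guide to the Galaxy", "Adams", "Cambridge "]
#                   (the trailing space is stripped later by _store_alias)
# entities       -> ["The Hitchhiker's Guide to the Galaxy", "Douglas Adams", "Cambridge (city)"]
# normalizations -> [True, False, False]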
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def _capitalize_first(text):
 | 
				
			||||||
 | 
					    if not text:
 | 
				
			||||||
 | 
					        return None
 | 
				
			||||||
 | 
					    # text is guaranteed non-empty here, so capitalize the first character and keep the rest
					    return text[0].capitalize() + text[1:]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def write_entity_counts(prior_prob_input, count_output, to_print=False):
 | 
				
			||||||
 | 
					    # Write entity counts for quick access later
 | 
				
			||||||
 | 
					    entity_to_count = dict()
 | 
				
			||||||
 | 
					    total_count = 0
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
 | 
				
			||||||
 | 
					        # skip header
 | 
				
			||||||
 | 
					        prior_file.readline()
 | 
				
			||||||
 | 
					        line = prior_file.readline()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        while line:
 | 
				
			||||||
 | 
					            splits = line.replace('\n', "").split(sep='|')
 | 
				
			||||||
 | 
					            # alias = splits[0]
 | 
				
			||||||
 | 
					            count = int(splits[1])
 | 
				
			||||||
 | 
					            entity = splits[2]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            current_count = entity_to_count.get(entity, 0)
 | 
				
			||||||
 | 
					            entity_to_count[entity] = current_count + count
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            total_count += count
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            line = prior_file.readline()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    with open(count_output, mode='w', encoding='utf8') as entity_file:
 | 
				
			||||||
 | 
					        entity_file.write("entity" + "|" + "count" + "\n")
 | 
				
			||||||
 | 
					        for entity, count in entity_to_count.items():
 | 
				
			||||||
 | 
					            entity_file.write(entity + "|" + str(count) + "\n")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    if to_print:
 | 
				
			||||||
 | 
					        for entity, count in entity_to_count.items():
 | 
				
			||||||
 | 
					            print("Entity count:", entity, count)
 | 
				
			||||||
 | 
					        print("Total count:", total_count)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def get_all_frequencies(count_input):
 | 
				
			||||||
 | 
					    entity_to_count = dict()
 | 
				
			||||||
 | 
					    with open(count_input, 'r', encoding='utf8') as csvfile:
 | 
				
			||||||
 | 
					        csvreader = csv.reader(csvfile, delimiter='|')
 | 
				
			||||||
 | 
					        # skip header
 | 
				
			||||||
 | 
					        next(csvreader)
 | 
				
			||||||
 | 
					        for row in csvreader:
 | 
				
			||||||
 | 
					            entity_to_count[row[0]] = int(row[1])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    return entity_to_count
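A small round-trip sketch tying the two helpers above together (file names are placeholders):

# Illustrative only: aggregate per-entity counts from the prior-probability file,
# then read them back as a frequency lookup.
write_entity_counts('prior_prob.csv', 'entity_freq.csv', to_print=False)
entity_freq = get_all_frequencies('entity_freq.csv')
print("'Douglas Adams' was seen", entity_freq.get('Douglas Adams', 0), "times as a link target")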
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					@ -51,7 +51,6 @@ def filter_spans(spans):
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def extract_currency_relations(doc):
 | 
					def extract_currency_relations(doc):
 | 
				
			||||||
    # Merge entities and noun chunks into one token
 | 
					    # Merge entities and noun chunks into one token
 | 
				
			||||||
    seen_tokens = set()
 | 
					 | 
				
			||||||
    spans = list(doc.ents) + list(doc.noun_chunks)
 | 
					    spans = list(doc.ents) + list(doc.noun_chunks)
 | 
				
			||||||
    spans = filter_spans(spans)
 | 
					    spans = filter_spans(spans)
 | 
				
			||||||
    with doc.retokenize() as retokenizer:
 | 
					    with doc.retokenize() as retokenizer:
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -9,26 +9,26 @@ from spacy.kb import KnowledgeBase
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def create_kb(vocab):
 | 
					def create_kb(vocab):
 | 
				
			||||||
    kb = KnowledgeBase(vocab=vocab)
 | 
					    kb = KnowledgeBase(vocab=vocab, entity_vector_length=1)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    # adding entities
 | 
					    # adding entities
 | 
				
			||||||
    entity_0 = "Q1004791_Douglas"
 | 
					    entity_0 = "Q1004791_Douglas"
 | 
				
			||||||
    print("adding entity", entity_0)
 | 
					    print("adding entity", entity_0)
 | 
				
			||||||
    kb.add_entity(entity=entity_0, prob=0.5)
 | 
					    kb.add_entity(entity=entity_0, prob=0.5, entity_vector=[0])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    entity_1 = "Q42_Douglas_Adams"
 | 
					    entity_1 = "Q42_Douglas_Adams"
 | 
				
			||||||
    print("adding entity", entity_1)
 | 
					    print("adding entity", entity_1)
 | 
				
			||||||
    kb.add_entity(entity=entity_1, prob=0.5)
 | 
					    kb.add_entity(entity=entity_1, prob=0.5, entity_vector=[1])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    entity_2 = "Q5301561_Douglas_Haig"
 | 
					    entity_2 = "Q5301561_Douglas_Haig"
 | 
				
			||||||
    print("adding entity", entity_2)
 | 
					    print("adding entity", entity_2)
 | 
				
			||||||
    kb.add_entity(entity=entity_2, prob=0.5)
 | 
					    kb.add_entity(entity=entity_2, prob=0.5, entity_vector=[2])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    # adding aliases
 | 
					    # adding aliases
 | 
				
			||||||
    print()
 | 
					    print()
 | 
				
			||||||
    alias_0 = "Douglas"
 | 
					    alias_0 = "Douglas"
 | 
				
			||||||
    print("adding alias", alias_0)
 | 
					    print("adding alias", alias_0)
 | 
				
			||||||
    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2])
 | 
					    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.6, 0.1, 0.2])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    alias_1 = "Douglas Adams"
 | 
					    alias_1 = "Douglas Adams"
 | 
				
			||||||
    print("adding alias", alias_1)
 | 
					    print("adding alias", alias_1)
 | 
				
			||||||
| 
						 | 
					@ -41,8 +41,12 @@ def create_kb(vocab):
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def add_el(kb, nlp):
 | 
					def add_el(kb, nlp):
 | 
				
			||||||
    el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb})
 | 
					    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
 | 
				
			||||||
 | 
					    el_pipe.set_kb(kb)
 | 
				
			||||||
    nlp.add_pipe(el_pipe, last=True)
 | 
					    nlp.add_pipe(el_pipe, last=True)
 | 
				
			||||||
 | 
					    nlp.begin_training()
 | 
				
			||||||
 | 
					    el_pipe.context_weight = 0
 | 
				
			||||||
 | 
					    el_pipe.prior_weight = 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    for alias in ["Douglas Adams", "Douglas"]:
 | 
					    for alias in ["Douglas Adams", "Douglas"]:
 | 
				
			||||||
        candidates = nlp.linker.kb.get_candidates(alias)
 | 
					        candidates = nlp.linker.kb.get_candidates(alias)
 | 
				
			||||||
| 
						 | 
					@ -66,6 +70,6 @@ def add_el(kb, nlp):
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
if __name__ == "__main__":
 | 
					if __name__ == "__main__":
 | 
				
			||||||
    nlp = spacy.load('en_core_web_sm')
 | 
					    my_nlp = spacy.load('en_core_web_sm')
 | 
				
			||||||
    my_kb = create_kb(nlp.vocab)
 | 
					    my_kb = create_kb(my_nlp.vocab)
 | 
				
			||||||
    add_el(my_kb, nlp)
 | 
					    add_el(my_kb, my_nlp)
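As a quick illustration of what this updated example now sets up (a sketch using only calls already shown in this file): the toy KB stores one-dimensional entity vectors plus the priors passed to add_alias, and those priors can be inspected directly via get_candidates:

# Illustrative sketch, reusing the functions defined above:
nlp = spacy.load('en_core_web_sm')
kb = create_kb(nlp.vocab)
for c in kb.get_candidates("Douglas"):
    print(c.alias_, "->", c.entity_, "with prior", c.prior_prob)
# expected to report Q1004791_Douglas (0.6), Q42_Douglas_Adams (0.1)
# and Q5301561_Douglas_Haig (0.2), as passed to add_alias above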
 | 
				
			||||||
| 
						 | 
					
 | 
				
442  examples/pipeline/wikidata_entity_linking.py  (Normal file)
						 | 
					@ -0,0 +1,442 @@
 | 
				
			||||||
 | 
					# coding: utf-8
 | 
				
			||||||
 | 
					from __future__ import unicode_literals
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					import random
 | 
				
			||||||
 | 
					import datetime
 | 
				
			||||||
 | 
					from pathlib import Path
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					from bin.wiki_entity_linking import training_set_creator, kb_creator, wikipedia_processor as wp
 | 
				
			||||||
 | 
					from bin.wiki_entity_linking.kb_creator import DESC_WIDTH
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					import spacy
 | 
				
			||||||
 | 
					from spacy.kb import KnowledgeBase
 | 
				
			||||||
 | 
					from spacy.util import minibatch, compounding
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					"""
 | 
				
			||||||
 | 
					Demonstrate how to build a knowledge base from WikiData and run an Entity Linking algorithm.
 | 
				
			||||||
 | 
					"""
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					ROOT_DIR = Path("C:/Users/Sofie/Documents/data/")
 | 
				
			||||||
 | 
					OUTPUT_DIR = ROOT_DIR / 'wikipedia'
 | 
				
			||||||
 | 
					TRAINING_DIR = OUTPUT_DIR / 'training_data_nel'
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					PRIOR_PROB = OUTPUT_DIR / 'prior_prob.csv'
 | 
				
			||||||
 | 
					ENTITY_COUNTS = OUTPUT_DIR / 'entity_freq.csv'
 | 
				
			||||||
 | 
					ENTITY_DEFS = OUTPUT_DIR / 'entity_defs.csv'
 | 
				
			||||||
 | 
					ENTITY_DESCR = OUTPUT_DIR / 'entity_descriptions.csv'
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					KB_FILE = OUTPUT_DIR / 'kb_1' / 'kb'
 | 
				
			||||||
 | 
					NLP_1_DIR = OUTPUT_DIR / 'nlp_1'
 | 
				
			||||||
 | 
					NLP_2_DIR = OUTPUT_DIR / 'nlp_2'
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
 | 
				
			||||||
 | 
					WIKIDATA_JSON = ROOT_DIR / 'wikidata' / 'wikidata-20190304-all.json.bz2'
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# get enwiki-latest-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/enwiki/latest/
 | 
				
			||||||
 | 
					ENWIKI_DUMP = ROOT_DIR / 'wikipedia' / 'enwiki-20190320-pages-articles-multistream.xml.bz2'
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# KB construction parameters
 | 
				
			||||||
 | 
					MAX_CANDIDATES = 10
 | 
				
			||||||
 | 
					MIN_ENTITY_FREQ = 20
 | 
				
			||||||
 | 
					MIN_PAIR_OCC = 5
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# model training parameters
 | 
				
			||||||
 | 
					EPOCHS = 10
 | 
				
			||||||
 | 
					DROPOUT = 0.5
 | 
				
			||||||
 | 
					LEARN_RATE = 0.005
 | 
				
			||||||
 | 
					L2 = 1e-6
 | 
				
			||||||
 | 
					CONTEXT_WIDTH = 128
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def run_pipeline():
 | 
				
			||||||
 | 
					    # set the appropriate booleans to define which parts of the pipeline should be (re)run
 | 
				
			||||||
 | 
					    print("START", datetime.datetime.now())
 | 
				
			||||||
 | 
					    print()
 | 
				
			||||||
 | 
					    nlp_1 = spacy.load('en_core_web_lg')
 | 
				
			||||||
 | 
					    nlp_2 = None
 | 
				
			||||||
 | 
					    kb_2 = None
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # one-time methods to create KB and write to file
 | 
				
			||||||
 | 
					    to_create_prior_probs = False
 | 
				
			||||||
 | 
					    to_create_entity_counts = False
 | 
				
			||||||
 | 
					    to_create_kb = False
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # read KB back in from file
 | 
				
			||||||
 | 
					    to_read_kb = True
 | 
				
			||||||
 | 
					    to_test_kb = False
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # create training dataset
 | 
				
			||||||
 | 
					    create_wp_training = False
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # train the EL pipe
 | 
				
			||||||
 | 
					    train_pipe = True
 | 
				
			||||||
 | 
					    measure_performance = True
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # test the EL pipe on a simple example
 | 
				
			||||||
 | 
					    to_test_pipeline = True
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # write the NLP object, read back in and test again
 | 
				
			||||||
 | 
					    to_write_nlp = True
 | 
				
			||||||
 | 
					    to_read_nlp = True
 | 
				
			||||||
 | 
					    test_from_file = False
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 1 : create prior probabilities from WP (run only once)
 | 
				
			||||||
 | 
					    if to_create_prior_probs:
 | 
				
			||||||
 | 
					        print("STEP 1: to_create_prior_probs", datetime.datetime.now())
 | 
				
			||||||
 | 
					        wp.read_wikipedia_prior_probs(wikipedia_input=ENWIKI_DUMP, prior_prob_output=PRIOR_PROB)
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 2 : deduce entity frequencies from WP (run only once)
 | 
				
			||||||
 | 
					    if to_create_entity_counts:
 | 
				
			||||||
 | 
					        print("STEP 2: to_create_entity_counts", datetime.datetime.now())
 | 
				
			||||||
 | 
					        wp.write_entity_counts(prior_prob_input=PRIOR_PROB, count_output=ENTITY_COUNTS, to_print=False)
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 3 : create KB and write to file (run only once)
 | 
				
			||||||
 | 
					    if to_create_kb:
 | 
				
			||||||
 | 
					        print("STEP 3a: to_create_kb", datetime.datetime.now())
 | 
				
			||||||
 | 
					        kb_1 = kb_creator.create_kb(nlp_1,
 | 
				
			||||||
 | 
					                                    max_entities_per_alias=MAX_CANDIDATES,
 | 
				
			||||||
 | 
					                                    min_entity_freq=MIN_ENTITY_FREQ,
 | 
				
			||||||
 | 
					                                    min_occ=MIN_PAIR_OCC,
 | 
				
			||||||
 | 
					                                    entity_def_output=ENTITY_DEFS,
 | 
				
			||||||
 | 
					                                    entity_descr_output=ENTITY_DESCR,
 | 
				
			||||||
 | 
					                                    count_input=ENTITY_COUNTS,
 | 
				
			||||||
 | 
					                                    prior_prob_input=PRIOR_PROB,
 | 
				
			||||||
 | 
					                                    wikidata_input=WIKIDATA_JSON)
 | 
				
			||||||
 | 
					        print("kb entities:", kb_1.get_size_entities())
 | 
				
			||||||
 | 
					        print("kb aliases:", kb_1.get_size_aliases())
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("STEP 3b: write KB and NLP", datetime.datetime.now())
 | 
				
			||||||
 | 
					        kb_1.dump(KB_FILE)
 | 
				
			||||||
 | 
					        nlp_1.to_disk(NLP_1_DIR)
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 4 : read KB back in from file
 | 
				
			||||||
 | 
					    if to_read_kb:
 | 
				
			||||||
 | 
					        print("STEP 4: to_read_kb", datetime.datetime.now())
 | 
				
			||||||
 | 
					        nlp_2 = spacy.load(NLP_1_DIR)
 | 
				
			||||||
 | 
					        kb_2 = KnowledgeBase(vocab=nlp_2.vocab, entity_vector_length=DESC_WIDTH)
 | 
				
			||||||
 | 
					        kb_2.load_bulk(KB_FILE)
 | 
				
			||||||
 | 
					        print("kb entities:", kb_2.get_size_entities())
 | 
				
			||||||
 | 
					        print("kb aliases:", kb_2.get_size_aliases())
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # test KB
 | 
				
			||||||
 | 
					        if to_test_kb:
 | 
				
			||||||
 | 
					            check_kb(kb_2)
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 5: create a training dataset from WP
 | 
				
			||||||
 | 
					    if create_wp_training:
 | 
				
			||||||
 | 
					        print("STEP 5: create training dataset", datetime.datetime.now())
 | 
				
			||||||
 | 
					        training_set_creator.create_training(wikipedia_input=ENWIKI_DUMP,
 | 
				
			||||||
 | 
					                                             entity_def_input=ENTITY_DEFS,
 | 
				
			||||||
 | 
					                                             training_output=TRAINING_DIR)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # STEP 6: create and train the entity linking pipe
 | 
				
			||||||
 | 
					    if train_pipe:
 | 
				
			||||||
 | 
					        print("STEP 6: training Entity Linking pipe", datetime.datetime.now())
 | 
				
			||||||
 | 
					        type_to_int = {label: i for i, label in enumerate(nlp_2.entity.labels)}
 | 
				
			||||||
 | 
					        print(" -analysing", len(type_to_int), "different entity types")
 | 
				
			||||||
 | 
					        el_pipe = nlp_2.create_pipe(name='entity_linker',
 | 
				
			||||||
 | 
					                                    config={"context_width": CONTEXT_WIDTH,
 | 
				
			||||||
 | 
					                                            "pretrained_vectors": nlp_2.vocab.vectors.name,
 | 
				
			||||||
 | 
					                                            "type_to_int": type_to_int})
 | 
				
			||||||
 | 
					        el_pipe.set_kb(kb_2)
 | 
				
			||||||
 | 
					        nlp_2.add_pipe(el_pipe, last=True)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        other_pipes = [pipe for pipe in nlp_2.pipe_names if pipe != "entity_linker"]
 | 
				
			||||||
 | 
					        with nlp_2.disable_pipes(*other_pipes):  # only train Entity Linking
 | 
				
			||||||
 | 
					            optimizer = nlp_2.begin_training()
 | 
				
			||||||
 | 
					            optimizer.learn_rate = LEARN_RATE
 | 
				
			||||||
 | 
					            optimizer.L2 = L2
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # define the size (nr of entities) of training and dev set
 | 
				
			||||||
 | 
					        train_limit = 5000
 | 
				
			||||||
 | 
					        dev_limit = 5000
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        train_data = training_set_creator.read_training(nlp=nlp_2,
 | 
				
			||||||
 | 
					                                                        training_dir=TRAINING_DIR,
 | 
				
			||||||
 | 
					                                                        dev=False,
 | 
				
			||||||
 | 
					                                                        limit=train_limit)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("Training on", len(train_data), "articles")
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        dev_data = training_set_creator.read_training(nlp=nlp_2,
 | 
				
			||||||
 | 
					                                                      training_dir=TRAINING_DIR,
 | 
				
			||||||
 | 
					                                                      dev=True,
 | 
				
			||||||
 | 
					                                                      limit=dev_limit)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("Dev testing on", len(dev_data), "articles")
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        if not train_data:
 | 
				
			||||||
 | 
					            print("Did not find any training data")
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            for itn in range(EPOCHS):
 | 
				
			||||||
 | 
					                random.shuffle(train_data)
 | 
				
			||||||
 | 
					                losses = {}
 | 
				
			||||||
 | 
					                batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
 | 
				
			||||||
 | 
					                batchnr = 0
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                with nlp_2.disable_pipes(*other_pipes):
 | 
				
			||||||
 | 
					                    for batch in batches:
 | 
				
			||||||
 | 
					                        try:
 | 
				
			||||||
 | 
					                            docs, golds = zip(*batch)
 | 
				
			||||||
 | 
					                            nlp_2.update(
 | 
				
			||||||
 | 
					                                docs,
 | 
				
			||||||
 | 
					                                golds,
 | 
				
			||||||
 | 
					                                sgd=optimizer,
 | 
				
			||||||
 | 
					                                drop=DROPOUT,
 | 
				
			||||||
 | 
					                                losses=losses,
 | 
				
			||||||
 | 
					                            )
 | 
				
			||||||
 | 
					                            batchnr += 1
 | 
				
			||||||
 | 
					                        except Exception as e:
 | 
				
			||||||
 | 
					                            print("Error updating batch:", e)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                if batchnr > 0:
 | 
				
			||||||
 | 
					                    el_pipe.cfg["context_weight"] = 1
 | 
				
			||||||
 | 
					                    el_pipe.cfg["prior_weight"] = 1
 | 
				
			||||||
 | 
					                    dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
 | 
				
			||||||
 | 
					                    losses['entity_linker'] = losses['entity_linker'] / batchnr
 | 
				
			||||||
 | 
					                    print("Epoch, train loss", itn, round(losses['entity_linker'], 2),
 | 
				
			||||||
 | 
					                          " / dev acc avg", round(dev_acc_context, 3))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # STEP 7: measure the performance of our trained pipe on an independent dev set
 | 
				
			||||||
 | 
					        if len(dev_data) and measure_performance:
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					            print("STEP 7: performance measurement of Entity Linking pipe", datetime.datetime.now())
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            counts, acc_r, acc_r_label, acc_p, acc_p_label, acc_o, acc_o_label = _measure_baselines(dev_data, kb_2)
 | 
				
			||||||
 | 
					            print("dev counts:", sorted(counts.items(), key=lambda x: x[0]))
 | 
				
			||||||
 | 
					            print("dev acc oracle:", round(acc_o, 3), [(x, round(y, 3)) for x, y in acc_o_label.items()])
 | 
				
			||||||
 | 
					            print("dev acc random:", round(acc_r, 3), [(x, round(y, 3)) for x, y in acc_r_label.items()])
 | 
				
			||||||
 | 
					            print("dev acc prior:", round(acc_p, 3), [(x, round(y, 3)) for x, y in acc_p_label.items()])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            # using only context
 | 
				
			||||||
 | 
					            el_pipe.cfg["context_weight"] = 1
 | 
				
			||||||
 | 
					            el_pipe.cfg["prior_weight"] = 0
 | 
				
			||||||
 | 
					            dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
 | 
				
			||||||
 | 
					            print("dev acc context avg:", round(dev_acc_context, 3),
 | 
				
			||||||
 | 
					                  [(x, round(y, 3)) for x, y in dev_acc_context_dict.items()])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            # measuring combined accuracy (prior + context)
 | 
				
			||||||
 | 
					            el_pipe.cfg["context_weight"] = 1
 | 
				
			||||||
 | 
					            el_pipe.cfg["prior_weight"] = 1
 | 
				
			||||||
 | 
					            dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe, error_analysis=False)
 | 
				
			||||||
 | 
					            print("dev acc combo avg:", round(dev_acc_combo, 3),
 | 
				
			||||||
 | 
					                  [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # STEP 8: apply the EL pipe on a toy example
 | 
				
			||||||
 | 
					        if to_test_pipeline:
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					            print("STEP 8: applying Entity Linking to toy example", datetime.datetime.now())
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					            run_el_toy_example(nlp=nlp_2)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # STEP 9: write the NLP pipeline (including entity linker) to file
 | 
				
			||||||
 | 
					        if to_write_nlp:
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					            print("STEP 9: testing NLP IO", datetime.datetime.now())
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					            print("writing to", NLP_2_DIR)
 | 
				
			||||||
 | 
					            nlp_2.to_disk(NLP_2_DIR)
 | 
				
			||||||
 | 
					            print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # verify that the IO has gone correctly
 | 
				
			||||||
 | 
					    if to_read_nlp:
 | 
				
			||||||
 | 
					        print("reading from", NLP_2_DIR)
 | 
				
			||||||
 | 
					        nlp_3 = spacy.load(NLP_2_DIR)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("running toy example with NLP 3")
 | 
				
			||||||
 | 
					        run_el_toy_example(nlp=nlp_3)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # testing performance with an NLP model from file
 | 
				
			||||||
 | 
					    if test_from_file:
 | 
				
			||||||
 | 
					        nlp_2 = spacy.load(NLP_1_DIR)
 | 
				
			||||||
 | 
					        nlp_3 = spacy.load(NLP_2_DIR)
 | 
				
			||||||
 | 
					        el_pipe = nlp_3.get_pipe("entity_linker")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        dev_limit = 5000
 | 
				
			||||||
 | 
					        dev_data = training_set_creator.read_training(nlp=nlp_2,
 | 
				
			||||||
 | 
					                                                      training_dir=TRAINING_DIR,
 | 
				
			||||||
 | 
					                                                      dev=True,
 | 
				
			||||||
 | 
					                                                      limit=dev_limit)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("Dev testing from file on", len(dev_data), "articles")
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe=el_pipe, error_analysis=False)
 | 
				
			||||||
 | 
					        print("dev acc combo avg:", round(dev_acc_combo, 3),
 | 
				
			||||||
 | 
					              [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    print()
 | 
				
			||||||
 | 
					    print("STOP", datetime.datetime.now())
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def _measure_accuracy(data, el_pipe=None, error_analysis=False):
 | 
				
			||||||
 | 
					    # If the docs in the data require further processing with an entity linker, set el_pipe
 | 
				
			||||||
 | 
					    correct_by_label = dict()
 | 
				
			||||||
 | 
					    incorrect_by_label = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    docs = [d for d, g in data if len(d) > 0]
 | 
				
			||||||
 | 
					    if el_pipe is not None:
 | 
				
			||||||
 | 
					        docs = list(el_pipe.pipe(docs))
 | 
				
			||||||
 | 
					    golds = [g for d, g in data if len(d) > 0]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    for doc, gold in zip(docs, golds):
 | 
				
			||||||
 | 
					        try:
 | 
				
			||||||
 | 
					            correct_entries_per_article = dict()
 | 
				
			||||||
 | 
					            for entity in gold.links:
 | 
				
			||||||
 | 
					                start, end, gold_kb = entity
 | 
				
			||||||
 | 
					                correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            for ent in doc.ents:
 | 
				
			||||||
 | 
					                ent_label = ent.label_
 | 
				
			||||||
 | 
					                pred_entity = ent.kb_id_
 | 
				
			||||||
 | 
					                start = ent.start_char
 | 
				
			||||||
 | 
					                end = ent.end_char
 | 
				
			||||||
 | 
					                gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
 | 
				
			||||||
 | 
					                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
 | 
				
			||||||
 | 
					                if gold_entity is not None:
 | 
				
			||||||
 | 
					                    if gold_entity == pred_entity:
 | 
				
			||||||
 | 
					                        correct = correct_by_label.get(ent_label, 0)
 | 
				
			||||||
 | 
					                        correct_by_label[ent_label] = correct + 1
 | 
				
			||||||
 | 
					                    else:
 | 
				
			||||||
 | 
					                        incorrect = incorrect_by_label.get(ent_label, 0)
 | 
				
			||||||
 | 
					                        incorrect_by_label[ent_label] = incorrect + 1
 | 
				
			||||||
 | 
					                        if error_analysis:
 | 
				
			||||||
 | 
					                            print(ent.text, "in", doc)
 | 
				
			||||||
 | 
					                            print("Predicted",  pred_entity, "should have been", gold_entity)
 | 
				
			||||||
 | 
					                            print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        except Exception as e:
 | 
				
			||||||
 | 
					            print("Error assessing accuracy", e)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    acc, acc_by_label = calculate_acc(correct_by_label,  incorrect_by_label)
 | 
				
			||||||
 | 
					    return acc, acc_by_label
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def _measure_baselines(data, kb):
 | 
				
			||||||
 | 
					    # Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound
 | 
				
			||||||
 | 
					    counts_by_label = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    random_correct_by_label = dict()
 | 
				
			||||||
 | 
					    random_incorrect_by_label = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    oracle_correct_by_label = dict()
 | 
				
			||||||
 | 
					    oracle_incorrect_by_label = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    prior_correct_by_label = dict()
 | 
				
			||||||
 | 
					    prior_incorrect_by_label = dict()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    docs = [d for d, g in data if len(d) > 0]
 | 
				
			||||||
 | 
					    golds = [g for d, g in data if len(d) > 0]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    for doc, gold in zip(docs, golds):
 | 
				
			||||||
 | 
					        try:
 | 
				
			||||||
 | 
					            correct_entries_per_article = dict()
 | 
				
			||||||
 | 
					            for entity in gold.links:
 | 
				
			||||||
 | 
					                start, end, gold_kb = entity
 | 
				
			||||||
 | 
					                correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            for ent in doc.ents:
 | 
				
			||||||
 | 
					                ent_label = ent.label_
 | 
				
			||||||
 | 
					                start = ent.start_char
 | 
				
			||||||
 | 
					                end = ent.end_char
 | 
				
			||||||
 | 
					                gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
 | 
				
			||||||
 | 
					                if gold_entity is not None:
 | 
				
			||||||
 | 
					                    counts_by_label[ent_label] = counts_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					                    candidates = kb.get_candidates(ent.text)
 | 
				
			||||||
 | 
					                    oracle_candidate = ""
 | 
				
			||||||
 | 
					                    best_candidate = ""
 | 
				
			||||||
 | 
					                    random_candidate = ""
 | 
				
			||||||
 | 
					                    if candidates:
 | 
				
			||||||
 | 
					                        scores = []
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                        for c in candidates:
 | 
				
			||||||
 | 
					                            scores.append(c.prior_prob)
 | 
				
			||||||
 | 
					                            if c.entity_ == gold_entity:
 | 
				
			||||||
 | 
					                                oracle_candidate = c.entity_
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                        best_index = scores.index(max(scores))
 | 
				
			||||||
 | 
					                        best_candidate = candidates[best_index].entity_
 | 
				
			||||||
 | 
					                        random_candidate = random.choice(candidates).entity_
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                    if gold_entity == best_candidate:
 | 
				
			||||||
 | 
					                        prior_correct_by_label[ent_label] = prior_correct_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					                    else:
 | 
				
			||||||
 | 
					                        prior_incorrect_by_label[ent_label] = prior_incorrect_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                    if gold_entity == random_candidate:
 | 
				
			||||||
 | 
					                        random_correct_by_label[ent_label] = random_correct_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					                    else:
 | 
				
			||||||
 | 
					                        random_incorrect_by_label[ent_label] = random_incorrect_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					                    if gold_entity == oracle_candidate:
 | 
				
			||||||
 | 
					                        oracle_correct_by_label[ent_label] = oracle_correct_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					                    else:
 | 
				
			||||||
 | 
					                        oracle_incorrect_by_label[ent_label] = oracle_incorrect_by_label.get(ent_label, 0) + 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        except Exception as e:
 | 
				
			||||||
 | 
					            print("Error assessing accuracy", e)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    acc_prior, acc_prior_by_label = calculate_acc(prior_correct_by_label, prior_incorrect_by_label)
 | 
				
			||||||
 | 
					    acc_rand, acc_rand_by_label = calculate_acc(random_correct_by_label, random_incorrect_by_label)
 | 
				
			||||||
 | 
					    acc_oracle, acc_oracle_by_label = calculate_acc(oracle_correct_by_label, oracle_incorrect_by_label)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    return counts_by_label, acc_rand, acc_rand_by_label, acc_prior, acc_prior_by_label, acc_oracle, acc_oracle_by_label
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def calculate_acc(correct_by_label, incorrect_by_label):
 | 
				
			||||||
 | 
					    acc_by_label = dict()
 | 
				
			||||||
 | 
					    total_correct = 0
 | 
				
			||||||
 | 
					    total_incorrect = 0
 | 
				
			||||||
 | 
					    all_keys = set()
 | 
				
			||||||
 | 
					    all_keys.update(correct_by_label.keys())
 | 
				
			||||||
 | 
					    all_keys.update(incorrect_by_label.keys())
 | 
				
			||||||
 | 
					    for label in sorted(all_keys):
 | 
				
			||||||
 | 
					        correct = correct_by_label.get(label, 0)
 | 
				
			||||||
 | 
					        incorrect = incorrect_by_label.get(label, 0)
 | 
				
			||||||
 | 
					        total_correct += correct
 | 
				
			||||||
 | 
					        total_incorrect += incorrect
 | 
				
			||||||
 | 
					        if correct == incorrect == 0:
 | 
				
			||||||
 | 
					            acc_by_label[label] = 0
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            acc_by_label[label] = correct / (correct + incorrect)
 | 
				
			||||||
 | 
					    acc = 0
 | 
				
			||||||
 | 
					    if not (total_correct == total_incorrect == 0):
 | 
				
			||||||
 | 
					        acc = total_correct / (total_correct + total_incorrect)
 | 
				
			||||||
 | 
					    return acc, acc_by_label
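A tiny worked example of this helper (illustrative numbers only):

# calculate_acc({"PERSON": 8, "ORG": 1}, {"PERSON": 2, "ORG": 3})
#   -> acc = 9 / 14 ≈ 0.643 (micro-average over all evaluated entities)
#   -> acc_by_label = {"ORG": 0.25, "PERSON": 0.8}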
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def check_kb(kb):
 | 
				
			||||||
 | 
					    for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
 | 
				
			||||||
 | 
					        candidates = kb.get_candidates(mention)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        print("generating candidates for " + mention + " :")
 | 
				
			||||||
 | 
					        for c in candidates:
 | 
				
			||||||
 | 
					            print(" ", c.prior_prob, c.alias_, "-->", c.entity_ + " (freq=" + str(c.entity_freq) + ")")
 | 
				
			||||||
 | 
					        print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def run_el_toy_example(nlp):
 | 
				
			||||||
 | 
					    text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \
 | 
				
			||||||
 | 
					           "Douglas reminds us to always bring our towel, even in China or Brazil. " \
 | 
				
			||||||
 | 
					           "The main character in Doug's novel is the man Arthur Dent, " \
 | 
				
			||||||
 | 
					           "but Douglas doesn't write about George Washington or Homer Simpson."
 | 
				
			||||||
 | 
					    doc = nlp(text)
 | 
				
			||||||
 | 
					    print(text)
 | 
				
			||||||
 | 
					    for ent in doc.ents:
 | 
				
			||||||
 | 
					        print(" ent", ent.text, ent.label_, ent.kb_id_)
 | 
				
			||||||
 | 
					    print()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					if __name__ == "__main__":
 | 
				
			||||||
 | 
					    run_pipeline()
 | 
				
			||||||
| 
						 | 
					@ -5,6 +5,6 @@ requires = ["setuptools",
 | 
				
			||||||
            "cymem>=2.0.2,<2.1.0",
 | 
					            "cymem>=2.0.2,<2.1.0",
 | 
				
			||||||
            "preshed>=2.0.1,<2.1.0",
 | 
					            "preshed>=2.0.1,<2.1.0",
 | 
				
			||||||
            "murmurhash>=0.28.0,<1.1.0",
 | 
					            "murmurhash>=0.28.0,<1.1.0",
 | 
				
			||||||
            "thinc==7.0.0.dev6",
 | 
					            "thinc>=7.0.8,<7.1.0",
 | 
				
			||||||
            ]
 | 
					            ]
 | 
				
			||||||
build-backend = "setuptools.build_meta"
 | 
					build-backend = "setuptools.build_meta"
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -1,7 +1,7 @@
 | 
				
			||||||
# Our libraries
 | 
					# Our libraries
 | 
				
			||||||
cymem>=2.0.2,<2.1.0
 | 
					cymem>=2.0.2,<2.1.0
 | 
				
			||||||
preshed>=2.0.1,<2.1.0
 | 
					preshed>=2.0.1,<2.1.0
 | 
				
			||||||
thinc>=7.0.2,<7.1.0
 | 
					thinc>=7.0.8,<7.1.0
 | 
				
			||||||
blis>=0.2.2,<0.3.0
 | 
					blis>=0.2.2,<0.3.0
 | 
				
			||||||
murmurhash>=0.28.0,<1.1.0
 | 
					murmurhash>=0.28.0,<1.1.0
 | 
				
			||||||
wasabi>=0.2.0,<1.1.0
 | 
					wasabi>=0.2.0,<1.1.0
 | 
				
			||||||
| 
						 | 
					
 | 
				
3  setup.py
						 | 
					@ -228,7 +228,7 @@ def setup_package():
 | 
				
			||||||
                "murmurhash>=0.28.0,<1.1.0",
 | 
					                "murmurhash>=0.28.0,<1.1.0",
 | 
				
			||||||
                "cymem>=2.0.2,<2.1.0",
 | 
					                "cymem>=2.0.2,<2.1.0",
 | 
				
			||||||
                "preshed>=2.0.1,<2.1.0",
 | 
					                "preshed>=2.0.1,<2.1.0",
 | 
				
			||||||
                "thinc>=7.0.2,<7.1.0",
 | 
					                "thinc>=7.0.8,<7.1.0",
 | 
				
			||||||
                "blis>=0.2.2,<0.3.0",
 | 
					                "blis>=0.2.2,<0.3.0",
 | 
				
			||||||
                "plac<1.0.0,>=0.9.6",
 | 
					                "plac<1.0.0,>=0.9.6",
 | 
				
			||||||
                "requests>=2.13.0,<3.0.0",
 | 
					                "requests>=2.13.0,<3.0.0",
 | 
				
			||||||
| 
						 | 
					@ -246,6 +246,7 @@ def setup_package():
 | 
				
			||||||
                "cuda100": ["thinc_gpu_ops>=0.0.1,<0.1.0", "cupy-cuda100>=5.0.0b4"],
 | 
					                "cuda100": ["thinc_gpu_ops>=0.0.1,<0.1.0", "cupy-cuda100>=5.0.0b4"],
 | 
				
			||||||
                # Language tokenizers with external dependencies
 | 
					                # Language tokenizers with external dependencies
 | 
				
			||||||
                "ja": ["mecab-python3==0.7"],
 | 
					                "ja": ["mecab-python3==0.7"],
 | 
				
			||||||
 | 
					                "ko": ["natto-py==0.9.0"],
 | 
				
			||||||
            },
 | 
					            },
 | 
				
			||||||
            python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
 | 
					            python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
 | 
				
			||||||
            classifiers=[
 | 
					            classifiers=[
 | 
				
			||||||
| 
						 | 
					
 | 
				
59  spacy/_ml.py
@@ -24,7 +24,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
 import thinc.extra.load_nlp
 
 from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
-from .errors import Errors
+from .errors import Errors, user_warning, Warnings
 from . import util
 
 try:
@@ -299,7 +299,17 @@ def link_vectors_to_models(vocab):
     data = ops.asarray(vectors.data)
     # Set an entry here, so that vectors are accessed by StaticVectors
     # (unideal, I know)
-    thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
+    key = (ops.device, vectors.name)
+    if key in thinc.extra.load_nlp.VECTORS:
+        if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
+            # This is a hack to avoid the problem in #3853. Maybe we should
+            # print a warning as well?
+            old_name = vectors.name
+            new_name = vectors.name + "_%d" % data.shape[0]
+            user_warning(Warnings.W019.format(old=old_name, new=new_name))
+            vectors.name = new_name
+            key = (ops.device, vectors.name)
+    thinc.extra.load_nlp.VECTORS[key] = data
 
 
 def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
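The new branch in link_vectors_to_models renames a vectors table when a different table with the same name is already registered, so the cached entry is not silently overwritten. A minimal standalone sketch of the same idea, using a plain dict in place of thinc.extra.load_nlp.VECTORS (the table names and shapes below are invented for illustration):

    import numpy

    VECTORS = {}  # stands in for thinc.extra.load_nlp.VECTORS

    def register_vectors(device, name, data):
        """Store data under (device, name), renaming on a shape clash."""
        key = (device, name)
        if key in VECTORS and VECTORS[key].shape != data.shape:
            # same trick as Warnings.W019: append the row count to the name
            name = name + "_%d" % data.shape[0]
            key = (device, name)
        VECTORS[key] = data
        return name

    register_vectors("cpu", "en_vectors", numpy.zeros((100, 64), dtype="f"))
    register_vectors("cpu", "en_vectors", numpy.zeros((200, 64), dtype="f"))  # renamed to "en_vectors_200"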
@@ -652,6 +662,51 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
     return model
 
 
+def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg):
+    # TODO proper error
+    if "entity_width" not in cfg:
+        raise ValueError("entity_width not found")
+    if "context_width" not in cfg:
+        raise ValueError("context_width not found")
+
+    conv_depth = cfg.get("conv_depth", 2)
+    cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3)
+    pretrained_vectors = cfg.get("pretrained_vectors")  # self.nlp.vocab.vectors.name
+    context_width = cfg.get("context_width")
+    entity_width = cfg.get("entity_width")
+
+    with Model.define_operators({">>": chain, "**": clone}):
+        model = (
+            Affine(entity_width, entity_width + context_width + 1 + ner_types)
+            >> Affine(1, entity_width, drop_factor=0.0)
+            >> logistic
+        )
+
+        # context encoder
+        tok2vec = (
+            Tok2Vec(
+                width=hidden_width,
+                embed_size=embed_width,
+                pretrained_vectors=pretrained_vectors,
+                cnn_maxout_pieces=cnn_maxout_pieces,
+                subword_features=True,
+                conv_depth=conv_depth,
+                bilstm_depth=0,
+            )
+            >> flatten_add_lengths
+            >> Pooling(mean_pool)
+            >> Residual(zero_init(Maxout(hidden_width, hidden_width)))
+            >> zero_init(Affine(context_width, hidden_width))
+        )
+
+        model.tok2vec = tok2vec
+
+    model.tok2vec = tok2vec
+    model.tok2vec.nO = context_width
+    model.nO = 1
+    return model
+
+
 @layerize
 def flatten(seqs, drop=0.0):
     ops = Model.ops
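The encoder above is composed with thinc's overloaded ">>" operator (chain), the same pattern used throughout _ml.py. A small, self-contained sketch of that composition style, assuming thinc 7.x is installed; the layer sizes are arbitrary example values, and this illustrates the operator overloading rather than the NEL model itself:

    import numpy
    from thinc.api import chain
    from thinc.v2v import Model, Affine

    with Model.define_operators({">>": chain}):
        # two dense layers: 16-dim input -> 8 -> 2
        encoder = Affine(8, 16) >> Affine(2, 8)

    X = numpy.zeros((4, 16), dtype="float32")
    Y = encoder(X)  # forward pass, shape (4, 2)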
@@ -4,13 +4,13 @@
 # fmt: off
 
 __title__ = "spacy"
-__version__ = "2.1.4"
+__version__ = "2.1.5"
 __summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
 __uri__ = "https://spacy.io"
 __author__ = "Explosion AI"
 __email__ = "contact@explosion.ai"
 __license__ = "MIT"
-__release__ = False
+__release__ = True
 
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -82,6 +82,7 @@ cdef enum attr_id_t:
     DEP
     ENT_IOB
     ENT_TYPE
+    ENT_KB_ID
     HEAD
     SENT_START
     SPACY
@@ -84,6 +84,7 @@ IDS = {
    "DEP": DEP,
    "ENT_IOB": ENT_IOB,
    "ENT_TYPE": ENT_TYPE,
+    "ENT_KB_ID": ENT_KB_ID,
    "HEAD": HEAD,
    "SENT_START": SENT_START,
    "SPACY": SPACY,
@@ -5,6 +5,7 @@ import plac
 import random
 import numpy
 import time
+import re
 from collections import Counter
 from pathlib import Path
 from thinc.v2v import Affine, Maxout
@@ -65,6 +66,13 @@ from .train import _load_pretrained_tok2vec
         "t2v",
         Path,
     ),
+    epoch_start=(
+        "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
+        "renamed. Prevents unintended overwriting of existing weight files.",
+        "option",
+        "es",
+        int
+    ),
 )
 def pretrain(
     texts_loc,
@@ -83,6 +91,7 @@ def pretrain(
     seed=0,
     n_save_every=None,
     init_tok2vec=None,
+    epoch_start=None,
 ):
     """
     Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@@ -151,9 +160,29 @@ def pretrain(
     if init_tok2vec is not None:
         components = _load_pretrained_tok2vec(nlp, init_tok2vec)
         msg.text("Loaded pretrained tok2vec for: {}".format(components))
+        # Parse the epoch number from the given weight file
+        model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
+        if model_name:
+            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+            epoch_start = int(model_name.group(0)[5:][:-4]) + 1
+        else:
+            if not epoch_start:
+                msg.fail(
+                    "You have to use the '--epoch-start' argument when using a renamed weight file for "
+                    "'--init-tok2vec'", exits=True
+                )
+            elif epoch_start < 0:
+                msg.fail(
+                    "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start,
+                    exits=True
+                )
+    else:
+        # Without '--init-tok2vec' the '--epoch-start' argument is ignored
+        epoch_start = 0
+
 
     optimizer = create_default_optimizer(model.ops)
     tracker = ProgressTracker(frequency=10000)
-    msg.divider("Pre-training tok2vec layer")
+    msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
     row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
     msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
@@ -174,7 +203,7 @@ def pretrain(
                 file_.write(srsly.json_dumps(log) + "\n")
 
     skip_counter = 0
-    for epoch in range(n_iter):
+    for epoch in range(epoch_start, n_iter + epoch_start):
         for batch_id, batch in enumerate(
             util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
         ):
@@ -272,7 +301,7 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
     elif objective == "cosine":
         loss, d_target = get_cossim_loss(prediction, target)
     else:
-        raise ValueError(Errors.E139.format(loss_func=objective))
+        raise ValueError(Errors.E142.format(loss_func=objective))
     return loss, d_target
 
 
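The pretrain changes recover the epoch number from a default weight file name such as model12.bin and only require --epoch-start when the file has been renamed. The parsing step in isolation, as a sketch mirroring the regex used above (the file names are examples):

    import re

    def parse_epoch_start(init_tok2vec):
        match = re.search(r"model\d+\.bin", str(init_tok2vec))
        if match:
            # strip the "model" prefix and ".bin" suffix, resume at the next epoch
            return int(match.group(0)[5:-4]) + 1
        return None  # renamed file: the caller must pass --epoch-start

    print(parse_epoch_start("weights/model12.bin"))  # 13
    print(parse_epoch_start("weights/custom.bin"))   # None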
@@ -82,6 +82,8 @@ class Warnings(object):
             "parallel inference via multiprocessing.")
     W017 = ("Alias '{alias}' already exists in the Knowledge base.")
     W018 = ("Entity '{entity}' already exists in the Knowledge base.")
+    W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
+            "previously loaded vectors. See Issue #3853.")
 
 
 @add_codes
@@ -399,7 +401,11 @@ class Errors(object):
     E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input includes either the "
             "`text` or `tokens` key. For more info, see the docs:\n"
             "https://spacy.io/api/cli#pretrain-jsonl")
-    E139 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
+    E139 = ("Knowledge base for component '{name}' not initialized. Did you forget to call set_kb()?")
+    E140 = ("The list of entities, prior probabilities and entity vectors should be of equal length.")
+    E141 = ("Entity vectors should be of length {required} instead of the provided {found}.")
+    E142 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
+    E143 = ("Labels for component '{name}' not initialized. Did you forget to call add_label()?")
 
 
 @add_codes
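W019 is emitted through the existing user_warning helper, with the old and new table names filled into the template. A tiny sketch of how the new code is raised (the vector table names are made up, and this assumes a spaCy build containing this commit):

    from spacy.errors import Warnings, user_warning

    user_warning(Warnings.W019.format(old="my_vectors", new="my_vectors_20000"))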
@@ -31,6 +31,7 @@ cdef class GoldParse:
     cdef public list ents
     cdef public dict brackets
     cdef public object cats
+    cdef public list links
 
     cdef readonly list cand_to_gold
     cdef readonly list gold_to_cand
@@ -427,7 +427,7 @@ cdef class GoldParse:
 
     def __init__(self, doc, annot_tuples=None, words=None, tags=None,
                  heads=None, deps=None, entities=None, make_projective=False,
-                 cats=None, **_):
+                 cats=None, links=None, **_):
         """Create a GoldParse.
 
         doc (Doc): The document the annotations refer to.
@@ -450,6 +450,8 @@ cdef class GoldParse:
             examples of a label to have the value 0.0. Labels not in the
             dictionary are treated as missing - the gradient for those labels
             will be zero.
+        links (iterable): A sequence of `(start_char, end_char, kb_id)` tuples,
+            representing the external ID of an entity in a knowledge base.
         RETURNS (GoldParse): The newly constructed object.
         """
         if words is None:
@@ -485,6 +487,7 @@ cdef class GoldParse:
         self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
 
         self.cats = {} if cats is None else dict(cats)
+        self.links = links
         self.words = [None] * len(doc)
         self.tags = [None] * len(doc)
         self.heads = [None] * len(doc)
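With the new links argument, entity-linking annotations can be attached to a GoldParse directly. An illustrative sketch, not part of the commit; the KB identifier "Q42" is just an example value:

    import spacy
    from spacy.gold import GoldParse

    nlp = spacy.blank("en")
    doc = nlp("Douglas Adams wrote a book.")
    # one (start_char, end_char, kb_id) tuple for the span "Douglas Adams"
    gold = GoldParse(doc, links=[(0, 13, "Q42")])
    print(gold.links)  # [(0, 13, "Q42")]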
170  spacy/kb.pxd
@@ -1,53 +1,27 @@
 """Knowledge-base for entity or concept linking."""
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap
 
 from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
+from libc.stdio cimport FILE
 
 from spacy.vocab cimport Vocab
 from .typedefs cimport hash_t
 
-
-# Internal struct, for storage and disambiguation. This isn't what we return
-# to the user as the answer to "here's your entity". It's the minimum number
-# of bits we need to keep track of the answers.
-cdef struct _EntryC:
-
-    # The hash of this entry's unique ID and name in the kB
-    hash_t entity_hash
-
-    # Allows retrieval of one or more vectors.
-    # Each element of vector_rows should be an index into a vectors table.
-    # Every entry should have the same number of vectors, so we can avoid storing
-    # the number of vectors in each knowledge-base struct
-    int32_t* vector_rows
-
-    # Allows retrieval of a struct of non-vector features. We could make this a
-    # pointer, but we have 32 bits left over in the struct after prob, so we'd
-    # like this to only be 32 bits. We can also set this to -1, for the common
-    # case where there are no features.
-    int32_t feats_row
-
-    # log probability of entity, based on corpus frequency
-    float prob
-
-
-# Each alias struct stores a list of Entry pointers with their prior probabilities
-# for this specific mention/alias.
-cdef struct _AliasC:
-
-    # All entry candidates for this alias
-    vector[int64_t] entry_indices
-
-    # Prior probability P(entity|alias) - should sum up to (at most) 1.
-    vector[float] probs
+from .structs cimport KBEntryC, AliasC
+ctypedef vector[KBEntryC] entry_vec
+ctypedef vector[AliasC] alias_vec
+ctypedef vector[float] float_vec
+ctypedef vector[float_vec] float_matrix
 
 
 # Object used by the Entity Linker that summarizes one entity-alias candidate combination.
 cdef class Candidate:
 
     cdef readonly KnowledgeBase kb
     cdef hash_t entity_hash
+    cdef float entity_freq
+    cdef vector[float] entity_vector
     cdef hash_t alias_hash
     cdef float prior_prob
@@ -55,9 +29,10 @@ cdef class Candidate:
 cdef class KnowledgeBase:
     cdef Pool mem
     cpdef readonly Vocab vocab
+    cdef int64_t entity_vector_length
 
     # This maps 64bit keys (hash of unique entity string)
-    # to 64bit values (position of the _EntryC struct in the _entries vector).
+    # to 64bit values (position of the _KBEntryC struct in the _entries vector).
     # The PreshMap is pretty space efficient, as it uses open addressing. So
     # the only overhead is the vacancy rate, which is approximately 30%.
     cdef PreshMap _entry_index
@@ -66,7 +41,7 @@ cdef class KnowledgeBase:
     # over allocation.
     # In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries.
     # Storing 1m entries would take 41.6mb under this scheme.
-    cdef vector[_EntryC] _entries
+    cdef entry_vec _entries
 
     # This maps 64bit keys (hash of unique alias string)
     # to 64bit values (position of the _AliasC struct in the _aliases_table vector).
@@ -76,7 +51,7 @@ cdef class KnowledgeBase:
     # should be P(entity | mention), which is pretty important to know.
     # We can pack both pieces of information into a 64-bit value, to keep things
     # efficient.
-    cdef vector[_AliasC] _aliases_table
+    cdef alias_vec _aliases_table
 
     # This is the part which might take more space: storing various
     # categorical features for the entries, and storing vectors for disambiguation
@@ -87,7 +62,7 @@ cdef class KnowledgeBase:
     # model, that embeds different features of the entities into vectors. We'll
     # still want some per-entity features, like the Wikipedia text or entity
    # co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions.
-    cdef object _vectors_table
+    cdef float_matrix _vectors_table
 
     # It's very useful to track categorical features, at least for output, even
     # if they're not useful in the model itself. For instance, we should be
@@ -96,53 +71,102 @@ cdef class KnowledgeBase:
     # optional data, we can let users configure a DB as the backend for this.
     cdef object _features_table
 
+
+    cdef inline int64_t c_add_vector(self, vector[float] entity_vector) nogil:
+        """Add an entity vector to the vectors table."""
+        cdef int64_t new_index = self._vectors_table.size()
+        self._vectors_table.push_back(entity_vector)
+        return new_index
+
+
     cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob,
-                                     int32_t* vector_rows, int feats_row):
-        """Add an entry to the knowledge base."""
-        # This is what we'll map the hash key to. It's where the entry will sit
+                                     int32_t vector_index, int feats_row) nogil:
+        """Add an entry to the vector of entries.
+        After calling this method, make sure to update also the _entry_index using the return value"""
+        # This is what we'll map the entity hash key to. It's where the entry will sit
         # in the vector of entries, so we can get it later.
         cdef int64_t new_index = self._entries.size()
-        self._entries.push_back(
-            _EntryC(
-                entity_hash=entity_hash,
-                vector_rows=vector_rows,
-                feats_row=feats_row,
-                prob=prob
-            ))
-        self._entry_index[entity_hash] = new_index
+
+        # Avoid struct initializer to enable nogil, cf https://github.com/cython/cython/issues/1642
+        cdef KBEntryC entry
+        entry.entity_hash = entity_hash
+        entry.vector_index = vector_index
+        entry.feats_row = feats_row
+        entry.prob = prob
+
+        self._entries.push_back(entry)
        return new_index
 
-    cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs):
-        """Connect a mention to a list of potential entities with their prior probabilities ."""
+    cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs) nogil:
+        """Connect a mention to a list of potential entities with their prior probabilities .
+        After calling this method, make sure to update also the _alias_index using the return value"""
+        # This is what we'll map the alias hash key to. It's where the alias will be defined
+        # in the vector of aliases.
         cdef int64_t new_index = self._aliases_table.size()
-        self._aliases_table.push_back(
-            _AliasC(
-                entry_indices=entry_indices,
-                probs=probs
-            ))
-        self._alias_index[alias_hash] = new_index
+
+        # Avoid struct initializer to enable nogil
+        cdef AliasC alias
+        alias.entry_indices = entry_indices
+        alias.probs = probs
+
+        self._aliases_table.push_back(alias)
         return new_index
 
-    cdef inline _create_empty_vectors(self):
+    cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil:
         """
-        Making sure the first element of each vector is a dummy,
+        Initializing the vectors and making sure the first element of each vector is a dummy,
         because the PreshMap maps pointing to indices in these vectors can not contain 0 as value
         cf. https://github.com/explosion/preshed/issues/17
         """
         cdef int32_t dummy_value = 0
-        self.vocab.strings.add("")
-        self._entries.push_back(
-            _EntryC(
-                entity_hash=self.vocab.strings[""],
-                vector_rows=&dummy_value,
-                feats_row=dummy_value,
-                prob=dummy_value
-            ))
-        self._aliases_table.push_back(
-            _AliasC(
-                entry_indices=[dummy_value],
-                probs=[dummy_value]
-            ))
+
+        # Avoid struct initializer to enable nogil
+        cdef KBEntryC entry
+        entry.entity_hash = dummy_hash
+        entry.vector_index = dummy_value
+        entry.feats_row = dummy_value
+        entry.prob = dummy_value
+
+        # Avoid struct initializer to enable nogil
+        cdef vector[int64_t] dummy_entry_indices
+        dummy_entry_indices.push_back(0)
+        cdef vector[float] dummy_probs
+        dummy_probs.push_back(0)
+
+        cdef AliasC alias
+        alias.entry_indices = dummy_entry_indices
+        alias.probs = dummy_probs
+
+        self._entries.push_back(entry)
+        self._aliases_table.push_back(alias)
+
+    cpdef load_bulk(self, loc)
+    cpdef set_entities(self, entity_list, prob_list, vector_list)
+
+
+cdef class Writer:
+    cdef FILE* _fp
+
+    cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1
+    cdef int write_vector_element(self, float element) except -1
+    cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1
+
+    cdef int write_alias_length(self, int64_t alias_length) except -1
+    cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1
+    cdef int write_alias(self, int64_t entry_index, float prob) except -1
+
+    cdef int _write(self, void* value, size_t size) except -1
+
+
+cdef class Reader:
+    cdef FILE* _fp
+
+    cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1
+    cdef int read_vector_element(self, float* element) except -1
+    cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1
+
+    cdef int read_alias_length(self, int64_t* alias_length) except -1
+    cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1
+    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
+
+    cdef int _read(self, void* value, size_t size) except -1
397  spacy/kb.pyx
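Before the spacy/kb.pyx hunks below, a pure-Python picture of the storage scheme the spacy/kb.pxd declarations set up: entries and aliases live in flat vectors with a dummy element at index 0, because the PreshMap indices (hash to position) cannot store 0 as a value. This mock is only an illustration of the layout; the real implementation is the Cython code in the diff:

    entries = [None]   # index 0 is the dummy entry
    entry_index = {}   # entity hash -> position in entries, never 0

    aliases = [None]   # index 0 is the dummy alias
    alias_index = {}   # alias hash -> position in aliases, never 0

    def add_entity(entity_hash, prob, vector_index):
        entries.append({"entity_hash": entity_hash, "prob": prob,
                        "vector_index": vector_index, "feats_row": -1})
        entry_index[entity_hash] = len(entries) - 1   # always >= 1

    def add_alias(alias_hash, entry_indices, priors):
        aliases.append({"entry_indices": entry_indices, "probs": priors})
        alias_index[alias_hash] = len(aliases) - 1    # always >= 1

    add_entity(entity_hash=111, prob=0.9, vector_index=0)
    add_alias(alias_hash=222, entry_indices=[1], priors=[1.0])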
@@ -1,13 +1,30 @@
+# cython: infer_types=True
 # cython: profile=True
 # coding: utf8
 from spacy.errors import Errors, Warnings, user_warning
 
+from pathlib import Path
+from cymem.cymem cimport Pool
+from preshed.maps cimport PreshMap
+
+from cpython.exc cimport PyErr_SetFromErrno
+
+from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
+from libc.stdint cimport int32_t, int64_t
+
+from .typedefs cimport hash_t
+
+from os import path
+from libcpp.vector cimport vector
+
 
 cdef class Candidate:
 
-    def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob):
+    def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob):
         self.kb = kb
         self.entity_hash = entity_hash
+        self.entity_freq = entity_freq
+        self.entity_vector = entity_vector
         self.alias_hash = alias_hash
         self.prior_prob = prior_prob
 
@@ -19,7 +36,7 @@ cdef class Candidate:
     @property
     def entity_(self):
         """RETURNS (unicode): ID/name of this entity in the KB"""
-        return self.kb.vocab.strings[self.entity]
+        return self.kb.vocab.strings[self.entity_hash]
 
     @property
     def alias(self):
@@ -29,7 +46,15 @@ cdef class Candidate:
     @property
     def alias_(self):
         """RETURNS (unicode): ID of the original alias"""
-        return self.kb.vocab.strings[self.alias]
+        return self.kb.vocab.strings[self.alias_hash]
+
+    @property
+    def entity_freq(self):
+        return self.entity_freq
+
+    @property
+    def entity_vector(self):
+        return self.entity_vector
 
     @property
     def prior_prob(self):
@@ -38,26 +63,41 @@ cdef class Candidate:
 
 cdef class KnowledgeBase:
 
-    def __init__(self, Vocab vocab):
+    def __init__(self, Vocab vocab, entity_vector_length):
         self.vocab = vocab
+        self.mem = Pool()
+        self.entity_vector_length = entity_vector_length
+
         self._entry_index = PreshMap()
         self._alias_index = PreshMap()
-        self.mem = Pool()
-        self._create_empty_vectors()
+
+        self.vocab.strings.add("")
+        self._create_empty_vectors(dummy_hash=self.vocab.strings[""])
+
+    @property
+    def entity_vector_length(self):
+        """RETURNS (uint64): length of the entity vectors"""
+        return self.entity_vector_length
 
     def __len__(self):
         return self.get_size_entities()
 
     def get_size_entities(self):
-        return self._entries.size() - 1  # not counting dummy element on index 0
+        return len(self._entry_index)
+
+    def get_entity_strings(self):
+        return [self.vocab.strings[x] for x in self._entry_index]
 
     def get_size_aliases(self):
-        return self._aliases_table.size() - 1 # not counting dummy element on index 0
+        return len(self._alias_index)
+
+    def get_alias_strings(self):
+        return [self.vocab.strings[x] for x in self._alias_index]
 
-    def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None):
+    def add_entity(self, unicode entity, float prob, vector[float] entity_vector):
         """
-        Add an entity to the KB.
-        Return the hash of the entity ID at the end
+        Add an entity to the KB, optionally specifying its log probability based on corpus frequency
+        Return the hash of the entity ID/name at the end.
         """
         cdef hash_t entity_hash = self.vocab.strings.add(entity)
 
@@ -66,40 +106,72 @@ cdef class KnowledgeBase:
             user_warning(Warnings.W018.format(entity=entity))
             return
 
-        cdef int32_t dummy_value = 342
-        self.c_add_entity(entity_hash=entity_hash, prob=prob,
-                          vector_rows=&dummy_value, feats_row=dummy_value)
-        # TODO self._vectors_table.get_pointer(vectors),
-        # self._features_table.get(features))
+        # Raise an error if the provided entity vector is not of the correct length
+        if len(entity_vector) != self.entity_vector_length:
+            raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
+
+        vector_index = self.c_add_vector(entity_vector=entity_vector)
+
+        new_index = self.c_add_entity(entity_hash=entity_hash,
+                                      prob=prob,
+                                      vector_index=vector_index,
+                                      feats_row=-1)  # Features table currently not implemented
+        self._entry_index[entity_hash] = new_index
 
         return entity_hash
 
+    cpdef set_entities(self, entity_list, prob_list, vector_list):
+        if len(entity_list) != len(prob_list) or len(entity_list) != len(vector_list):
+            raise ValueError(Errors.E140)
+
+        nr_entities = len(entity_list)
+        self._entry_index = PreshMap(nr_entities+1)
+        self._entries = entry_vec(nr_entities+1)
+
+        i = 0
+        cdef KBEntryC entry
+        while i < nr_entities:
+            entity_vector = vector_list[i]
+            if len(entity_vector) != self.entity_vector_length:
+                raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
+
+            entity_hash = self.vocab.strings.add(entity_list[i])
+            entry.entity_hash = entity_hash
+            entry.prob = prob_list[i]
+
+            vector_index = self.c_add_vector(entity_vector=vector_list[i])
+            entry.vector_index = vector_index
+
+            entry.feats_row = -1   # Features table currently not implemented
+
+            self._entries[i+1] = entry
+            self._entry_index[entity_hash] = i+1
+
+            i += 1
+
     def add_alias(self, unicode alias, entities, probabilities):
         """
         For a given alias, add its potential entities and prior probabilies to the KB.
         Return the alias_hash at the end
         """
-
         # Throw an error if the length of entities and probabilities are not the same
         if not len(entities) == len(probabilities):
             raise ValueError(Errors.E132.format(alias=alias,
                                                 entities_length=len(entities),
                                                 probabilities_length=len(probabilities)))
 
-        # Throw an error if the probabilities sum up to more than 1
+        # Throw an error if the probabilities sum up to more than 1 (allow for some rounding errors)
         prob_sum = sum(probabilities)
-        if prob_sum > 1:
+        if prob_sum > 1.00001:
             raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum))
 
         cdef hash_t alias_hash = self.vocab.strings.add(alias)
 
-        # Return if this alias was added before
+        # Check whether this alias was added before
         if alias_hash in self._alias_index:
             user_warning(Warnings.W017.format(alias=alias))
             return
 
-        cdef hash_t entity_hash
-
         cdef vector[int64_t] entry_indices
         cdef vector[float] probs
 
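The add_alias hunk continues in the next hunk; together with add_entity and get_candidates, the Python-facing API added here can be exercised roughly as below. This is a usage sketch that assumes a spaCy build containing these changes; the entity IDs, alias, priors and 3-dimensional vectors are invented example values:

    from spacy.kb import KnowledgeBase
    from spacy.vocab import Vocab

    vocab = Vocab()
    kb = KnowledgeBase(vocab, entity_vector_length=3)

    # every entity carries a frequency/probability and a fixed-length vector
    kb.add_entity(entity="Q42", prob=0.8, entity_vector=[0.0, 1.0, 0.0])
    kb.add_entity(entity="Q1004791", prob=0.2, entity_vector=[1.0, 0.0, 0.0])

    # an alias maps a surface form to candidate entities with prior probabilities
    kb.add_alias(alias="Douglas", entities=["Q42", "Q1004791"], probabilities=[0.7, 0.2])

    for cand in kb.get_candidates("Douglas"):
        print(cand.entity_, cand.entity_freq, cand.prior_prob, cand.entity_vector)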
@@ -112,20 +184,295 @@ cdef class KnowledgeBase:
             entry_indices.push_back(int(entry_index))
             probs.push_back(float(prob))
 
-        self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
+        new_index = self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
+        self._alias_index[alias_hash] = new_index
 
         return alias_hash
 
-
     def get_candidates(self, unicode alias):
-        """ TODO: where to put this functionality ?"""
         cdef hash_t alias_hash = self.vocab.strings[alias]
         alias_index = <int64_t>self._alias_index.get(alias_hash)
         alias_entry = self._aliases_table[alias_index]
 
         return [Candidate(kb=self,
                           entity_hash=self._entries[entry_index].entity_hash,
+                          entity_freq=self._entries[entry_index].prob,
+                          entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
                           alias_hash=alias_hash,
                           prior_prob=prob)
                 for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs)
                 if entry_index != 0]
+
+    def dump(self, loc):
+        cdef Writer writer = Writer(loc)
+        writer.write_header(self.get_size_entities(), self.entity_vector_length)
+
+        # dumping the entity vectors in their original order
+        i = 0
+        for entity_vector in self._vectors_table:
+            for element in entity_vector:
+                writer.write_vector_element(element)
+            i = i+1
+
+        # dumping the entry records in the order in which they are in the _entries vector.
+        # index 0 is a dummy object not stored in the _entry_index and can be ignored.
+        i = 1
+        for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]):
+            entry = self._entries[entry_index]
+            assert entry.entity_hash == entry_hash
+            assert entry_index == i
+            writer.write_entry(entry.entity_hash, entry.prob, entry.vector_index)
+            i = i+1
+
+        writer.write_alias_length(self.get_size_aliases())
+
+        # dumping the aliases in the order in which they are in the _alias_index vector.
+        # index 0 is a dummy object not stored in the _aliases_table and can be ignored.
+        i = 1
+        for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]):
+            alias = self._aliases_table[alias_index]
+            assert alias_index == i
+
+            candidate_length = len(alias.entry_indices)
+            writer.write_alias_header(alias_hash, candidate_length)
+
+            for j in range(0, candidate_length):
+                writer.write_alias(alias.entry_indices[j], alias.probs[j])
+
+            i = i+1
+
+        writer.close()
+
+    cpdef load_bulk(self, loc):
+        cdef hash_t entity_hash
+        cdef hash_t alias_hash
+        cdef int64_t entry_index
+        cdef float prob
+        cdef int32_t vector_index
+        cdef KBEntryC entry
+        cdef AliasC alias
+        cdef float vector_element
+
+        cdef Reader reader = Reader(loc)
+
+        # STEP 0: load header and initialize KB
+        cdef int64_t nr_entities
+        cdef int64_t entity_vector_length
+        reader.read_header(&nr_entities, &entity_vector_length)
+
+        self.entity_vector_length = entity_vector_length
+        self._entry_index = PreshMap(nr_entities+1)
+        self._entries = entry_vec(nr_entities+1)
+        self._vectors_table = float_matrix(nr_entities+1)
+
+        # STEP 1: load entity vectors
+        cdef int i = 0
+        cdef int j = 0
+        while i < nr_entities:
+            entity_vector = float_vec(entity_vector_length)
+            j = 0
+            while j < entity_vector_length:
+                reader.read_vector_element(&vector_element)
+                entity_vector[j] = vector_element
+                j = j+1
+            self._vectors_table[i] = entity_vector
+            i = i+1
+
+        # STEP 2: load entities
+        # we assume that the entity data was written in sequence
+        # index 0 is a dummy object not stored in the _entry_index and can be ignored.
+        i = 1
+        while i <= nr_entities:
+            reader.read_entry(&entity_hash, &prob, &vector_index)
+
+            entry.entity_hash = entity_hash
+            entry.prob = prob
+            entry.vector_index = vector_index
+            entry.feats_row = -1    # Features table currently not implemented
+
+            self._entries[i] = entry
+            self._entry_index[entity_hash] = i
+
+            i += 1
+
+        # check that all entities were read in properly
+        assert nr_entities == self.get_size_entities()
+
+        # STEP 3: load aliases
+
+        cdef int64_t nr_aliases
+        reader.read_alias_length(&nr_aliases)
+        self._alias_index = PreshMap(nr_aliases+1)
+        self._aliases_table = alias_vec(nr_aliases+1)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        cdef int64_t nr_candidates
 | 
				
			||||||
 | 
					        cdef vector[int64_t] entry_indices
 | 
				
			||||||
 | 
					        cdef vector[float] probs
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        i = 1
 | 
				
			||||||
 | 
					        # we assume the alias data was written in sequence
 | 
				
			||||||
 | 
					        # index 0 is a dummy object not stored in the _entry_index and can be ignored.
 | 
				
			||||||
 | 
					        while i <= nr_aliases:
 | 
				
			||||||
 | 
					            reader.read_alias_header(&alias_hash, &nr_candidates)
 | 
				
			||||||
 | 
					            entry_indices = vector[int64_t](nr_candidates)
 | 
				
			||||||
 | 
					            probs = vector[float](nr_candidates)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            for j in range(0, nr_candidates):
 | 
				
			||||||
 | 
					                reader.read_alias(&entry_index, &prob)
 | 
				
			||||||
 | 
					                entry_indices[j] = entry_index
 | 
				
			||||||
 | 
					                probs[j] = prob
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            alias.entry_indices = entry_indices
 | 
				
			||||||
 | 
					            alias.probs = probs
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            self._aliases_table[i] = alias
 | 
				
			||||||
 | 
					            self._alias_index[alias_hash] = i
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            i += 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # check that all aliases were read in properly
 | 
				
			||||||
 | 
					        assert nr_aliases == self.get_size_aliases()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					cdef class Writer:
 | 
				
			||||||
 | 
					    def __init__(self, object loc):
 | 
				
			||||||
 | 
					        if path.exists(loc):
 | 
				
			||||||
 | 
					            assert not path.isdir(loc), "%s is directory." % loc
 | 
				
			||||||
 | 
					        if isinstance(loc, Path):
 | 
				
			||||||
 | 
					            loc = bytes(loc)
 | 
				
			||||||
 | 
					        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
 | 
				
			||||||
 | 
					        self._fp = fopen(<char*>bytes_loc, 'wb')
 | 
				
			||||||
 | 
					        assert self._fp != NULL
 | 
				
			||||||
 | 
					        fseek(self._fp, 0, 0)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    def close(self):
 | 
				
			||||||
 | 
					        cdef size_t status = fclose(self._fp)
 | 
				
			||||||
 | 
					        assert status == 0
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1:
 | 
				
			||||||
 | 
					        self._write(&nr_entries, sizeof(nr_entries))
 | 
				
			||||||
 | 
					        self._write(&entity_vector_length, sizeof(entity_vector_length))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_vector_element(self, float element) except -1:
 | 
				
			||||||
 | 
					        self._write(&element, sizeof(element))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1:
 | 
				
			||||||
 | 
					        self._write(&entry_hash, sizeof(entry_hash))
 | 
				
			||||||
 | 
					        self._write(&entry_prob, sizeof(entry_prob))
 | 
				
			||||||
 | 
					        self._write(&vector_index, sizeof(vector_index))
 | 
				
			||||||
 | 
					        # Features table currently not implemented and not written to file
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_alias_length(self, int64_t alias_length) except -1:
 | 
				
			||||||
 | 
					        self._write(&alias_length, sizeof(alias_length))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1:
 | 
				
			||||||
 | 
					        self._write(&alias_hash, sizeof(alias_hash))
 | 
				
			||||||
 | 
					        self._write(&candidate_length, sizeof(candidate_length))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int write_alias(self, int64_t entry_index, float prob) except -1:
 | 
				
			||||||
 | 
					        self._write(&entry_index, sizeof(entry_index))
 | 
				
			||||||
 | 
					        self._write(&prob, sizeof(prob))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int _write(self, void* value, size_t size) except -1:
 | 
				
			||||||
 | 
					        status = fwrite(value, size, 1, self._fp)
 | 
				
			||||||
 | 
					        assert status == 1, status
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					cdef class Reader:
 | 
				
			||||||
 | 
					    def __init__(self, object loc):
 | 
				
			||||||
 | 
					        assert path.exists(loc)
 | 
				
			||||||
 | 
					        assert not path.isdir(loc)
 | 
				
			||||||
 | 
					        if isinstance(loc, Path):
 | 
				
			||||||
 | 
					            loc = bytes(loc)
 | 
				
			||||||
 | 
					        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
 | 
				
			||||||
 | 
					        self._fp = fopen(<char*>bytes_loc, 'rb')
 | 
				
			||||||
 | 
					        if not self._fp:
 | 
				
			||||||
 | 
					            PyErr_SetFromErrno(IOError)
 | 
				
			||||||
 | 
					        status = fseek(self._fp, 0, 0)  # this can be 0 if there is no header
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    def __dealloc__(self):
 | 
				
			||||||
 | 
					        fclose(self._fp)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1:
 | 
				
			||||||
 | 
					        status = self._read(nr_entries, sizeof(int64_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading header from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        status = self._read(entity_vector_length, sizeof(int64_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading header from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_vector_element(self, float* element) except -1:
 | 
				
			||||||
 | 
					        status = self._read(element, sizeof(float))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading entity vector from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1:
 | 
				
			||||||
 | 
					        status = self._read(entity_hash, sizeof(hash_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading entity hash from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        status = self._read(prob, sizeof(float))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading entity prob from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        status = self._read(vector_index, sizeof(int32_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading entity vector from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        if feof(self._fp):
 | 
				
			||||||
 | 
					            return 0
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            return 1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_alias_length(self, int64_t* alias_length) except -1:
 | 
				
			||||||
 | 
					        status = self._read(alias_length, sizeof(int64_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading alias length from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1:
 | 
				
			||||||
 | 
					        status = self._read(alias_hash, sizeof(hash_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading alias hash from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        status = self._read(candidate_length, sizeof(int64_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading candidate length from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1:
 | 
				
			||||||
 | 
					        status = self._read(entry_index, sizeof(int64_t))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading entry index for alias from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        status = self._read(prob, sizeof(float))
 | 
				
			||||||
 | 
					        if status < 1:
 | 
				
			||||||
 | 
					            if feof(self._fp):
 | 
				
			||||||
 | 
					                return 0  # end of file
 | 
				
			||||||
 | 
					            raise IOError("error reading prob for entity/alias from input file")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    cdef int _read(self, void* value, size_t size) except -1:
 | 
				
			||||||
 | 
					        status = fread(value, size, 1, self._fp)
 | 
				
			||||||
 | 
					        return status
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
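Editor's note: the dump()/load_bulk() pair above gives the knowledge base a simple binary on-disk format (header, entity vectors, entries, aliases, in that fixed order). A minimal round-trip sketch follows, assuming the spaCy 2.1-era KnowledgeBase constructor; the keyword names used for add_entity/add_alias and the file path are assumptions for illustration, not taken from this diff.

# Round-trip sketch, assuming the 2.1-era KnowledgeBase API.
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", prob=0.5, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])

kb.dump("/tmp/kb.bin")            # writes header, vectors, entries, aliases

kb2 = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb2.load_bulk("/tmp/kb.bin")      # reads them back in the same order
assert kb2.get_size_entities() == 1
assert kb2.get_size_aliases() == 1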
@@ -9,6 +9,8 @@ _bengali = r"\u0980-\u09FF"

 _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"

+_hindi = r"\u0900-\u097F"
+
 # Latin standard
 _latin_u_standard = r"A-Z"
 _latin_l_standard = r"a-z"
@@ -193,7 +195,7 @@ _ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
 _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
 _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower

-_uncased = _bengali + _hebrew + _persian + _sinhala
+_uncased = _bengali + _hebrew + _persian + _sinhala + _hindi

 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
 ALPHA_LOWER = group_chars(_lower + _uncased)
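Editor's note: the practical effect of adding the Devanagari range to _uncased is that the ALPHA character class used by the tokenizer's punctuation rules now treats Hindi text as alphabetic. A small check, assuming a spaCy checkout that contains this change:

# Devanagari characters now fall inside the ALPHA group.
import re
from spacy.lang.char_classes import ALPHA

alpha_re = re.compile(r"[{a}]".format(a=ALPHA))
assert alpha_re.search("नमस्ते")      # Hindi text matches as alphabetic
assert not alpha_re.search("1234")   # digits still do not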
@@ -5,7 +5,7 @@ from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.

->>> from spacy.lang.en.examples import sentences
+>>> from spacy.lang.id.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
spacy/lang/ko/__init__.py (Normal file, 120 lines)
@@ -0,0 +1,120 @@
# encoding: utf8
from __future__ import unicode_literals, print_function

import re
import sys


from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...compat import copy_reg
from ...util import DummyTokenizer
from ...compat import is_python3, is_python_pre_3_5

is_python_post_3_7 = is_python3 and sys.version_info[1] >= 7

# fmt: off
if is_python_pre_3_5:
    from collections import namedtuple
    Morpheme = namedtuple("Morpheme", "surface lemma tag")
elif is_python_post_3_7:
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Morpheme:
        surface: str
        lemma: str
        tag: str
else:
    from typing import NamedTuple

    class Morpheme(NamedTuple):
        surface: str
        lemma: str
        tag: str


def try_mecab_import():
    try:
        from natto import MeCab
        return MeCab
    except ImportError:
        raise ImportError(
            "Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
            "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
            "and [natto-py](https://github.com/buruzaemon/natto-py)"
        )
# fmt: on


def check_spaces(text, tokens):
    token_pattern = re.compile(r"\s?".join(f"({t})" for t in tokens))
    m = token_pattern.match(text)
    if m is not None:
        for i in range(1, m.lastindex):
            yield m.end(i) < m.start(i + 1)
        yield False


class KoreanTokenizer(DummyTokenizer):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
        self.Tokenizer = try_mecab_import()

    def __call__(self, text):
        dtokens = list(self.detailed_tokens(text))
        surfaces = [dt.surface for dt in dtokens]
        doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces)))
        for token, dtoken in zip(doc, dtokens):
            first_tag, sep, eomi_tags = dtoken.tag.partition("+")
            token.tag_ = first_tag  # stem (어간) or pre-final (선어말 어미)
            token.lemma_ = dtoken.lemma
        doc.user_data["full_tags"] = [dt.tag for dt in dtokens]
        return doc

    def detailed_tokens(self, text):
        # mecab-ko feature fields: POS tag [0], semantic class [1], jongseong [2],
        # reading [3], type [4], start POS [5], end POS [6], expression [7], *
        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
            for node in tokenizer.parse(text, as_nodes=True):
                if node.is_eos():
                    break
                surface = node.surface
                feature = node.feature
                tag, _, expr = feature.partition(",")
                lemma, _, remainder = expr.partition("/")
                if lemma == "*":
                    lemma = surface
                yield Morpheme(surface, lemma, tag)


class KoreanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda _text: "ko"
    stop_words = STOP_WORDS
    tag_map = TAG_MAP
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}

    @classmethod
    def create_tokenizer(cls, nlp=None):
        return KoreanTokenizer(cls, nlp)


class Korean(Language):
    lang = "ko"
    Defaults = KoreanDefaults

    def make_doc(self, text):
        return self.tokenizer(text)


def pickle_korean(instance):
    return Korean, tuple()


copy_reg.pickle(Korean, pickle_korean)

__all__ = ["Korean"]
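Editor's note: a minimal usage sketch for the new Korean support. It assumes mecab-ko, mecab-ko-dic and natto-py are installed (see try_mecab_import above); without them the tokenizer raises ImportError.

import spacy

nlp = spacy.blank("ko")                 # builds a Korean() pipeline with no components
doc = nlp("런던은 영국의 수도입니다.")
for token in doc:
    # tag_ holds the first mecab-ko POS tag, lemma_ the recovered stem
    print(token.text, token.tag_, token.lemma_)
print(doc.user_data["full_tags"])       # the unsplit mecab-ko tags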
							
								
								
									
spacy/lang/ko/examples.py (Normal file, 15 lines)
@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ko.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
    "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
    "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
    "런던은 영국의 수도이자 가장 큰 도시입니다."
]
spacy/lang/ko/stop_words.py (Normal file, 68 lines)
@@ -0,0 +1,68 @@
# coding: utf8
from __future__ import unicode_literals

STOP_WORDS = set("""
이 있 하 것 들 그 되 수 이 보 않 없 나 주 아니 등 같 때 년 가 한
지 오 말 일 그렇 위하 때문 그것 두 말하 알 그러나 받 못하 일 그런 또 더 많
그리고 좋 크 시키 그러 하나 살 데 안 어떤 번 나 다른 어떻 들 이렇 점 싶 말
좀 원 잘 놓
""".split())
spacy/lang/ko/tag_map.py (Normal file, 59 lines)
@@ -0,0 +1,59 @@
# encoding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON
from ...symbols import VERB, ADV, PROPN, NUM, DET

# Maps the mecab-ko-dic (eunjeon) POS tags to Universal POS tags
# https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265
# https://universaldependencies.org/u/pos/
TAG_MAP = {
    # J.{1,2} particles
    "JKS": {POS: ADP},
    "JKC": {POS: ADP},
    "JKG": {POS: ADP},
    "JKO": {POS: ADP},
    "JKB": {POS: ADP},
    "JKV": {POS: ADP},
    "JKQ": {POS: ADP},
    "JX": {POS: ADP},  # auxiliary particle
    "JC": {POS: CONJ},  # conjunctive particle
    "MAJ": {POS: CONJ},  # conjunctive adverb
    "MAG": {POS: ADV},  # general adverb
    "MM": {POS: DET},  # determiner
    "XPN": {POS: X},  # prefix
    # XS. suffixes
    "XSN": {POS: X},
    "XSV": {POS: X},
    "XSA": {POS: X},
    "XR": {POS: X},  # root
    # E.{1,2} endings
    "EP": {POS: X},
    "EF": {POS: X},
    "EC": {POS: X},
    "ETN": {POS: X},
    "ETM": {POS: X},
    "IC": {POS: INTJ},  # interjection
    "VV": {POS: VERB},  # verb
    "VA": {POS: ADJ},  # adjective
    "VX": {POS: AUX},  # auxiliary predicate
    "VCP": {POS: ADP},  # positive copula (이다)
    "VCN": {POS: ADJ},  # negative copula (아니다)
    "NNG": {POS: NOUN},  # general noun
    "NNB": {POS: NOUN},  # dependent noun
    "NNBC": {POS: NOUN},  # dependent noun (unit)
    "NNP": {POS: PROPN},  # proper noun
    "NP": {POS: PRON},  # pronoun
    "NR": {POS: NUM},  # numerals
    "SN": {POS: NUM},  # numbers
    # S.{1,2} symbols
    # sentence punctuation
    "SF": {POS: PUNCT},  # period or other EOS marker
    "SE": {POS: PUNCT},
    "SC": {POS: PUNCT},  # comma, etc.
    "SSO": {POS: PUNCT},  # open bracket
    "SSC": {POS: PUNCT},  # close bracket
    "SY": {POS: SYM},  # other symbols
    "SL": {POS: X},  # foreign language
    "SH": {POS: X},  # hanja (Chinese characters)
}
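Editor's note: a short sketch of what the tag map buys you. Once the Korean tokenizer assigns the mecab-ko tag to token.tag_, spaCy can expose the coarse Universal POS via token.pos_; the printed values in the comment are indicative, not a verbatim run.

import spacy

nlp = spacy.blank("ko")          # requires the mecab-ko dependencies noted above
doc = nlp("애플이 스타트업을 인수했다")
for token in doc:
    print(token.text, token.tag_, token.pos_)
# e.g. 애플 NNP PROPN, 이 JKS ADP, ...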
@@ -1,15 +1,37 @@
 # coding: utf8
 from __future__ import unicode_literals

+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .lemmatizer import LOOKUP
+from .morph_rules import MORPH_RULES
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
+
+
+def _return_lt(_):
+    return "lt"


 class LithuanianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters[LANG] = lambda text: "lt"
+    lex_attr_getters[LANG] = _return_lt
+    lex_attr_getters[NORM] = add_lookups(
+        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
+    )
+    lex_attr_getters.update(LEX_ATTRS)
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    morph_rules = MORPH_RULES
+    lemma_lookup = LOOKUP


 class Lithuanian(Language):
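Editor's note: a small sketch of what the expanded Lithuanian defaults provide. spacy.blank("lt") builds a pipeline-less Lithuanian() object; the new tokenizer exceptions keep abbreviations such as "dr." as single tokens, and the lookup table added here backs lemmatization. The example sentence is made up for illustration.

import spacy

nlp = spacy.blank("lt")
doc = nlp("Kada gimė dr. Jonas Basanavičius?")
print([token.text for token in doc])
# "dr." should survive as one token rather than being split at the period.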
							
								
								
									
spacy/lang/lt/examples.py (Normal file, 22 lines)
@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
    "Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
    "Vilniuje galvojama uždrausti naudoti skėčius",
    "Londonas yra didelis miestas Jungtinėje Karalystėje",
    "Kur tu?",
    "Kas yra Prancūzijos prezidentas?",
    "Kokia yra Jungtinių Amerikos Valstijų sostinė?",
    "Kada gimė Dalia Grybauskaitė?",
]

spacy/lang/lt/lemmatizer.py (Normal file, 234227 lines)
    File diff suppressed because it is too large

spacy/lang/lt/lex_attrs.py (Normal file, 1153 lines)
    File diff suppressed because it is too large

spacy/lang/lt/morph_rules.py (Normal file, 3075 lines)
    File diff suppressed because it is too large

spacy/lang/lt/tag_map.py (Normal file, 4798 lines)
    File diff suppressed because it is too large
spacy/lang/lt/tokenizer_exceptions.py (Normal file, 268 lines)
@@ -0,0 +1,268 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH

_exc = {}

for orth in [
    "G.", "J. E.", "J. Em.", "J.E.", "J.Em.", "K.", "N.", "V.", "Vt.", "a.",
    "a.k.", "a.s.", "adv.", "akad.", "aklg.", "akt.", "al.", "ang.", "angl.", "aps.",
    "apskr.", "apyg.", "arbat.", "asist.", "asm.", "asm.k.", "asmv.", "atk.", "atsak.", "atsisk.",
    "atsisk.sąsk.", "atv.", "aut.", "avd.", "b.k.", "baud.", "biol.", "bkl.", "bot.", "bt.",
    "buv.", "ch.", "chem.", "corp.", "d.", "dab.", "dail.", "dek.", "deš.", "dir.",
    "dirig.", "doc.", "dol.", "dr.", "drp.", "dvit.", "dėst.", "dš.", "dž.", "e.b.",
    "e.bankas", "e.p.", "e.parašas", "e.paštas", "e.v.", "e.valdžia", "egz.", "eil.", "ekon.", "el.",
    "el.bankas", "el.p.", "el.parašas", "el.paštas", "el.valdžia", "etc.", "ež.", "fak.", "faks.", "feat.",
    "filol.", "filos.", "g.", "gen.", "geol.", "gerb.", "gim.", "gr.", "gv.", "gyd.",
    "gyv.", "habil.", "inc.", "insp.", "inž.", "ir pan.", "ir t. t.", "isp.", "istor.", "it.",
    "just.", "k.", "k. a.", "k.a.", "kab.", "kand.", "kart.", "kat.", "ketv.", "kh.",
    "kl.", "kln.", "km.", "kn.", "koresp.", "kpt.", "kr.", "kt.", "kub.", "kun.",
    "kv.", "kyš.", "l. e. p.", "l.e.p.", "lenk.", "liet.", "lot.", "lt.", "ltd.", "ltn.",
    "m.", "m.e..", "m.m.", "mat.", "med.", "mgnt.", "mgr.", "min.", "mjr.", "ml.",
    "mln.", "mlrd.", "mob.", "mok.", "moksl.", "mokyt.", "mot.", "mr.", "mst.", "mstl.",
    "mėn.", "nkt.", "no.", "nr.", "ntk.", "nuotr.", "op.", "org.", "orig.", "p.",
    "p.d.", "p.m.e.", "p.s.", "pab.", "pan.", "past.", "pav.", "pavad.", "per.", "perd.",
    "pirm.", "pl.", "plg.", "plk.", "pr.", "pr.Kr.", "pranc.", "proc.", "prof.", "prom.",
    "prot.", "psl.", "pss.", "pvz.", "pšt.", "r.", "raj.", "red.", "rez.", "rež.",
    "rus.", "rš.", "s.", "sav.", "saviv.", "sek.", "sekr.", "sen.", "sh.", "sk.",
    "skg.", "skv.", "skyr.", "sp.", "spec.", "sr.", "st.", "str.", "stud.", "sąs.",
    "t.", "t. p.", "t. y.", "t.p.", "t.t.", "t.y.", "techn.", "tel.", "teol.", "th.",
    "tir.", "trit.", "trln.", "tšk.", "tūks.", "tūkst.", "up.", "upl.", "v.s.", "vad.",
    "val.", "valg.", "ved.", "vert.", "vet.", "vid.", "virš.", "vlsč.", "vnt.", "vok.",
    "vs.", "vtv.", "vv.", "vyr.", "vyresn.", "zool.", "Įn", "įl.", "š.m.", "šnek.",
    "šv.", "švč.", "ž.ū.", "žin.", "žml.", "žr.",
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
@@ -22,6 +22,7 @@ NOUN_RULES = [
 VERB_RULES = [
     ["er", "e"],  # vasker -> vaske
     ["et", "e"],  # vasket -> vaske
+    ["a", "e"],  # vaska -> vaske
     ["es", "e"],  # vaskes -> vaske
     ["te", "e"],  # stekte -> steke
     ["år", "å"],  # får -> få
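Editor's note: a simplified sketch (not spaCy's actual lemmatizer code) of how [suffix, replacement] rules like the list above are applied: try each rule in order and, if the word ends in the suffix, swap it for the replacement.

VERB_RULES = [["er", "e"], ["et", "e"], ["a", "e"], ["es", "e"], ["te", "e"], ["år", "å"]]

def apply_rules(word, rules):
    for suffix, repl in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(apply_rules("vaska", VERB_RULES))   # vaske (handled by the new ["a", "e"] rule)
print(apply_rules("vasker", VERB_RULES))  # vaske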
@@ -10,7 +10,15 @@ _exc = {}
 for exc_data in [
     {ORTH: "jan.", LEMMA: "januar"},
     {ORTH: "feb.", LEMMA: "februar"},
+    {ORTH: "mar.", LEMMA: "mars"},
+    {ORTH: "apr.", LEMMA: "april"},
+    {ORTH: "jun.", LEMMA: "juni"},
     {ORTH: "jul.", LEMMA: "juli"},
+    {ORTH: "aug.", LEMMA: "august"},
+    {ORTH: "sep.", LEMMA: "september"},
+    {ORTH: "okt.", LEMMA: "oktober"},
+    {ORTH: "nov.", LEMMA: "november"},
+    {ORTH: "des.", LEMMA: "desember"},
 ]:
     _exc[exc_data[ORTH]] = [exc_data]

@@ -18,11 +26,13 @@ for exc_data in [
 for orth in [
     "adm.dir.",
     "a.m.",
+    "andelsnr",
     "Aq.",
     "b.c.",
     "bl.a.",
     "bla.",
     "bm.",
+    "bnr.",
     "bto.",
     "ca.",
     "cand.mag.",
@@ -41,6 +51,7 @@ for orth in [
     "el.",
     "e.l.",
     "et.",
+    "etc.",
     "etg.",
     "ev.",
     "evt.",
@@ -76,6 +87,7 @@ for orth in [
     "kgl.res.",
     "kl.",
     "komm.",
+    "kr.",
     "kst.",
     "lø.",
     "ma.",
@@ -106,6 +118,7 @@ for orth in [
     "o.l.",
     "on.",
     "op.",
+    "org.",
     "osv.",
     "ovf.",
     "p.",
@@ -130,6 +143,7 @@ for orth in [
     "sep.",
     "siviling.",
     "sms.",
+    "snr.",
     "spm.",
     "sr.",
     "sst.",
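Editor's note: these hunks appear to be the Norwegian Bokmål (nb) tokenizer exceptions; the file name is not visible above, so that attribution is an inference from the content. Assuming that, a quick sketch of their effect: month abbreviations stay single tokens instead of being split before the period, and the attached LEMMA gives the full month name.

import spacy

nlp = spacy.blank("nb")
doc = nlp("Vi sees i des. i Oslo.")
print([(t.text, t.lemma_) for t in doc])
# "des." should appear as one token, with lemma "desember" via the exception's LEMMA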
							
								
								
									
spacy/lang/sq/examples.py (Normal file, 18 lines)
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.sq.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
    "Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
    "San Francisko konsideron ndalimin e robotëve të shpërndarjes",
    "Londra është një qytet i madh në Mbretërinë e Bashkuar.",
]
@@ -262,13 +262,13 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,


 cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
+    # There have been a few bugs here.
     # The code was originally designed to always have pattern[1].attrs.value
     # be the ent_id when we get to the end of a pattern. However, Issue #2671
     # showed this wasn't the case when we had a reject-and-continue before a
-    # match. I still don't really understand what's going on here, but this
-    # workaround does resolve the issue.
-    while pattern.attrs.attr != ID and \
-            (pattern.nr_attr > 0 or pattern.nr_extra_attr > 0 or pattern.nr_py > 0):
+    # match.
+    # The patch to #2671 was wrong though, which came up in #3839.
+    while pattern.attrs.attr != ID:
         pattern += 1
     return pattern.attrs.value
@@ -1,15 +1,17 @@
 # coding: utf8
 from __future__ import unicode_literals
 
-from collections import defaultdict
+from collections import defaultdict, OrderedDict
 import srsly
 
 from ..errors import Errors
 from ..compat import basestring_
-from ..util import ensure_path
+from ..util import ensure_path, to_disk, from_disk
 from ..tokens import Span
 from ..matcher import Matcher, PhraseMatcher
 
+DEFAULT_ENT_ID_SEP = "||"
+
 
 class EntityRuler(object):
     """The EntityRuler lets you add spans to the `Doc.ents` using token-based
@@ -24,7 +26,7 @@ class EntityRuler(object):
 
     name = "entity_ruler"
 
-    def __init__(self, nlp, **cfg):
+    def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
         """Initialize the entitiy ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`
         key. A pattern can either be a token pattern (list) or a phrase pattern
@@ -32,6 +34,8 @@ class EntityRuler(object):
 
         nlp (Language): The shared nlp object to pass the vocab to the matchers
             and process phrase patterns.
+        phrase_matcher_attr (int / unicode): Token attribute to match on, passed
+            to the internal PhraseMatcher as `attr`
         patterns (iterable): Optional patterns to load in.
         overwrite_ents (bool): If existing entities are present, e.g. entities
             added by the model, overwrite them by matches if necessary.
@@ -47,8 +51,15 @@ class EntityRuler(object):
         self.token_patterns = defaultdict(list)
         self.phrase_patterns = defaultdict(list)
         self.matcher = Matcher(nlp.vocab)
-        self.phrase_matcher = PhraseMatcher(nlp.vocab)
-        self.ent_id_sep = cfg.get("ent_id_sep", "||")
+        if phrase_matcher_attr is not None:
+            self.phrase_matcher_attr = phrase_matcher_attr
+            self.phrase_matcher = PhraseMatcher(
+                nlp.vocab, attr=self.phrase_matcher_attr
+            )
+        else:
+            self.phrase_matcher_attr = None
+            self.phrase_matcher = PhraseMatcher(nlp.vocab)
+        self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
         patterns = cfg.get("patterns")
         if patterns is not None:
             self.add_patterns(patterns)
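A minimal usage sketch of the new phrase_matcher_attr argument (the wiring below is assumed, not taken from the diff): phrase patterns are matched on the LOWER attribute, so casing differences no longer block a match.

    # Sketch: assumes spaCy v2.1+ with EntityRuler exposed via spacy.pipeline.
    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    ruler.add_patterns([{"label": "ORG", "pattern": "apple inc"}])
    nlp.add_pipe(ruler)

    doc = nlp("She works at Apple Inc in Cupertino.")
    print([(ent.text, ent.label_) for ent in doc.ents])  # expected: [('Apple Inc', 'ORG')]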
@@ -212,8 +223,18 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#from_bytes
         """
-        patterns = srsly.msgpack_loads(patterns_bytes)
-        self.add_patterns(patterns)
+        cfg = srsly.msgpack_loads(patterns_bytes)
+        if isinstance(cfg, dict):
+            self.add_patterns(cfg.get("patterns", cfg))
+            self.overwrite = cfg.get("overwrite", False)
+            self.phrase_matcher_attr = cfg.get("phrase_matcher_attr", None)
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(
+                    self.nlp.vocab, attr=self.phrase_matcher_attr
+                )
+            self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+        else:
+            self.add_patterns(cfg)
         return self
 
     def to_bytes(self, **kwargs):
@@ -223,7 +244,16 @@ class EntityRuler(object):
 
         DOCS: https://spacy.io/api/entityruler#to_bytes
         """
-        return srsly.msgpack_dumps(self.patterns)
+
+        serial = OrderedDict(
+            (
+                ("overwrite", self.overwrite),
+                ("ent_id_sep", self.ent_id_sep),
+                ("phrase_matcher_attr", self.phrase_matcher_attr),
+                ("patterns", self.patterns),
+            )
+        )
+        return srsly.msgpack_dumps(serial)
 
     def from_disk(self, path, **kwargs):
         """Load the entity ruler from a file. Expects a file containing
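The new to_bytes payload is a dict carrying the config alongside the patterns, and from_bytes still accepts the old bare pattern list. A hedged round-trip sketch (names and patterns are illustrative):

    # Sketch: byte-level round trip under the new dict-based payload.
    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", overwrite_ents=True)
    ruler.add_patterns([{"label": "ORG", "pattern": [{"lower": "acme"}]}])

    data = ruler.to_bytes()  # msgpack dict: overwrite, ent_id_sep, phrase_matcher_attr, patterns
    restored = EntityRuler(nlp).from_bytes(data)
    assert restored.patterns == ruler.patterns
    assert restored.phrase_matcher_attr == "LOWER"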
@@ -236,21 +266,52 @@ class EntityRuler(object):
         DOCS: https://spacy.io/api/entityruler#from_disk
         """
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        patterns = srsly.read_jsonl(path)
-        self.add_patterns(patterns)
+        depr_patterns_path = path.with_suffix(".jsonl")
+        if depr_patterns_path.is_file():
+            patterns = srsly.read_jsonl(depr_patterns_path)
+            self.add_patterns(patterns)
+        else:
+            cfg = {}
+            deserializers = {
+                "patterns": lambda p: self.add_patterns(
+                    srsly.read_jsonl(p.with_suffix(".jsonl"))
+                ),
+                "cfg": lambda p: cfg.update(srsly.read_json(p)),
+            }
+            from_disk(path, deserializers, {})
+            self.overwrite = cfg.get("overwrite", False)
+            self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
+            self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(
+                    self.nlp.vocab, attr=self.phrase_matcher_attr
+                )
         return self
 
     def to_disk(self, path, **kwargs):
         """Save the entity ruler patterns to a directory. The patterns will be
         saved as newline-delimited JSON (JSONL).
 
-        path (unicode / Path): The JSONL file to load.
+        path (unicode / Path): The JSONL file to save.
         **kwargs: Other config paramters, mostly for consistency.
         RETURNS (EntityRuler): The loaded entity ruler.
 
         DOCS: https://spacy.io/api/entityruler#to_disk
         """
         path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        srsly.write_jsonl(path, self.patterns)
+        cfg = {
+            "overwrite": self.overwrite,
+            "phrase_matcher_attr": self.phrase_matcher_attr,
+            "ent_id_sep": self.ent_id_sep,
+        }
+        serializers = {
+            "patterns": lambda p: srsly.write_jsonl(
+                p.with_suffix(".jsonl"), self.patterns
+            ),
+            "cfg": lambda p: srsly.write_json(p, cfg),
+        }
+        if path.suffix == ".jsonl":  # user wants to save only JSONL
+            srsly.write_jsonl(path, self.patterns)
+        else:
+            to_disk(path, serializers, {})
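On disk the ruler now writes a directory containing patterns.jsonl plus a cfg file, while a path ending in .jsonl keeps the old patterns-only format. A hedged sketch of both paths (file locations are illustrative):

    # Sketch: directory-based vs. JSONL-only serialization.
    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    ruler.add_patterns([{"label": "GPE", "pattern": "new york"}])

    ruler.to_disk("/tmp/ruler")                 # directory: patterns.jsonl + cfg
    ruler.to_disk("/tmp/ruler_patterns.jsonl")  # single JSONL file, old behaviour

    restored = EntityRuler(nlp).from_disk("/tmp/ruler")
    assert restored.phrase_matcher_attr == "LOWER"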
@@ -3,16 +3,18 @@
 # coding: utf8
 from __future__ import unicode_literals
 
-cimport numpy as np
-
 import numpy
 import srsly
+import random
 from collections import OrderedDict
 from thinc.api import chain
 from thinc.v2v import Affine, Maxout, Softmax
 from thinc.misc import LayerNorm
-from thinc.neural.util import to_categorical, copy_array
+from thinc.neural.util import to_categorical
+from thinc.neural.util import get_array_module
 
+from spacy.kb import KnowledgeBase
+from ..cli.pretrain import get_cossim_loss
 from .functions import merge_subtokens
 from ..tokens.doc cimport Doc
 from ..syntax.nn_parser cimport Parser
@@ -24,9 +26,9 @@ from ..vocab cimport Vocab
 from ..syntax import nonproj
 from ..attrs import POS, ID
 from ..parts_of_speech import X
-from .._ml import Tok2Vec, build_tagger_model
+from .._ml import Tok2Vec, build_tagger_model, cosine
 from .._ml import build_text_classifier, build_simple_cnn_text_classifier
-from .._ml import build_bow_text_classifier
+from .._ml import build_bow_text_classifier, build_nel_encoder
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import masked_language_model, create_default_optimizer
 from ..errors import Errors, TempErrors
@@ -229,7 +231,7 @@ class Tensorizer(Pipe):
 
         vocab (Vocab): A `Vocab` instance. The model must share the same
             `Vocab` instance with the `Doc` objects it will process.
-        model (Model): A `Model` instance or `True` allocate one later.
+        model (Model): A `Model` instance or `True` to allocate one later.
         **cfg: Config parameters.
 
         EXAMPLE:
@@ -294,7 +296,7 @@ class Tensorizer(Pipe):
 
         docs (iterable): A batch of `Doc` objects.
         golds (iterable): A batch of `GoldParse` objects.
-        drop (float): The droput rate.
+        drop (float): The dropout rate.
         sgd (callable): An optimizer.
         RETURNS (dict): Results from the update.
         """
@@ -386,7 +388,7 @@ class Tagger(Pipe):
     def predict(self, docs):
         self.require_model()
         if not any(len(doc) for doc in docs):
-            # Handle case where there are no tokens in any docs.
+            # Handle cases where there are no tokens in any docs.
             n_labels = len(self.labels)
             guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs]
             tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO))
@@ -900,6 +902,11 @@ class TextCategorizer(Pipe):
     def labels(self):
         return tuple(self.cfg.setdefault("labels", []))
 
+    def require_labels(self):
+        """Raise an error if the component's model has no labels defined."""
+        if not self.labels:
+            raise ValueError(Errors.E143.format(name=self.name))
+
     @labels.setter
     def labels(self, value):
         self.cfg["labels"] = tuple(value)
@@ -929,6 +936,7 @@ class TextCategorizer(Pipe):
                 doc.cats[label] = float(scores[i, j])
 
     def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
+        self.require_model()
         scores, bp_scores = self.model.begin_update(docs, drop=drop)
         loss, d_scores = self.get_loss(docs, golds, scores)
         bp_scores(d_scores, sgd=sgd)
@@ -983,6 +991,7 @@ class TextCategorizer(Pipe):
     def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
         if self.model is True:
             self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
+            self.require_labels()
             self.model = self.Model(len(self.labels), **self.cfg)
             link_vectors_to_models(self.vocab)
         if sgd is None:
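The new require_labels() check means a TextCategorizer must have at least one label before begin_training() builds its model; otherwise Errors.E143 is raised. A short sketch of the expected workflow:

    # Sketch: add labels before begin_training(), or the new E143 error is raised.
    import spacy

    nlp = spacy.blank("en")
    textcat = nlp.create_pipe("textcat")
    nlp.add_pipe(textcat)
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")
    optimizer = nlp.begin_training()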
@@ -1001,7 +1010,7 @@ cdef class DependencyParser(Parser):
 
     @property
     def postprocesses(self):
-        return [nonproj.deprojectivize, merge_subtokens]
+        return [nonproj.deprojectivize]
 
     def add_multitask_objective(self, target):
         if target == "cloze":
@@ -1063,52 +1072,252 @@ cdef class EntityRecognizer(Parser):
 
 
 class EntityLinker(Pipe):
+    """Pipeline component for named entity linking.
+
+    DOCS: TODO
+    """
     name = 'entity_linker'
 
     @classmethod
-    def Model(cls, nr_class=1, **cfg):
-        # TODO: non-dummy EL implementation
-        return None
-
-    def __init__(self, model=True, **cfg):
-        self.model = False
+    def Model(cls, **cfg):
+        embed_width = cfg.get("embed_width", 300)
+        hidden_width = cfg.get("hidden_width", 128)
+        type_to_int = cfg.get("type_to_int", dict())
+
+        model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg)
+        return model
+
+    def __init__(self, vocab, **cfg):
+        self.vocab = vocab
+        self.model = True
+        self.kb = None
         self.cfg = dict(cfg)
-        self.kb = self.cfg["kb"]
+        self.sgd_context = None
+
+    def set_kb(self, kb):
+        self.kb = kb
+
+    def require_model(self):
+        # Raise an error if the component's model is not initialized.
+        if getattr(self, "model", None) in (None, True, False):
+            raise ValueError(Errors.E109.format(name=self.name))
+
+    def require_kb(self):
+        # Raise an error if the knowledge base is not initialized.
+        if getattr(self, "kb", None) in (None, True, False):
+            raise ValueError(Errors.E139.format(name=self.name))
+
+    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
+        self.require_kb()
+        self.cfg["entity_width"] = self.kb.entity_vector_length
+
+        if self.model is True:
+            self.model = self.Model(**self.cfg)
+            self.sgd_context = self.create_optimizer()
+
+        if sgd is None:
+            sgd = self.create_optimizer()
+
+        return sgd
+
+    def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
+        self.require_model()
+        self.require_kb()
+
+        if losses is not None:
+            losses.setdefault(self.name, 0.0)
+
+        if not docs or not golds:
+            return 0
+
+        if len(docs) != len(golds):
+            raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs),
+                                                n_golds=len(golds)))
+
+        if isinstance(docs, Doc):
+            docs = [docs]
+            golds = [golds]
+
+        context_docs = []
+        entity_encodings = []
+        cats = []
+        priors = []
+        type_vectors = []
+
+        type_to_int = self.cfg.get("type_to_int", dict())
+
+        for doc, gold in zip(docs, golds):
+            ents_by_offset = dict()
+            for ent in doc.ents:
+                ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
+            for entity in gold.links:
+                start, end, gold_kb = entity
+                mention = doc.text[start:end]
+
+                gold_ent = ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)]
+                assert gold_ent is not None
+                type_vector = [0 for i in range(len(type_to_int))]
+                if len(type_to_int) > 0:
+                    type_vector[type_to_int[gold_ent.label_]] = 1
+
+                candidates = self.kb.get_candidates(mention)
+                random.shuffle(candidates)
+                nr_neg = 0
+                for c in candidates:
+                    kb_id = c.entity_
+                    entity_encoding = c.entity_vector
+                    entity_encodings.append(entity_encoding)
+                    context_docs.append(doc)
+                    type_vectors.append(type_vector)
+
+                    if self.cfg.get("prior_weight", 1) > 0:
+                        priors.append([c.prior_prob])
+                    else:
+                        priors.append([0])
+
+                    if kb_id == gold_kb:
+                        cats.append([1])
+                    else:
+                        nr_neg += 1
+                        cats.append([0])
+
+        if len(entity_encodings) > 0:
+            assert len(priors) == len(entity_encodings) == len(context_docs) == len(cats) == len(type_vectors)
+
+            context_encodings, bp_context = self.model.tok2vec.begin_update(context_docs, drop=drop)
+            entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
+
+            mention_encodings = [list(context_encodings[i]) + list(entity_encodings[i]) + priors[i] + type_vectors[i]
+                                 for i in range(len(entity_encodings))]
+            pred, bp_mention = self.model.begin_update(self.model.ops.asarray(mention_encodings, dtype="float32"), drop=drop)
+            cats = self.model.ops.asarray(cats, dtype="float32")
+
+            loss, d_scores = self.get_loss(prediction=pred, golds=cats, docs=None)
+            mention_gradient = bp_mention(d_scores, sgd=sgd)
+
+            context_gradients = [list(x[0:self.cfg.get("context_width")]) for x in mention_gradient]
+            bp_context(self.model.ops.asarray(context_gradients, dtype="float32"), sgd=self.sgd_context)
+
+            if losses is not None:
+                losses[self.name] += loss
+            return loss
+        return 0
+
+    def get_loss(self, docs, golds, prediction):
+        d_scores = (prediction - golds)
+        loss = (d_scores ** 2).sum()
+        loss = loss / len(golds)
+        return loss, d_scores
+
+    def get_loss_old(self, docs, golds, scores):
+        # this loss function assumes we're only using positive examples
+        loss, gradients = get_cossim_loss(yh=scores, y=golds)
+        loss = loss / len(golds)
+        return loss, gradients
 
     def __call__(self, doc):
-        self.set_annotations([doc], scores=None, tensors=None)
+        entities, kb_ids = self.predict([doc])
+        self.set_annotations([doc], entities, kb_ids)
         return doc
 
     def pipe(self, stream, batch_size=128, n_threads=-1):
-        """Apply the pipe to a stream of documents.
-        Both __call__ and pipe should delegate to the `predict()`
-        and `set_annotations()` methods.
-        """
         for docs in util.minibatch(stream, size=batch_size):
             docs = list(docs)
-            self.set_annotations(docs, scores=None, tensors=None)
+            entities, kb_ids = self.predict(docs)
+            self.set_annotations(docs, entities, kb_ids)
             yield from docs
 
-    def set_annotations(self, docs, scores, tensors=None):
-        """
-        Currently implemented as taking the KB entry with highest prior probability for each named entity
-        TODO: actually use context etc
-        """
+    def predict(self, docs):
+        self.require_model()
+        self.require_kb()
+
+        final_entities = []
+        final_kb_ids = []
+
+        if not docs:
+            return final_entities, final_kb_ids
+
+        if isinstance(docs, Doc):
+            docs = [docs]
+
+        context_encodings = self.model.tok2vec(docs)
+        xp = get_array_module(context_encodings)
+
+        type_to_int = self.cfg.get("type_to_int", dict())
+
         for i, doc in enumerate(docs):
-            for ent in doc.ents:
-                candidates = self.kb.get_candidates(ent.text)
-                if candidates:
-                    best_candidate = max(candidates, key=lambda c: c.prior_prob)
-                    for token in ent:
-                        token.ent_kb_id_ = best_candidate.entity_
-
-    def get_loss(self, docs, golds, scores):
-        # TODO
-        pass
+            if len(doc) > 0:
+                context_encoding = context_encodings[i]
+                for ent in doc.ents:
+                    type_vector = [0 for i in range(len(type_to_int))]
+                    if len(type_to_int) > 0:
+                        type_vector[type_to_int[ent.label_]] = 1
+
+                    candidates = self.kb.get_candidates(ent.text)
+                    if candidates:
+                        random.shuffle(candidates)
+
+                        # this will set the prior probabilities to 0 (just like in training) if their weight is 0
+                        prior_probs = xp.asarray([[c.prior_prob] for c in candidates])
+                        prior_probs *= self.cfg.get("prior_weight", 1)
+                        scores = prior_probs
+
+                        if self.cfg.get("context_weight", 1) > 0:
+                            entity_encodings = xp.asarray([c.entity_vector for c in candidates])
+                            assert len(entity_encodings) == len(prior_probs)
+                            mention_encodings = [list(context_encoding) + list(entity_encodings[i])
+                                                 + list(prior_probs[i]) + type_vector
+                                                 for i in range(len(entity_encodings))]
+                            scores = self.model(self.model.ops.asarray(mention_encodings, dtype="float32"))
+
+                        # TODO: thresholding
+                        best_index = scores.argmax()
+                        best_candidate = candidates[best_index]
+                        final_entities.append(ent)
+                        final_kb_ids.append(best_candidate.entity_)
+
+        return final_entities, final_kb_ids
+
+    def set_annotations(self, docs, entities, kb_ids=None):
+        for entity, kb_id in zip(entities, kb_ids):
+            for token in entity:
+                token.ent_kb_id_ = kb_id
+
+    def to_disk(self, path, exclude=tuple(), **kwargs):
+        serialize = OrderedDict()
+        serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
+        serialize["vocab"] = lambda p: self.vocab.to_disk(p)
+        serialize["kb"] = lambda p: self.kb.dump(p)
+        if self.model not in (None, True, False):
+            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
+        util.to_disk(path, serialize, exclude)
+
+    def from_disk(self, path, exclude=tuple(), **kwargs):
+        def load_model(p):
+             if self.model is True:
+                self.model = self.Model(**self.cfg)
+             self.model.from_bytes(p.open("rb").read())
+
+        def load_kb(p):
+            kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"])
+            kb.load_bulk(p)
+            self.set_kb(kb)
+
+        deserialize = OrderedDict()
+        deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
+        deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
+        deserialize["kb"] = load_kb
+        deserialize["model"] = load_model
+        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
+        util.from_disk(path, deserialize, exclude)
+        return self
+
+    def rehearse(self, docs, sgd=None, losses=None, **config):
+        raise NotImplementedError
 
     def add_label(self, label):
-        # TODO
-        pass
+        raise NotImplementedError
 
 
 class Sentencizer(object):
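A rough usage sketch for the work-in-progress EntityLinker above, based only on the methods visible in this diff (set_kb, begin_training, __call__); the import path, model name and KB population step are assumptions:

    # Sketch: wiring a KnowledgeBase into the WIP EntityLinker.
    import spacy
    from spacy.kb import KnowledgeBase
    from spacy.pipeline import EntityLinker   # assumed import path

    nlp = spacy.load("en_core_web_sm")        # needs an NER component to produce doc.ents
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
    # ... populate the KB with entities and aliases here ...

    linker = EntityLinker(nlp.vocab, prior_weight=1, context_weight=1)
    linker.set_kb(kb)
    optimizer = linker.begin_training()

    doc = nlp("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.")
    doc = linker(doc)                         # sets token.ent_kb_id_ for linked entities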
@@ -52,6 +52,7 @@ class Scorer(object):
         self.labelled = PRFScore()
         self.tags = PRFScore()
         self.ner = PRFScore()
+        self.ner_per_ents = dict()
         self.eval_punct = eval_punct
 
     @property
@@ -91,6 +92,15 @@ class Scorer(object):
         """RETURNS (float): Named entity accuracy (F-score)."""
         return self.ner.fscore * 100
 
+    @property
+    def ents_per_type(self):
+        """RETURNS (dict): Scores per entity label.
+        """
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.ner_per_ents.items()
+        }
+
     @property
     def scores(self):
         """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
@@ -102,6 +112,7 @@ class Scorer(object):
             "ents_p": self.ents_p,
             "ents_r": self.ents_r,
             "ents_f": self.ents_f,
+            "ents_per_type": self.ents_per_type,
             "tags_acc": self.tags_acc,
             "token_acc": self.token_acc,
         }
@@ -149,13 +160,31 @@ class Scorer(object):
                     cand_deps.add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             cand_ents = set()
+            current_ent = {k.label_: set() for k in doc.ents}
+            current_gold = {k.label_: set() for k in doc.ents}
             for ent in doc.ents:
+                if ent.label_ not in self.ner_per_ents:
+                    self.ner_per_ents[ent.label_] = PRFScore()
                 first = gold.cand_to_gold[ent.start]
                 last = gold.cand_to_gold[ent.end - 1]
                 if first is None or last is None:
                     self.ner.fp += 1
+                    self.ner_per_ents[ent.label_].fp += 1
                 else:
                     cand_ents.add((ent.label_, first, last))
+                    current_ent[ent.label_].add(
+                        tuple(x for x in cand_ents if x[0] == ent.label_)
+                    )
+                    current_gold[ent.label_].add(
+                        tuple(x for x in gold_ents if x[0] == ent.label_)
+                    )
+            # Scores per ent
+            [
+                v.score_set(current_ent[k], current_gold[k])
+                for k, v in self.ner_per_ents.items()
+                if k in current_ent
+            ]
+            # Score for all ents
             self.ner.score_set(cand_ents, gold_ents)
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
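With the change above, per-label NER results are exposed through Scorer.scores under "ents_per_type", keyed by entity label with precision/recall/F percentages. A hedged sketch of reading them after evaluation:

    # Sketch: per-entity-type P/R/F after scoring an evaluation set.
    from spacy.scorer import Scorer

    scorer = Scorer()
    # ... call scorer.score(doc, gold) for each evaluated document ...
    for label, prf in scorer.scores["ents_per_type"].items():
        print(label, prf["p"], prf["r"], prf["f"])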
@@ -3,6 +3,10 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t
 from .typedefs cimport flags_t, attr_t, hash_t
 from .parts_of_speech cimport univ_pos_t
 
+from libcpp.vector cimport vector
+from libc.stdint cimport int32_t, int64_t
+
+
 
 cdef struct LexemeC:
     flags_t flags
@@ -72,3 +76,32 @@ cdef struct TokenC:
     attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth..
     attr_t ent_kb_id
     hash_t ent_id
+
+
+# Internal struct, for storage and disambiguation of entities.
+cdef struct KBEntryC:
+
+    # The hash of this entry's unique ID/name in the kB
+    hash_t entity_hash
+
+    # Allows retrieval of the entity vector, as an index into a vectors table of the KB.
+    # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
+    int32_t vector_index
+
+    # Allows retrieval of a struct of non-vector features.
+    # This is currently not implemented and set to -1 for the common case where there are no features.
+    int32_t feats_row
+
+    # log probability of entity, based on corpus frequency
+    float prob
+
+
+# Each alias struct stores a list of Entry pointers with their prior probabilities
+# for this specific mention/alias.
+cdef struct AliasC:
+
+    # All entry candidates for this alias
+    vector[int64_t] entry_indices
+
+    # Prior probability P(entity|alias) - should sum up to (at most) 1.
+    vector[float] probs
@@ -81,6 +81,7 @@ cdef enum symbol_t:
     DEP
     ENT_IOB
     ENT_TYPE
+    ENT_KB_ID
     HEAD
     SENT_START
     SPACY
@@ -86,6 +86,7 @@ IDS = {
     "DEP": DEP,
     "ENT_IOB": ENT_IOB,
     "ENT_TYPE": ENT_TYPE,
+    "ENT_KB_ID": ENT_KB_ID,
     "HEAD": HEAD,
     "SENT_START": SENT_START,
     "SPACY": SPACY,
@@ -124,6 +124,22 @@ def ja_tokenizer():
     return get_lang_class("ja").Defaults.create_tokenizer()
 
 
+@pytest.fixture(scope="session")
+def ko_tokenizer():
+    pytest.importorskip("natto")
+    return get_lang_class("ko").Defaults.create_tokenizer()
+
+
+@pytest.fixture(scope="session")
+def lt_tokenizer():
+    return get_lang_class("lt").Defaults.create_tokenizer()
+
+
+@pytest.fixture(scope="session")
+def lt_lemmatizer():
+    return get_lang_class("lt").Defaults.create_lemmatizer()
+
+
 @pytest.fixture(scope="session")
 def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()
spacy/tests/lang/ko/__init__.py (new file, 0 lines)
spacy/tests/lang/ko/test_lemmatization.py (new file, 12 lines)
@@ -0,0 +1,12 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize(
+    "word,lemma", [("새로운", "새롭"), ("빨간", "빨갛"), ("클수록", "크"), ("뭡니까", "뭣"), ("됐다", "되")]
+)
+def test_ko_lemmatizer_assigns(ko_tokenizer, word, lemma):
+    test_lemma = ko_tokenizer(word)[0].lemma_
+    assert test_lemma == lemma
spacy/tests/lang/ko/test_tokenizer.py (new file, 46 lines)
@@ -0,0 +1,46 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+# fmt: off
+TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
+                   ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
+
+TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
+              "NNP NNG NNG JKB VV EC VX EF SF"),
+             ("영등포구에 있는 맛집 좀 알려주세요.",
+              "NNP JKB VV ETM NNG MAG VV VX EP SF")]
+
+FULL_TAG_TESTS = [("영등포구에 있는 맛집 좀 알려주세요.",
+                   "NNP JKB VV ETM NNG MAG VV+EC VX EP+EF SF")]
+
+POS_TESTS = [("서울 타워 근처에 살고 있습니다.",
+              "PROPN NOUN NOUN ADP VERB X AUX X PUNCT"),
+             ("영등포구에 있는 맛집 좀 알려주세요.",
+              "PROPN ADP VERB X NOUN ADV VERB AUX X PUNCT")]
+# fmt: on
+
+
+@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
+def test_ko_tokenizer(ko_tokenizer, text, expected_tokens):
+    tokens = [token.text for token in ko_tokenizer(text)]
+    assert tokens == expected_tokens.split()
+
+
+@pytest.mark.parametrize("text,expected_tags", TAG_TESTS)
+def test_ko_tokenizer_tags(ko_tokenizer, text, expected_tags):
+    tags = [token.tag_ for token in ko_tokenizer(text)]
+    assert tags == expected_tags.split()
+
+
+@pytest.mark.parametrize("text,expected_tags", FULL_TAG_TESTS)
+def test_ko_tokenizer_full_tags(ko_tokenizer, text, expected_tags):
+    tags = ko_tokenizer(text).user_data["full_tags"]
+    assert tags == expected_tags.split()
+
+
+@pytest.mark.parametrize("text,expected_pos", POS_TESTS)
+def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
+    pos = [token.pos_ for token in ko_tokenizer(text)]
+    assert pos == expected_pos.split()
0  spacy/tests/lang/lt/__init__.py  Normal file
15  spacy/tests/lang/lt/test_lemmatizer.py  Normal file
@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize("tokens,lemmas", [
    (["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
      "sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
     ["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
      "apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
    (["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
     ["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
    assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]
56  spacy/tests/lang/lt/test_text.py  Normal file
@@ -0,0 +1,56 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_lt_tokenizer_handles_long_text(lt_tokenizer):
    text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
    tokens = lt_tokenizer(text)
    assert len(tokens) == 42


@pytest.mark.parametrize(
    "text,length",
    [
        (
            "177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
            15,
        ),
        (
            "ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
            16,
        ),
    ],
)
def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
    tokens = lt_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10,000", True),
        ("10,00", True),
        ("999.0", True),
        ("vienas", True),
        ("du", True),
        ("milijardas", True),
        ("šuo", False),
        (",", False),
        ("1/2", True),
    ],
)
def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
@@ -5,7 +5,6 @@ import pytest
 import re
 from spacy.matcher import Matcher, DependencyMatcher
 from spacy.tokens import Doc, Token
-from ..util import get_doc


 @pytest.fixture
@@ -288,24 +287,43 @@ def deps():
 def dependency_matcher(en_vocab):
     def is_brown_yellow(text):
         return bool(re.compile(r"brown|yellow|over").match(text))

     IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow)

     pattern1 = [
         {"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},"PATTERN": {"ORTH": "quick", "DEP": "amod"}},
-        {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, "PATTERN": {IS_BROWN_YELLOW: True}},
+        {
+            "SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
+            "PATTERN": {"ORTH": "quick", "DEP": "amod"},
+        },
+        {
+            "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
+            "PATTERN": {IS_BROWN_YELLOW: True},
+        },
     ]

     pattern2 = [
         {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}
+        {
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
+        {
+            "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
     ]

     pattern3 = [
         {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, "PATTERN": {"ORTH": "brown"}}
+        {
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
+        {
+            "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"},
+            "PATTERN": {"ORTH": "brown"},
+        },
     ]

     matcher = DependencyMatcher(en_vocab)
@@ -320,9 +338,9 @@ def test_dependency_matcher_compile(dependency_matcher):
     assert len(dependency_matcher) == 3


-def test_dependency_matcher(dependency_matcher, text, heads, deps):
-    doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
-    matches = dependency_matcher(doc)
+# def test_dependency_matcher(dependency_matcher, text, heads, deps):
+#     doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
+#     matches = dependency_matcher(doc)
 #     assert matches[0][1] == [[3, 1, 2]]
 #     assert matches[1][1] == [[4, 3, 3]]
 #     assert matches[2][1] == [[4, 3, 2]]
@@ -1,91 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.kb import KnowledgeBase
from spacy.lang.en import English


@pytest.fixture
def nlp():
    return English()


def test_kb_valid_entities(nlp):
    """Test the valid construction of a KB with 3 entities and two aliases"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2')
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])

    # test the size of the corresponding KB
    assert(mykb.get_size_entities() == 3)
    assert(mykb.get_size_aliases() == 2)


def test_kb_invalid_entities(nlp):
    """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because one of the given IDs is not valid
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2])


def test_kb_invalid_probabilities(nlp):
    """Test the invalid construction of a KB with wrong prior probabilities"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because the sum of the probabilities exceeds 1
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4])


def test_kb_invalid_combination(nlp):
    """Test the invalid construction of a KB with non-matching entity and probability lists"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because the entities and probabilities vectors are not of equal length
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1])


def test_candidate_generation(nlp):
    """Test correct candidate generation"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])

    # test the size of the relevant candidates
    assert(len(mykb.get_candidates(u'douglas')) == 2)
    assert(len(mykb.get_candidates(u'adam')) == 1)
    assert(len(mykb.get_candidates(u'shrubbery')) == 0)
145  spacy/tests/pipeline/test_entity_linker.py  Normal file
@@ -0,0 +1,145 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.kb import KnowledgeBase
from spacy.lang.en import English
from spacy.pipeline import EntityRuler


@pytest.fixture
def nlp():
    return English()


def test_kb_valid_entities(nlp):
    """Test the valid construction of a KB with 3 entities and two aliases"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.5, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])

    # adding aliases
    mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])

    # test the size of the corresponding KB
    assert(mykb.get_size_entities() == 3)
    assert(mykb.get_size_aliases() == 2)


def test_kb_invalid_entities(nlp):
    """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])

    # adding aliases - should fail because one of the given IDs is not valid
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q342'], probabilities=[0.8, 0.2])


def test_kb_invalid_probabilities(nlp):
    """Test the invalid construction of a KB with wrong prior probabilities"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])

    # adding aliases - should fail because the sum of the probabilities exceeds 1
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.4])


def test_kb_invalid_combination(nlp):
    """Test the invalid construction of a KB with non-matching entity and probability lists"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])

    # adding aliases - should fail because the entities and probabilities vectors are not of equal length
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.3, 0.4, 0.1])


def test_kb_invalid_entity_vector(nlp):
    """Test the invalid construction of a KB with non-matching entity vector lengths"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1, 2, 3])

    # this should fail because the kb's expected entity vector length is 3
    with pytest.raises(ValueError):
        mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])


def test_candidate_generation(nlp):
    """Test correct candidate generation"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])

    # adding aliases
    mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])

    # test the size of the relevant candidates
    assert(len(mykb.get_candidates('douglas')) == 2)
    assert(len(mykb.get_candidates('adam')) == 1)
    assert(len(mykb.get_candidates('shrubbery')) == 0)


def test_preserving_links_asdoc(nlp):
    """Test that Span.as_doc preserves the existing entity links"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.8, entity_vector=[1])

    # adding aliases
    mykb.add_alias(alias='Boston', entities=['Q1'], probabilities=[0.7])
    mykb.add_alias(alias='Denver', entities=['Q2'], probabilities=[0.6])

    # set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained)
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)

    ruler = EntityRuler(nlp)
    patterns = [{"label": "GPE", "pattern": "Boston"},
                {"label": "GPE", "pattern": "Denver"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
    el_pipe.set_kb(mykb)
    el_pipe.begin_training()
    el_pipe.context_weight = 0
    el_pipe.prior_weight = 1
    nlp.add_pipe(el_pipe, last=True)

    # test whether the entity links are preserved by the `as_doc()` function
    text = "She lives in Boston. He lives in Denver."
    doc = nlp(text)
    for ent in doc.ents:
        orig_text = ent.text
        orig_kb_id = ent.kb_id_
        sent_doc = ent.sent.as_doc()
        for s_ent in sent_doc.ents:
            if s_ent.text == orig_text:
                assert s_ent.kb_id_ == orig_kb_id
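
For reference, the KnowledgeBase API exercised by the tests above can be driven on its own roughly like this. This is a minimal sketch built only from the calls visible in this hunk (KnowledgeBase, add_entity, add_alias, get_candidates); the entity and alias IDs are illustrative, not part of the commit.

from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
# each entity carries a prior probability and a vector of the declared length
kb.add_entity(entity="Q2", prob=0.2, entity_vector=[2])
kb.add_entity(entity="Q3", prob=0.5, entity_vector=[3])
# an alias maps a surface form to candidate entities with prior probabilities
kb.add_alias(alias="douglas", entities=["Q2", "Q3"], probabilities=[0.8, 0.2])
assert len(kb.get_candidates("douglas")) == 2  # two candidates, as asserted above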
@@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
     assert len(new_ruler) == 0
     assert len(new_ruler.labels) == 0
     new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(patterns)
+    assert len(new_ruler.labels) == 4
+    assert len(new_ruler.patterns) == len(ruler.patterns)
+    for pattern in ruler.patterns:
+        assert pattern in new_ruler.patterns
+    assert sorted(new_ruler.labels) == sorted(ruler.labels)
+
+
+def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
+    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
     assert len(ruler) == len(patterns)
     assert len(ruler.labels) == 4
+    ruler_bytes = ruler.to_bytes()
+    new_ruler = EntityRuler(nlp)
+    assert len(new_ruler) == 0
+    assert len(new_ruler.labels) == 0
+    assert new_ruler.phrase_matcher_attr is None
+    new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(patterns)
+    assert len(new_ruler.labels) == 4
+    assert new_ruler.phrase_matcher_attr == "LOWER"
@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 import pytest
 import numpy
 from spacy.tokens import Doc
+from spacy.matcher import Matcher
 from spacy.displacy import render
 from spacy.gold import iob_to_biluo
 from spacy.lang.it import Italian

@@ -123,6 +124,15 @@ def test_issue2396(en_vocab):
     assert (span.get_lca_matrix() == matrix).all()


+def test_issue2464(en_vocab):
+    """Test problem with successive ?. This is the same bug, so putting it here."""
+    matcher = Matcher(en_vocab)
+    doc = Doc(en_vocab, words=["a", "b"])
+    matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
+    matches = matcher(doc)
+    assert len(matches) == 3
+
+
 def test_issue2482():
     """Test we can serialize and deserialize a blank NER or parser model."""
     nlp = Italian()
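
The "?" operator pattern added in test_issue2464 above can be reproduced outside the test suite roughly like this. A minimal sketch: the blank English pipeline and the rule name are illustrative, not part of the commit.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("a b")
matcher = Matcher(nlp.vocab)
# two optional tokens: "a", "b" and "a b" all match, giving three matches
matcher.add("OPTIONAL", None, [{"OP": "?"}, {"OP": "?"}])
matches = matcher(doc)
assert len(matches) == 3
for match_id, start, end in matches:
    print(doc[start:end].text)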
334  spacy/tests/regression/test_issue3001-3500.py  Normal file
@@ -0,0 +1,334 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.lang.en import English
from spacy.lang.de import German
from spacy.pipeline import EntityRuler, EntityRecognizer
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import ENT_IOB, ENT_TYPE
from spacy.compat import pickle, is_python2, unescape_unicode
from spacy import displacy
from spacy.util import decaying
import numpy
import re

from ..util import get_doc


def test_issue3002():
    """Test that the tokenizer doesn't hang on a long list of dots"""
    nlp = German()
    doc = nlp(
        "880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl"
    )
    assert len(doc) == 5


def test_issue3009(en_vocab):
    """Test problem with matcher quantifiers"""
    patterns = [
        [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}],
        [
            {"LEMMA": "have"},
            {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
            {"LOWER": "to"},
            {"LOWER": "do"},
            {"POS": "ADP"},
        ],
        [
            {"LEMMA": "have"},
            {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
            {"LOWER": "to"},
            {"LOWER": "do"},
            {"POS": "ADP"},
        ],
    ]
    words = ["also", "has", "to", "do", "with"]
    tags = ["RB", "VBZ", "TO", "VB", "IN"]
    doc = get_doc(en_vocab, words=words, tags=tags)
    matcher = Matcher(en_vocab)
    for i, pattern in enumerate(patterns):
        matcher.add(str(i), None, pattern)
        matches = matcher(doc)
        assert matches


def test_issue3012(en_vocab):
    """Test that the is_tagged attribute doesn't get overwritten when we from_array
    without tag information."""
    words = ["This", "is", "10", "%", "."]
    tags = ["DT", "VBZ", "CD", "NN", "."]
    pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
    ents = [(2, 4, "PERCENT")]
    doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
    assert doc.is_tagged

    expected = ("10", "NUM", "CD", "PERCENT")
    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected

    header = [ENT_IOB, ENT_TYPE]
    ent_array = doc.to_array(header)
    doc.from_array(header, ent_array)

    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected

    # Serializing then deserializing
    doc_bytes = doc.to_bytes()
    doc2 = Doc(en_vocab).from_bytes(doc_bytes)
    assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected


def test_issue3199():
    """Test that Span.noun_chunks works correctly if no noun chunks iterator
    is available. To make this test future-proof, we're constructing a Doc
    with a new Vocab here and setting is_parsed to make sure the noun chunks run.
    """
    doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
    doc.is_parsed = True
    assert list(doc[0:3].noun_chunks) == []


def test_issue3209():
    """Test issue that occurred in spaCy nightly where NER labels were being
    mapped to classes incorrectly after loading the model, when the labels
    were added using ner.add_label().
    """
    nlp = English()
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)

    ner.add_label("ANIMAL")
    nlp.begin_training()
    move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
    assert ner.move_names == move_names
    nlp2 = English()
    nlp2.add_pipe(nlp2.create_pipe("ner"))
    nlp2.from_bytes(nlp.to_bytes())
    assert nlp2.get_pipe("ner").move_names == move_names


def test_issue3248_1():
    """Test that the PhraseMatcher correctly reports its number of rules, not
    total number of patterns."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    assert len(matcher) == 2


def test_issue3248_2():
    """Test that the PhraseMatcher can be pickled correctly."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    data = pickle.dumps(matcher)
    new_matcher = pickle.loads(data)
    assert len(new_matcher) == len(matcher)


def test_issue3277(es_tokenizer):
    """Test that hyphens are split correctly as prefixes."""
    doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.")
    assert len(doc) == 14
    assert doc[0].text == "\u2014"
    assert doc[5].text == "\u2013"
    assert doc[9].text == "\u2013"


def test_issue3288(en_vocab):
    """Test that retokenization works correctly via displaCy when punctuation
    is merged onto the preceeding token and tensor is resized."""
    words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
    heads = [1, 0, -1, 1, 0, 1, -2, -3]
    deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
    doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
    displacy.render(doc)


def test_issue3289():
    """Test that Language.to_bytes handles serializing a pipeline component
    with an uninitialized model."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("textcat"))
    bytes_data = nlp.to_bytes()
    new_nlp = English()
    new_nlp.add_pipe(nlp.create_pipe("textcat"))
    new_nlp.from_bytes(bytes_data)


def test_issue3328(en_vocab):
    doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
    matcher = Matcher(en_vocab)
    patterns = [
        [{"LOWER": {"IN": ["hello", "how"]}}],
        [{"LOWER": {"IN": ["you", "doing"]}}],
    ]
    matcher.add("TEST", None, *patterns)
    matches = matcher(doc)
    assert len(matches) == 4
    matched_texts = [doc[start:end].text for _, start, end in matches]
    assert matched_texts == ["Hello", "how", "you", "doing"]


@pytest.mark.xfail
def test_issue3331(en_vocab):
    """Test that duplicate patterns for different rules result in multiple
    matches, one per rule.
    """
    matcher = PhraseMatcher(en_vocab)
    matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
    matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
    doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
    matches = matcher(doc)
    assert len(matches) == 2
    match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
    assert sorted(match_ids) == ["A", "B"]


def test_issue3345():
    """Test case where preset entity crosses sentence boundary."""
    nlp = English()
    doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
    doc[4].is_sent_start = True
    ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
    ner = EntityRecognizer(doc.vocab)
    # Add the OUT action. I wouldn't have thought this would be necessary...
    ner.moves.add_action(5, "")
    ner.add_label("GPE")
    doc = ruler(doc)
    # Get into the state just before "New"
    state = ner.moves.init_batch([doc])[0]
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    # Check that B-GPE is valid.
    assert ner.moves.is_valid(state, "B-GPE")


if is_python2:
    # If we have this test in Python 3, pytest chokes, as it can't print the
    # string above in the xpass message.
    prefix_search = (
        b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
        b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
        b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
        b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
        b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
        b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
        b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
        b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
        b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
        b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
        b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
        b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
        b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
        b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
        b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
        b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
        b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
        b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
        b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
        b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
        b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
        b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
        b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
        b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
        b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
        b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
        b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
        b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
        b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
        b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
        b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
        b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
        b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
        b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
        b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
        b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
        b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
        b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
        b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
        b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
        b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
        b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
        b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
        b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
        b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
        b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
        b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
        b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
        b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
        b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
        b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
        b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
        b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
        b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
        b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
        b"\\U0001FA60-\\U0001FA6D]"
    )

    def test_issue3356():
        pattern = re.compile(unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search("hello")


def test_issue3410():
    texts = ["Hello world", "This is a test"]
    nlp = English()
    matcher = Matcher(nlp.vocab)
    phrasematcher = PhraseMatcher(nlp.vocab)
    with pytest.deprecated_call():
        docs = list(nlp.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        list(matcher.pipe(docs, n_threads=4))
    with pytest.deprecated_call():
        list(phrasematcher.pipe(docs, n_threads=4))


def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.0
    size = next(sizes)
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10.0 - 0.5 - 0.5


@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)
    assert t1[5].text == "I"
    assert t2[5].text == "I"
    assert t3[5].text == "I"


def test_issue3468():
    """Test that sentence boundaries are set correctly so Doc.is_sentenced can
    be restored after serialization."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    doc = nlp("Hello world")
    assert doc[0].is_sent_start
    assert doc.is_sentenced
    assert len(list(doc.sents)) == 1
    doc_bytes = doc.to_bytes()
    new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
    assert new_doc[0].is_sent_start
    assert new_doc.is_sentenced
    assert len(list(new_doc.sents)) == 1
@@ -1,11 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.de import German


def test_issue3002():
    """Test that the tokenizer doesn't hang on a long list of dots"""
    nlp = German()
    doc = nlp('880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl')
    assert len(doc) == 5
@@ -1,67 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.matcher import Matcher
from spacy.tokens import Doc


PATTERNS = [
    ("1", [[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}]]),
    (
        "2",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
    (
        "3",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
]


@pytest.fixture
def doc(en_tokenizer):
    doc = en_tokenizer("also has to do with")
    doc[0].tag_ = "RB"
    doc[1].tag_ = "VBZ"
    doc[2].tag_ = "TO"
    doc[3].tag_ = "VB"
    doc[4].tag_ = "IN"
    return doc


@pytest.fixture
def matcher(en_tokenizer):
    return Matcher(en_tokenizer.vocab)


@pytest.mark.parametrize("pattern", PATTERNS)
def test_issue3009(doc, matcher, pattern):
    """Test problem with matcher quantifiers"""
    matcher.add(pattern[0], None, *pattern[1])
    matches = matcher(doc)
    assert matches


def test_issue2464(matcher):
    """Test problem with successive ?. This is the same bug, so putting it here."""
    doc = Doc(matcher.vocab, words=["a", "b"])
    matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
    matches = matcher(doc)
    assert len(matches) == 3
			||||||
| 
						 | 
@@ -1,31 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from ...attrs import ENT_IOB, ENT_TYPE
-from ...tokens import Doc
-from ..util import get_doc
-
-
-def test_issue3012(en_vocab):
-    """Test that the is_tagged attribute doesn't get overwritten when we from_array
-    without tag information."""
-    words = ["This", "is", "10", "%", "."]
-    tags = ["DT", "VBZ", "CD", "NN", "."]
-    pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
-    ents = [(2, 4, "PERCENT")]
-    doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
-    assert doc.is_tagged
-
-    expected = ("10", "NUM", "CD", "PERCENT")
-    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
-
-    header = [ENT_IOB, ENT_TYPE]
-    ent_array = doc.to_array(header)
-    doc.from_array(header, ent_array)
-
-    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
-
-    # serializing then deserializing
-    doc_bytes = doc.to_bytes()
-    doc2 = Doc(en_vocab).from_bytes(doc_bytes)
-    assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected
@@ -1,15 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from spacy.tokens import Doc
-from spacy.vocab import Vocab
-
-
-def test_issue3199():
-    """Test that Span.noun_chunks works correctly if no noun chunks iterator
-    is available. To make this test future-proof, we're constructing a Doc
-    with a new Vocab here and setting is_parsed to make sure the noun chunks run.
-    """
-    doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
-    doc.is_parsed = True
-    assert list(doc[0:3].noun_chunks) == []
@@ -1,23 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from spacy.lang.en import English
-
-
-def test_issue3209():
-    """Test issue that occurred in spaCy nightly where NER labels were being
-    mapped to classes incorrectly after loading the model, when the labels
-    were added using ner.add_label().
-    """
-    nlp = English()
-    ner = nlp.create_pipe("ner")
-    nlp.add_pipe(ner)
-
-    ner.add_label("ANIMAL")
-    nlp.begin_training()
-    move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
-    assert ner.move_names == move_names
-    nlp2 = English()
-    nlp2.add_pipe(nlp2.create_pipe("ner"))
-    nlp2.from_bytes(nlp.to_bytes())
-    assert nlp2.get_pipe("ner").move_names == move_names
@@ -1,27 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-from spacy.matcher import PhraseMatcher
-from spacy.lang.en import English
-from spacy.compat import pickle
-
-
-def test_issue3248_1():
-    """Test that the PhraseMatcher correctly reports its number of rules, not
-    total number of patterns."""
-    nlp = English()
-    matcher = PhraseMatcher(nlp.vocab)
-    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
-    matcher.add("TEST2", None, nlp("d"))
-    assert len(matcher) == 2
-
-
-def test_issue3248_2():
-    """Test that the PhraseMatcher can be pickled correctly."""
-    nlp = English()
-    matcher = PhraseMatcher(nlp.vocab)
-    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
-    matcher.add("TEST2", None, nlp("d"))
-    data = pickle.dumps(matcher)
-    new_matcher = pickle.loads(data)
-    assert len(new_matcher) == len(matcher)
@@ -1,11 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-
-def test_issue3277(es_tokenizer):
-    """Test that hyphens are split correctly as prefixes."""
-    doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.")
-    assert len(doc) == 14
-    assert doc[0].text == "\u2014"
-    assert doc[5].text == "\u2013"
-    assert doc[9].text == "\u2013"
@@ -1,18 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import numpy
-from spacy import displacy
-
-from ..util import get_doc
-
-
-def test_issue3288(en_vocab):
-    """Test that retokenization works correctly via displaCy when punctuation
-    is merged onto the preceding token and tensor is resized."""
-    words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
-    heads = [1, 0, -1, 1, 0, 1, -2, -3]
-    deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
-    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
-    doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
-    displacy.render(doc)
@@ -1,15 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-from spacy.lang.en import English
-
-
-def test_issue3289():
-    """Test that Language.to_bytes handles serializing a pipeline component
-    with an uninitialized model."""
-    nlp = English()
-    nlp.add_pipe(nlp.create_pipe("textcat"))
-    bytes_data = nlp.to_bytes()
-    new_nlp = English()
-    new_nlp.add_pipe(nlp.create_pipe("textcat"))
-    new_nlp.from_bytes(bytes_data)
@@ -1,19 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-from spacy.matcher import Matcher
-from spacy.tokens import Doc
-
-
-def test_issue3328(en_vocab):
-    doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
-    matcher = Matcher(en_vocab)
-    patterns = [
-        [{"LOWER": {"IN": ["hello", "how"]}}],
-        [{"LOWER": {"IN": ["you", "doing"]}}],
-    ]
-    matcher.add("TEST", None, *patterns)
-    matches = matcher(doc)
-    assert len(matches) == 4
-    matched_texts = [doc[start:end].text for _, start, end in matches]
-    assert matched_texts == ["Hello", "how", "you", "doing"]
@@ -1,21 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import pytest
-from spacy.matcher import PhraseMatcher
-from spacy.tokens import Doc
-
-
-@pytest.mark.xfail
-def test_issue3331(en_vocab):
-    """Test that duplicate patterns for different rules result in multiple
-    matches, one per rule.
-    """
-    matcher = PhraseMatcher(en_vocab)
-    matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
-    matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
-    doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
-    matches = matcher(doc)
-    assert len(matches) == 2
-    match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
-    assert sorted(match_ids) == ["A", "B"]
@@ -1,26 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from spacy.lang.en import English
-from spacy.tokens import Doc
-from spacy.pipeline import EntityRuler, EntityRecognizer
-
-
-def test_issue3345():
-    """Test case where preset entity crosses sentence boundary."""
-    nlp = English()
-    doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
-    doc[4].is_sent_start = True
-    ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
-    ner = EntityRecognizer(doc.vocab)
-    # Add the OUT action. I wouldn't have thought this would be necessary...
-    ner.moves.add_action(5, "")
-    ner.add_label("GPE")
-    doc = ruler(doc)
-    # Get into the state just before "New"
-    state = ner.moves.init_batch([doc])[0]
-    ner.moves.apply_transition(state, "O")
-    ner.moves.apply_transition(state, "O")
-    ner.moves.apply_transition(state, "O")
-    # Check that B-GPE is valid.
-    assert ner.moves.is_valid(state, "B-GPE")
@@ -1,72 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-import re
-from spacy import compat
-
-prefix_search = (
-    b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
-    b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
-    b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
-    b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
-    b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
-    b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
-    b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
-    b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
-    b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
-    b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
-    b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
-    b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
-    b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
-    b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
-    b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
-    b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
-    b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
-    b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
-    b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
-    b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
-    b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
-    b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
-    b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
-    b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
-    b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
-    b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
-    b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
-    b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
-    b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
-    b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
-    b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
-    b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
-    b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
-    b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
-    b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
-    b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
-    b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
-    b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
-    b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
-    b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
-    b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
-    b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
-    b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
-    b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
-    b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
-    b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
-    b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
-    b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
-    b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
-    b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
-    b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
-    b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
-    b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
-    b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
-    b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
-    b"\\U0001FA60-\\U0001FA6D]"
-)
-
-
-if compat.is_python2:
-    # If we have this test in Python 3, pytest chokes, as it can't print the
-    # string above in the xpass message.
-    def test_issue3356():
-        pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
-        assert not pattern.search("hello")
@@ -1,21 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-import pytest
-from spacy.lang.en import English
-from spacy.matcher import Matcher, PhraseMatcher
-
-
-def test_issue3410():
-    texts = ["Hello world", "This is a test"]
-    nlp = English()
-    matcher = Matcher(nlp.vocab)
-    phrasematcher = PhraseMatcher(nlp.vocab)
-    with pytest.deprecated_call():
-        docs = list(nlp.pipe(texts, n_threads=4))
-    with pytest.deprecated_call():
-        docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
-    with pytest.deprecated_call():
-        list(matcher.pipe(docs, n_threads=4))
-    with pytest.deprecated_call():
-        list(phrasematcher.pipe(docs, n_threads=4))
@@ -1,14 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from spacy.util import decaying
-
-
-def test_issue3447():
-    sizes = decaying(10.0, 1.0, 0.5)
-    size = next(sizes)
-    assert size == 10.0
-    size = next(sizes)
-    assert size == 10.0 - 0.5
-    size = next(sizes)
-    assert size == 10.0 - 0.5 - 0.5
@@ -1,21 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-import pytest
-
-from spacy.lang.en import English
-
-
-@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
-def test_issue3449():
-    nlp = English()
-    nlp.add_pipe(nlp.create_pipe("sentencizer"))
-    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
-    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
-    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
-    t1 = nlp(text1)
-    t2 = nlp(text2)
-    t3 = nlp(text3)
-    assert t1[5].text == "I"
-    assert t2[5].text == "I"
-    assert t3[5].text == "I"
@@ -1,21 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from spacy.lang.en import English
-from spacy.tokens import Doc
-
-
-def test_issue3468():
-    """Test that sentence boundaries are set correctly so Doc.is_sentenced can
-    be restored after serialization."""
-    nlp = English()
-    nlp.add_pipe(nlp.create_pipe("sentencizer"))
-    doc = nlp("Hello world")
-    assert doc[0].is_sent_start
-    assert doc.is_sentenced
-    assert len(list(doc.sents)) == 1
-    doc_bytes = doc.to_bytes()
-    new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
-    assert new_doc[0].is_sent_start
-    assert new_doc.is_sentenced
-    assert len(list(new_doc.sents)) == 1
88 spacy/tests/regression/test_issue3526.py Normal file
@@ -0,0 +1,88 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+from spacy.tokens import Span
+from spacy.language import Language
+from spacy.pipeline import EntityRuler
+from spacy import load
+import srsly
+
+from ..util import make_tempdir
+
+
+@pytest.fixture
+def patterns():
+    return [
+        {"label": "HELLO", "pattern": "hello world"},
+        {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
+        {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
+        {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
+        {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
+    ]
+
+
+@pytest.fixture
+def add_ent():
+    def add_ent_component(doc):
+        doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
+        return doc
+
+    return add_ent_component
+
+
+def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    ruler_bytes = ruler.to_bytes()
+    assert len(ruler) == len(patterns)
+    assert len(ruler.labels) == 4
+    assert ruler.overwrite
+    new_ruler = EntityRuler(nlp)
+    new_ruler = new_ruler.from_bytes(ruler_bytes)
+    assert len(new_ruler) == len(ruler)
+    assert len(new_ruler.labels) == 4
+    assert new_ruler.overwrite == ruler.overwrite
+    assert new_ruler.ent_id_sep == ruler.ent_id_sep
+
+
+def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
+    new_ruler = EntityRuler(nlp)
+    new_ruler = new_ruler.from_bytes(bytes_old_style)
+    assert len(new_ruler) == len(ruler)
+    for pattern in ruler.patterns:
+        assert pattern in new_ruler.patterns
+    assert new_ruler.overwrite is not ruler.overwrite
+
+
+def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
+    with make_tempdir() as tmpdir:
+        out_file = tmpdir / "entity_ruler"
+        srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns)
+        new_ruler = EntityRuler(nlp).from_disk(out_file)
+        for pattern in ruler.patterns:
+            assert pattern in new_ruler.patterns
+        assert len(new_ruler) == len(ruler)
+        assert new_ruler.overwrite is not ruler.overwrite
+
+
+def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
+    nlp = Language(vocab=en_vocab)
+    ruler = EntityRuler(nlp, overwrite_ents=True)
+
+    ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
+    nlp.add_pipe(ruler)
+    with make_tempdir() as tmpdir:
+        nlp.to_disk(tmpdir)
+        ruler = nlp.get_pipe("entity_ruler")
+        assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
+        assert ruler.overwrite is True
+        nlp2 = load(tmpdir)
+        new_ruler = nlp2.get_pipe("entity_ruler")
+        assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
+        assert new_ruler.overwrite is True
51 spacy/tests/regression/test_issue3611.py Normal file
@@ -0,0 +1,51 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+import spacy
+from spacy.util import minibatch, compounding
+
+
+def test_issue3611():
+    """ Test whether adding n-grams in the textcat works even when n > token length of some docs """
+    unique_classes = ["offensive", "inoffensive"]
+    x_train = ["This is an offensive text",
+               "This is the second offensive text",
+               "inoff"]
+    y_train = ["offensive", "offensive", "inoffensive"]
+
+    # preparing the data
+    pos_cats = list()
+    for train_instance in y_train:
+        pos_cats.append({label: label == train_instance for label in unique_classes})
+    train_data = list(zip(x_train, [{'cats': cats} for cats in pos_cats]))
+
+    # set up the spacy model with a text categorizer component
+    nlp = spacy.blank('en')
+
+    textcat = nlp.create_pipe(
+        "textcat",
+        config={
+            "exclusive_classes": True,
+            "architecture": "bow",
+            "ngram_size": 2
+        }
+    )
+
+    for label in unique_classes:
+        textcat.add_label(label)
+    nlp.add_pipe(textcat, last=True)
+
+    # training the network
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
+    with nlp.disable_pipes(*other_pipes):
+        optimizer = nlp.begin_training()
+        for i in range(3):
+            losses = {}
+            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
+
+            for batch in batches:
+                texts, annotations = zip(*batch)
+                nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=0.1, losses=losses)
+
+
10 spacy/tests/regression/test_issue3625.py Normal file
@@ -0,0 +1,10 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from spacy.lang.hi import Hindi
+
+def test_issue3625():
+    """Test that default punctuation rules apply to hindi unicode characters"""
+    nlp = Hindi()
+    doc = nlp(u"hi. how हुए. होटल, होटल")
+    assert [token.text for token in doc] == ['hi', '.', 'how', 'हुए', '.', 'होटल', ',', 'होटल']
@@ -6,7 +6,6 @@ from spacy.matcher import Matcher
 from spacy.tokens import Doc


-@pytest.mark.xfail
 def test_issue3839(en_vocab):
     """Test that match IDs returned by the matcher are correct, are in the string """
     doc = Doc(en_vocab, words=["terrific", "group", "of", "people"])
31 spacy/tests/regression/test_issue3869.py Normal file
@@ -0,0 +1,31 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+
+from spacy.attrs import IS_ALPHA
+from spacy.lang.en import English
+
+
+@pytest.mark.parametrize(
+    "sentence",
+    [
+        'The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.',
+        'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s #1.',
+        'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s number one',
+        'Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.',
+        "It was a missed assignment, but it shouldn't have resulted in a turnover ..."
+    ],
+)
+def test_issue3869(sentence):
+    """Test that the Doc's count_by function works consistently"""
+    nlp = English()
+    doc = nlp(sentence)
+
+    count = 0
+    for token in doc:
+        count += token.is_alpha
+
+    assert count == doc.count_by(IS_ALPHA).get(1, 0)
+
+
22 spacy/tests/regression/test_issue3880.py Normal file
@@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from spacy.lang.en import English
+
+
+def test_issue3880():
+    """Test that `nlp.pipe()` works when an empty string ends the batch.
+
+    Fixed in v7.0.5 of Thinc.
+    """
+    texts = ["hello", "world", "", ""]
+    nlp = English()
+    nlp.add_pipe(nlp.create_pipe("parser"))
+    nlp.add_pipe(nlp.create_pipe("ner"))
+    nlp.add_pipe(nlp.create_pipe("tagger"))
+    nlp.get_pipe("parser").add_label("dep")
+    nlp.get_pipe("ner").add_label("PERSON")
+    nlp.get_pipe("tagger").add_label("NN")
+    nlp.begin_training()
+    for doc in nlp.pipe(texts):
+        pass
74 spacy/tests/serialize/test_serialize_kb.py Normal file
@@ -0,0 +1,74 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+from ..util import make_tempdir
+from ...util import ensure_path
+
+from spacy.kb import KnowledgeBase
+
+
+def test_serialize_kb_disk(en_vocab):
+    # baseline assertions
+    kb1 = _get_dummy_kb(en_vocab)
+    _check_kb(kb1)
+
+    # dumping to file & loading back in
+    with make_tempdir() as d:
+        dir_path = ensure_path(d)
+        if not dir_path.exists():
+            dir_path.mkdir()
+        file_path = dir_path / "kb"
+        kb1.dump(str(file_path))
+
+        kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3)
+        kb2.load_bulk(str(file_path))
+
+    # final assertions
+    _check_kb(kb2)
+
+
+def _get_dummy_kb(vocab):
+    kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
+
+    kb.add_entity(entity='Q53', prob=0.33, entity_vector=[0, 5, 3])
+    kb.add_entity(entity='Q17', prob=0.2, entity_vector=[7, 1, 0])
+    kb.add_entity(entity='Q007', prob=0.7, entity_vector=[0, 0, 7])
+    kb.add_entity(entity='Q44', prob=0.4, entity_vector=[4, 4, 4])
+
+    kb.add_alias(alias='double07', entities=['Q17', 'Q007'], probabilities=[0.1, 0.9])
+    kb.add_alias(alias='guy', entities=['Q53', 'Q007', 'Q17', 'Q44'], probabilities=[0.3, 0.3, 0.2, 0.1])
+    kb.add_alias(alias='random', entities=['Q007'], probabilities=[1.0])
+
+    return kb
+
+
+def _check_kb(kb):
+    # check entities
+    assert kb.get_size_entities() == 4
+    for entity_string in ['Q53', 'Q17', 'Q007', 'Q44']:
+        assert entity_string in kb.get_entity_strings()
+    for entity_string in ['', 'Q0']:
+        assert entity_string not in kb.get_entity_strings()
+
+    # check aliases
+    assert kb.get_size_aliases() == 3
+    for alias_string in ['double07', 'guy', 'random']:
+        assert alias_string in kb.get_alias_strings()
+    for alias_string in ['nothingness', '', 'randomnoise']:
+        assert alias_string not in kb.get_alias_strings()
+
+    # check candidates & probabilities
+    candidates = sorted(kb.get_candidates('double07'), key=lambda x: x.entity_)
+    assert len(candidates) == 2
+
+    assert candidates[0].entity_ == 'Q007'
+    assert 0.6999 < candidates[0].entity_freq < 0.701
+    assert candidates[0].entity_vector == [0, 0, 7]
+    assert candidates[0].alias_ == 'double07'
+    assert 0.899 < candidates[0].prior_prob < 0.901
+
+    assert candidates[1].entity_ == 'Q17'
+    assert 0.199 < candidates[1].entity_freq < 0.201
+    assert candidates[1].entity_vector == [7, 1, 0]
+    assert candidates[1].alias_ == 'double07'
+    assert 0.099 < candidates[1].prior_prob < 0.101
					@ -11,29 +11,27 @@ from ..tokens import Doc
 | 
				
			||||||
from ..attrs import SPACY, ORTH
 | 
					from ..attrs import SPACY, ORTH
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
class Binder(object):
 | 
					class DocBox(object):
 | 
				
			||||||
    """Serialize analyses from a collection of doc objects."""
 | 
					    """Serialize analyses from a collection of doc objects."""
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def __init__(self, attrs=None):
 | 
					    def __init__(self, attrs=None, store_user_data=False):
 | 
				
			||||||
        """Create a Binder object, to hold serialized annotations.
 | 
					        """Create a DocBox object, to hold serialized annotations.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
 | 
					        attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
 | 
				
			||||||
            always serialized, so they're not required. Defaults to None.
 | 
					            always serialized, so they're not required. Defaults to None.
 | 
				
			||||||
        """
 | 
					        """
 | 
				
			||||||
        attrs = attrs or []
 | 
					        attrs = attrs or []
 | 
				
			||||||
        self.attrs = list(attrs)
 | 
					 | 
				
			||||||
        # Ensure ORTH is always attrs[0]
 | 
					        # Ensure ORTH is always attrs[0]
 | 
				
			||||||
        if ORTH in self.attrs:
 | 
					        self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
 | 
				
			||||||
            self.attrs.pop(ORTH)
 | 
					 | 
				
			||||||
        if SPACY in self.attrs:
 | 
					 | 
				
			||||||
            self.attrs.pop(SPACY)
 | 
					 | 
				
			||||||
        self.attrs.insert(0, ORTH)
 | 
					        self.attrs.insert(0, ORTH)
 | 
				
			||||||
        self.tokens = []
 | 
					        self.tokens = []
 | 
				
			||||||
        self.spaces = []
 | 
					        self.spaces = []
 | 
				
			||||||
 | 
					        self.user_data = []
 | 
				
			||||||
        self.strings = set()
 | 
					        self.strings = set()
 | 
				
			||||||
 | 
					        self.store_user_data = store_user_data
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def add(self, doc):
 | 
					    def add(self, doc):
 | 
				
			||||||
        """Add a doc's annotations to the binder for serialization."""
 | 
					        """Add a doc's annotations to the DocBox for serialization."""
 | 
				
			||||||
        array = doc.to_array(self.attrs)
 | 
					        array = doc.to_array(self.attrs)
 | 
				
			||||||
        if len(array.shape) == 1:
 | 
					        if len(array.shape) == 1:
 | 
				
			||||||
            array = array.reshape((array.shape[0], 1))
 | 
					            array = array.reshape((array.shape[0], 1))
 | 
				
			||||||
| 
						 | 
					@ -43,27 +41,35 @@ class Binder(object):
 | 
				
			||||||
        spaces = spaces.reshape((spaces.shape[0], 1))
 | 
					        spaces = spaces.reshape((spaces.shape[0], 1))
 | 
				
			||||||
        self.spaces.append(numpy.asarray(spaces, dtype=bool))
 | 
					        self.spaces.append(numpy.asarray(spaces, dtype=bool))
 | 
				
			||||||
        self.strings.update(w.text for w in doc)
 | 
					        self.strings.update(w.text for w in doc)
 | 
				
			||||||
 | 
					        if self.store_user_data:
 | 
				
			||||||
 | 
					            self.user_data.append(srsly.msgpack_dumps(doc.user_data))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def get_docs(self, vocab):
 | 
					    def get_docs(self, vocab):
 | 
				
			||||||
        """Recover Doc objects from the annotations, using the given vocab."""
 | 
					        """Recover Doc objects from the annotations, using the given vocab."""
 | 
				
			||||||
        for string in self.strings:
 | 
					        for string in self.strings:
 | 
				
			||||||
            vocab[string]
 | 
					            vocab[string]
 | 
				
			||||||
        orth_col = self.attrs.index(ORTH)
 | 
					        orth_col = self.attrs.index(ORTH)
 | 
				
			||||||
        for tokens, spaces in zip(self.tokens, self.spaces):
 | 
					        for i in range(len(self.tokens)):
 | 
				
			||||||
 | 
					            tokens = self.tokens[i]
 | 
				
			||||||
 | 
					            spaces = self.spaces[i]
 | 
				
			||||||
            words = [vocab.strings[orth] for orth in tokens[:, orth_col]]
 | 
					            words = [vocab.strings[orth] for orth in tokens[:, orth_col]]
 | 
				
			||||||
            doc = Doc(vocab, words=words, spaces=spaces)
 | 
					            doc = Doc(vocab, words=words, spaces=spaces)
 | 
				
			||||||
            doc = doc.from_array(self.attrs, tokens)
 | 
					            doc = doc.from_array(self.attrs, tokens)
 | 
				
			||||||
 | 
					            if self.store_user_data:
 | 
				
			||||||
 | 
					                doc.user_data.update(srsly.msgpack_loads(self.user_data[i]))
 | 
				
			||||||
            yield doc
 | 
					            yield doc
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def merge(self, other):
 | 
					    def merge(self, other):
 | 
				
			||||||
        """Extend the annotations of this binder with the annotations from another."""
 | 
					        """Extend the annotations of this DocBox with the annotations from another."""
 | 
				
			||||||
        assert self.attrs == other.attrs
 | 
					        assert self.attrs == other.attrs
 | 
				
			||||||
        self.tokens.extend(other.tokens)
 | 
					        self.tokens.extend(other.tokens)
 | 
				
			||||||
        self.spaces.extend(other.spaces)
 | 
					        self.spaces.extend(other.spaces)
 | 
				
			||||||
        self.strings.update(other.strings)
 | 
					        self.strings.update(other.strings)
 | 
				
			||||||
 | 
					        if self.store_user_data:
 | 
				
			||||||
 | 
					            self.user_data.extend(other.user_data)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def to_bytes(self):
 | 
					    def to_bytes(self):
 | 
				
			||||||
        """Serialize the binder's annotations into a byte string."""
 | 
					        """Serialize the DocBox's annotations into a byte string."""
 | 
				
			||||||
        for tokens in self.tokens:
 | 
					        for tokens in self.tokens:
 | 
				
			||||||
            assert len(tokens.shape) == 2, tokens.shape
 | 
					            assert len(tokens.shape) == 2, tokens.shape
 | 
				
			||||||
        lengths = [len(tokens) for tokens in self.tokens]
 | 
					        lengths = [len(tokens) for tokens in self.tokens]
 | 
				
			||||||
| 
						 | 
					@ -74,10 +80,12 @@ class Binder(object):
 | 
				
			||||||
            "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
 | 
					            "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
 | 
				
			||||||
            "strings": list(self.strings),
 | 
					            "strings": list(self.strings),
 | 
				
			||||||
        }
 | 
					        }
 | 
				
			||||||
 | 
					        if self.store_user_data:
 | 
				
			||||||
 | 
					            msg["user_data"] = self.user_data
 | 
				
			||||||
        return gzip.compress(srsly.msgpack_dumps(msg))
 | 
					        return gzip.compress(srsly.msgpack_dumps(msg))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def from_bytes(self, string):
 | 
					    def from_bytes(self, string):
 | 
				
			||||||
        """Deserialize the binder's annotations from a byte string."""
 | 
					        """Deserialize the DocBox's annotations from a byte string."""
 | 
				
			||||||
        msg = srsly.msgpack_loads(gzip.decompress(string))
 | 
					        msg = srsly.msgpack_loads(gzip.decompress(string))
 | 
				
			||||||
        self.attrs = msg["attrs"]
 | 
					        self.attrs = msg["attrs"]
 | 
				
			||||||
        self.strings = set(msg["strings"])
 | 
					        self.strings = set(msg["strings"])
 | 
				
			||||||
| 
						 | 
@@ -89,29 +97,38 @@ class Binder(object):
         flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))
         self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
         self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
+        if self.store_user_data and "user_data" in msg:
+            self.user_data = list(msg["user_data"])
         for tokens in self.tokens:
             assert len(tokens.shape) == 2, tokens.shape
         return self


-def merge_bytes(binder_strings):
-    """Concatenate multiple serialized binders into one byte string."""
-    output = None
-    for byte_string in binder_strings:
-        binder = Binder().from_bytes(byte_string)
-        if output is None:
-            output = binder
-        else:
-            output.merge(binder)
-    return output.to_bytes()
+def merge_boxes(boxes):
+    merged = None
+    for byte_string in boxes:
+        if byte_string is not None:
+            box = DocBox(store_user_data=True).from_bytes(byte_string)
+            if merged is None:
+                merged = box
+            else:
+                merged.merge(box)
+    if merged is not None:
+        return merged.to_bytes()
+    else:
+        return b""


-def pickle_binder(binder):
-    return (unpickle_binder, (binder.to_bytes(),))
+def pickle_box(box):
+    return (unpickle_box, (box.to_bytes(),))


-def unpickle_binder(byte_string):
-    return Binder().from_bytes(byte_string)
+def unpickle_box(byte_string):
+    return DocBox().from_bytes(byte_string)


-copy_reg.pickle(Binder, pickle_binder, unpickle_binder)
+copy_reg.pickle(DocBox, pickle_box, unpickle_box)
+# Compatibility, as we had named it this previously.
+Binder = DocBox
+
+__all__ = ["DocBox"]
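The renamed serialization helpers above are easiest to follow with a small end-to-end example. The sketch below is a usage guess rather than a definitive recipe: it assumes `DocBox` and `merge_boxes` live in `spacy.tokens._serialize` and that `DocBox` exposes an `add()` method for collecting `Doc` objects, neither of which is shown in this hunk.

```python
# Minimal sketch of the renamed DocBox / merge_boxes helpers. The import
# path and the add() call are assumptions; only from_bytes/to_bytes/merge
# and merge_boxes itself appear in the diff above.
import spacy
from spacy.tokens._serialize import DocBox, merge_boxes

nlp = spacy.blank("en")

box = DocBox(store_user_data=True)
box.add(nlp("Hello world."))  # assumed API for filling the box

# merge_boxes skips None entries and returns b"" when nothing was merged.
merged_bytes = merge_boxes([box.to_bytes(), None])
restored = DocBox(store_user_data=True).from_bytes(merged_bytes)
```

Because the module also registers `pickle_box`/`unpickle_box` with `copy_reg`, a `DocBox` can be pickled directly, and the `Binder = DocBox` alias keeps existing imports working.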
@@ -1,6 +1,5 @@
 from cymem.cymem cimport Pool
 cimport numpy as np
-from preshed.counter cimport PreshCounter

 from ..vocab cimport Vocab
 from ..structs cimport TokenC, LexemeC
@@ -9,6 +9,7 @@ cimport cython
 cimport numpy as np
 from libc.string cimport memcpy, memset
 from libc.math cimport sqrt
+from collections import Counter

 import numpy
 import numpy.linalg
@@ -22,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
 from ..typedefs cimport attr_t, flags_t
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
 from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
-from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t
+from ..attrs cimport ENT_TYPE, ENT_KB_ID, SENT_START, attr_id_t
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t

 from ..attrs import intify_attrs, IDS
@@ -64,6 +65,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
         return token.ent_iob
     elif feat_name == ENT_TYPE:
         return token.ent_type
+    elif feat_name == ENT_KB_ID:
+        return token.ent_kb_id
     else:
         return Lexeme.get_struct_attr(token.lex, feat_name)

@@ -85,13 +88,14 @@ cdef class Doc:
     Python-level `Token` and `Span` objects are views of this array, i.e.
     they don't own the data themselves.

-    EXAMPLE: Construction 1
+    EXAMPLE:
+        Construction 1
         >>> doc = nlp(u'Some text')

         Construction 2
         >>> from spacy.tokens import Doc
         >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
-                      spaces=[True, False, False])
+        >>>           spaces=[True, False, False])

     DOCS: https://spacy.io/api/doc
     """
@@ -237,6 +241,8 @@ cdef class Doc:
             return True
         if self.is_parsed:
             return True
+        if len(self) < 2:
+            return True
         for i in range(1, self.length):
             if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
                 return True
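Judging by the surrounding checks, this hunk sits in the `Doc.is_sentenced` property, so the new length guard means a trivially short `Doc` no longer needs a parse or explicit sentence starts to count as sentenced. A minimal sketch, assuming that property name:

```python
# Sketch: with the new `len(self) < 2` guard, a one-token Doc reports
# is_sentenced even though no sentence boundaries were ever set.
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=["Hello"])
assert doc.is_sentenced
```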
@@ -248,6 +254,8 @@ cdef class Doc:
         *any* of the tokens has a named entity tag set (even if the others are
         uknown values).
         """
+        if len(self) == 0:
+            return True
         for i in range(self.length):
             if self.c[i].ent_iob != 0:
                 return True
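The docstring fragment above matches the `Doc.is_nered` property, so the zero-length guard presumably makes an empty `Doc` count as fully annotated rather than falling through the token loop. A sketch under that assumption:

```python
# Sketch: an empty Doc now reports is_nered True. The property name is
# inferred from the docstring shown in the hunk above.
from spacy.vocab import Vocab
from spacy.tokens import Doc

empty = Doc(Vocab(), words=[])
assert empty.is_nered
```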
@@ -690,7 +698,7 @@ cdef class Doc:
         # Handle 1d case
         return output if len(attr_ids) >= 2 else output.reshape((self.length,))

-    def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
+    def count_by(self, attr_id_t attr_id, exclude=None, object counts=None):
         """Count the frequencies of a given attribute. Produces a dict of
         `{attribute (int): count (ints)}` frequencies, keyed by the values of
         the given attribute ID.
@@ -705,19 +713,18 @@ cdef class Doc:
         cdef size_t count

         if counts is None:
-            counts = PreshCounter()
+            counts = Counter()
             output_dict = True
         else:
             output_dict = False
         # Take this check out of the loop, for a bit of extra speed
         if exclude is None:
             for i in range(self.length):
-                counts.inc(get_token_attr(&self.c[i], attr_id), 1)
+                counts[get_token_attr(&self.c[i], attr_id)] += 1
         else:
             for i in range(self.length):
                 if not exclude(self[i]):
-                    attr = get_token_attr(&self.c[i], attr_id)
-                    counts.inc(attr, 1)
+                    counts[get_token_attr(&self.c[i], attr_id)] += 1
         if output_dict:
             return dict(counts)

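Swapping `PreshCounter` for `collections.Counter` doesn't change the public behaviour of `count_by`: it still returns a plain dict keyed by the (hashed) attribute values. For example:

```python
# count_by returns {attribute value (uint64 hash): count}; the Counter
# refactor above only changes the internal bookkeeping.
from spacy.attrs import ORTH
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=["apple", "apple", "orange"])
counts = doc.count_by(ORTH)
assert counts[doc.vocab.strings["apple"]] == 2
```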
@@ -850,7 +857,7 @@ cdef class Doc:

         DOCS: https://spacy.io/api/doc#to_bytes
         """
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]  # TODO: ENT_KB_ID ?
         if self.is_tagged:
             array_head.append(TAG)
         # If doc parsed add head and dep attribute
@@ -1004,6 +1011,7 @@ cdef class Doc:
         """
         cdef unicode tag, lemma, ent_type
         deprecation_warning(Warnings.W013.format(obj="Doc"))
+        # TODO: ENT_KB_ID ?
         if len(args) == 3:
             deprecation_warning(Warnings.W003)
             tag, lemma, ent_type = args
@@ -210,7 +210,7 @@ cdef class Span:
         words = [t.text for t in self]
         spaces = [bool(t.whitespace_) for t in self]
         cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID]
         if self.doc.is_tagged:
             array_head.append(TAG)
         # If doc parsed add head and dep attribute
@@ -53,6 +53,8 @@ cdef class Token:
             return token.ent_iob
         elif feat_name == ENT_TYPE:
             return token.ent_type
+        elif feat_name == ENT_KB_ID:
+            return token.ent_kb_id
         elif feat_name == SENT_START:
             return token.sent_start
         else:
@@ -79,5 +81,7 @@ cdef class Token:
             token.ent_iob = value
         elif feat_name == ENT_TYPE:
             token.ent_type = value
+        elif feat_name == ENT_KB_ID:
+            token.ent_kb_id = value
         elif feat_name == SENT_START:
             token.sent_start = value
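Together with the `ENT_KB_ID` import added to `doc.pyx` above, these getter/setter branches let the new attribute round-trip through `Doc.from_array` / `Doc.to_array`. A sketch, assuming `ENT_KB_ID` is also exposed via `spacy.attrs` and that knowledge-base IDs are stored as string-store hashes; the "Q615" identifier is a made-up example value:

```python
# Sketch: writing and reading ENT_KB_ID through the array API, which routes
# through the struct getters/setters added above.
import numpy
from spacy.attrs import ENT_KB_ID
from spacy.vocab import Vocab
from spacy.tokens import Doc

vocab = Vocab()
doc = Doc(vocab, words=["Lionel", "Messi"])
kb_id = vocab.strings.add("Q615")  # hash of the (hypothetical) KB identifier
values = numpy.asarray([[kb_id], [kb_id]], dtype="uint64")
doc.from_array([ENT_KB_ID], values)            # exercises set_struct_attr
assert doc.to_array([ENT_KB_ID])[0] == kb_id   # exercises get_token_attr
```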
@@ -520,7 +520,9 @@ spaCy takes training data in JSON format. The built-in
 [`convert`](/api/cli#convert) command helps you convert the `.conllu` format
 used by the
 [Universal Dependencies corpora](https://github.com/UniversalDependencies) to
-spaCy's training format.
+spaCy's training format. To convert one or more existing `Doc` objects to
+spaCy's JSON format, you can use the
+[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.

 > #### Annotating entities
 >
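A short sketch of the `Doc`-to-JSON conversion the new sentence points to, assuming the `docs_to_json(docs, id=0)` signature documented on the linked API page and a small English pipeline being installed:

```python
# Sketch: converting annotated Doc objects into spaCy's JSON training format
# and writing the result to disk with srsly. Assumes en_core_web_sm exists.
import srsly
import spacy
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
docs = [nlp("I like London and Berlin.")]
json_data = docs_to_json(docs)  # one "document" dict with paragraphs
srsly.write_json("training_data.json", [json_data])  # train file is a list
```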
Some files were not shown because too many files have changed in this diff.