Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-26 13:41:21 +03:00)

Merge branch 'master' into develop

commit 330c039106
							
								
								
									
.github/contributors/BigstickCarpet.md (106 lines added; vendored; normal file)
|  | @@ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [ X] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                | | ||||||
|  | |------------------------------- | -------------------- | | ||||||
|  | | Name                           | James Messinger                     | | ||||||
|  | | Company name (if applicable)   |                      | | ||||||
|  | | Title or role (if applicable)  |                      | | ||||||
|  | | Date                           | May 23, 2018                     | | ||||||
|  | | GitHub username                | BigstickCarpet                     | | ||||||
|  | | Website (optional)             |                      | | ||||||
							
								
								
									
.github/contributors/aristorinjuang.md (106 lines added; vendored; normal file)
|  | @@ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                      | | ||||||
|  | |------------------------------- | -------------------------- | | ||||||
|  | | Name                           | Aristo Rinjuang            | | ||||||
|  | | Company name (if applicable)   |                            | | ||||||
|  | | Title or role (if applicable)  |                            | | ||||||
|  | | Date                           | May 22, 2018               | | ||||||
|  | | GitHub username                | aristorinjuang             | | ||||||
|  | | Website (optional)             | https://aristorinjuang.com | | ||||||
							
								
								
									
.github/contributors/armsp.md (106 lines added; vendored; normal file)
|  | @@ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                | | ||||||
|  | |------------------------------- | -------------------- | | ||||||
|  | | Name                           |  Shantam             | | ||||||
|  | | Company name (if applicable)   |                      | | ||||||
|  | | Title or role (if applicable)  |                      | | ||||||
|  | | Date                           |   21/5/2018          | | ||||||
|  | | GitHub username                |     armsp            | | ||||||
|  | | Website (optional)             |                      | | ||||||
							
								
								
									
.github/contributors/idealley.md (106 lines added; vendored; normal file)
|  | @@ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                | | ||||||
|  | |------------------------------- | -------------------- | | ||||||
|  | | Name                           |    Pouyt Samuel      | | ||||||
|  | | Company name (if applicable)   |                      | | ||||||
|  | | Title or role (if applicable)  |                      | | ||||||
|  | | Date                           |    26.05.2018        | | ||||||
|  | | GitHub username                |    Idealley          | | ||||||
|  | | Website (optional)             |                      | | ||||||
|  | @@ -118,7 +118,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0, | ||||||
|     optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu) |     optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu) | ||||||
|     nlp._optimizer = None |     nlp._optimizer = None | ||||||
| 
 | 
 | ||||||
|     print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %") |     print("Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS") | ||||||
|     try: |     try: | ||||||
|         for i in range(n_iter): |         for i in range(n_iter): | ||||||
|             train_docs = corpus.train_docs(nlp, noise_level=0.0, |             train_docs = corpus.train_docs(nlp, noise_level=0.0, | ||||||
|  | @@ -208,17 +208,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0): | ||||||
|     scores.update(dev_scores) |     scores.update(dev_scores) | ||||||
|     scores['cpu_wps'] = cpu_wps |     scores['cpu_wps'] = cpu_wps | ||||||
|     scores['gpu_wps'] = gpu_wps or 0.0 |     scores['gpu_wps'] = gpu_wps or 0.0 | ||||||
|     tpl = '\t'.join(( |     tpl = ''.join(( | ||||||
|         '{:d}', |         '{:<6d}', | ||||||
|         '{dep_loss:.3f}', |         '{dep_loss:<10.3f}', | ||||||
|         '{ner_loss:.3f}', |         '{ner_loss:<10.3f}', | ||||||
|         '{uas:.3f}', |         '{uas:<8.3f}', | ||||||
|         '{ents_p:.3f}', |         '{ents_p:<8.3f}', | ||||||
|         '{ents_r:.3f}', |         '{ents_r:<8.3f}', | ||||||
|         '{ents_f:.3f}', |         '{ents_f:<8.3f}', | ||||||
|         '{tags_acc:.3f}', |         '{tags_acc:<8.3f}', | ||||||
|         '{token_acc:.3f}', |         '{token_acc:<9.3f}', | ||||||
|         '{cpu_wps:.1f}', |         '{cpu_wps:<9.1f}', | ||||||
|         '{gpu_wps:.1f}', |         '{gpu_wps:.1f}', | ||||||
|     )) |     )) | ||||||
|     print(tpl.format(itn, **scores)) |     print(tpl.format(itn, **scores)) | ||||||
|  |  | ||||||
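The change from tab-joined fields to fixed-width format specs can be tried in isolation. The snippet below is a minimal sketch, not the project's code: `COLUMNS`, `HEADER` and `format_row` are hypothetical names, and only the first four columns are reproduced, with widths copied from the hunk above.

```python
# Hypothetical column layout: (title, width) pairs taken from the format
# specs in the diff ('{:<6d}', '{dep_loss:<10.3f}', ...).
COLUMNS = [("Itn.", 6), ("Dep Loss", 10), ("NER Loss", 10), ("UAS", 8)]

# Pad each title to its column width so the header lines up with the rows.
HEADER = "".join(title.ljust(width) for title, width in COLUMNS)


def format_row(itn, scores):
    # '<' left-aligns the value; the number after it is the minimum width.
    tpl = "".join((
        "{:<6d}",
        "{dep_loss:<10.3f}",
        "{ner_loss:<10.3f}",
        "{uas:<8.3f}",
    ))
    return tpl.format(itn, **scores)


scores = {"dep_loss": 12.3456, "ner_loss": 0.5, "uas": 87.123}
row = format_row(3, scores)
print(HEADER)
print(row)
```

Because every field is padded to a fixed width, the columns stay aligned regardless of how a terminal expands tabs, which is presumably the motivation for the change.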
|  | @@ -4,19 +4,10 @@ from __future__ import unicode_literals | ||||||
| from ...attrs import LIKE_NUM | from ...attrs import LIKE_NUM | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| _num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', | _num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', | ||||||
|               'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', |               'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh', | ||||||
|               'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', |               'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun', | ||||||
|               'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', |               'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun'] | ||||||
|               'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion', |  | ||||||
|               'gajillion', 'bazillion', |  | ||||||
|               'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', |  | ||||||
|               'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas', |  | ||||||
|               'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas', |  | ||||||
|               'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta', |  | ||||||
|               'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun', |  | ||||||
|               'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', |  | ||||||
|               'noniliun', 'desiliun'] |  | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def like_num(text): | def like_num(text): | ||||||
|  |  | ||||||
|  | @ -1,14 +1,7 @@ | ||||||
| # coding: utf8 | # coding: utf8 | ||||||
| from __future__ import unicode_literals | from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| _exc = { | _exc = {} | ||||||
|     "Rp": "$", |  | ||||||
|     "IDR": "$", |  | ||||||
|     "RMB": "$", |  | ||||||
|     "USD": "$", |  | ||||||
|     "AUD": "$", |  | ||||||
|     "GBP": "$", |  | ||||||
| } |  | ||||||
| 
 | 
 | ||||||
| NORM_EXCEPTIONS = {} | NORM_EXCEPTIONS = {} | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -5,7 +5,7 @@ import regex as re | ||||||
| 
 | 
 | ||||||
| from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS | from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS | ||||||
| from ..tokenizer_exceptions import URL_PATTERN | from ..tokenizer_exceptions import URL_PATTERN | ||||||
| from ...symbols import ORTH | from ...symbols import ORTH, LEMMA, NORM | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| _exc = {} | _exc = {} | ||||||
|  | @ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS: | ||||||
|         orth_caps = '-'.join([part.upper() for part in orth.split('-')]) |         orth_caps = '-'.join([part.upper() for part in orth.split('-')]) | ||||||
|         _exc[orth_caps] = [{ORTH: orth_caps}] |         _exc[orth_caps] = [{ORTH: orth_caps}] | ||||||
| 
 | 
 | ||||||
|  | for exc_data in [ | ||||||
|  |     {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"}, | ||||||
|  |     {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"}, | ||||||
|  |     {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"}, | ||||||
|  |     {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"}, | ||||||
|  |     {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"}, | ||||||
|  |     {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"}, | ||||||
|  | 
 | ||||||
|  |     {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"}, | ||||||
|  |     {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"}, | ||||||
|  |     {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"}, | ||||||
|  |     {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"}, | ||||||
|  |     {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"}, | ||||||
|  |     {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"}, | ||||||
|  |     {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"}, | ||||||
|  |     {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"}, | ||||||
|  |     {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"}, | ||||||
|  |     {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"}, | ||||||
|  |     {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"}, | ||||||
|  |     {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"}, | ||||||
|  |     {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"}, | ||||||
|  |     {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"}, | ||||||
|  | 
 | ||||||
|  |     {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"}, | ||||||
|  |     {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"}, | ||||||
|  |     {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"}, | ||||||
|  |     {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, | ||||||
|  |     {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"}, | ||||||
|  |     {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"}, | ||||||
|  |     {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"}, | ||||||
|  |     {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"}, | ||||||
|  |     {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, | ||||||
|  |     {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"}, | ||||||
|  |     {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, | ||||||
|  |     {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]: | ||||||
|  |     _exc[exc_data[ORTH]] = [exc_data] | ||||||
| 
 | 
 | ||||||
| for orth in [ | for orth in [ | ||||||
|     "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.", |     "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.", | ||||||
|     "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.", |  | ||||||
|     "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.", |  | ||||||
|     "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.", |  | ||||||
|     "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.", |     "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.", | ||||||
|     "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.", |     "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,", | ||||||
|     "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.", |     "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl", | ||||||
|     "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", |     "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", | ||||||
|     "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.", |     "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars", | ||||||
|  |     "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han", | ||||||
|  |     "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP", | ||||||
|  |     "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH", | ||||||
|  |     "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat", | ||||||
|  |     "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.", | ||||||
|  |     "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.", | ||||||
|  |     "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I" "S.TI.", "S.T.P.", "S.TrK", | ||||||
|  |     "S.Tekp.", "S.Th.", | ||||||
|     "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.", |     "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.", | ||||||
|     "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o", |     "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o", | ||||||
|     "n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.", |     "n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.", | ||||||
|  |  | ||||||
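Note on this hunk: two places in the list have no comma between neighbouring literals (`"S.Th.I" "S.TI."` on the new side, and `"p.p." "pjs."` on both sides). Python joins adjacent string literals, so each pair becomes a single entry and neither abbreviation is registered on its own. A short sketch of the effect, using a shortened, illustrative list rather than the full one from the diff:

```python
# Adjacent string literals are concatenated by Python, so the missing comma
# between "S.Th.I" and "S.TI." silently merges the two entries into one.
titles = ["S.Th.", "S.Th.I" "S.TI.", "S.T.P."]

print(len(titles))   # 3 entries, not 4
print(titles[1])     # the merged entry

# With the comma in place, both abbreviations are separate entries.
fixed = ["S.Th.", "S.Th.I", "S.TI.", "S.T.P."]
print(len(fixed))
```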
							
								
								
									
spacy/lang/ro/lex_attrs.py (Normal file, 42 lines)
							|  | @ -0,0 +1,42 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from ...attrs import LIKE_NUM | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | _num_words = set(""" | ||||||
|  | zero unu doi două trei patru cinci șase șapte opt nouă zece | ||||||
|  | unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece | ||||||
|  | douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci | ||||||
|  | sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii | ||||||
|  | """.split()) | ||||||
|  | 
 | ||||||
|  | _ordinal_words = set(""" | ||||||
|  | primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea | ||||||
|  | prima doua treia patra cincia șasea șaptea opta noua zecea | ||||||
|  | unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea | ||||||
|  | unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea | ||||||
|  | douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea | ||||||
|  | douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea nouăzecea suta | ||||||
|  | miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia | ||||||
|  | """.split()) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def like_num(text): | ||||||
|  |     text = text.replace(',', '').replace('.', '') | ||||||
|  |     if text.isdigit(): | ||||||
|  |         return True | ||||||
|  |     if text.count('/') == 1: | ||||||
|  |         num, denom = text.split('/') | ||||||
|  |         if num.isdigit() and denom.isdigit(): | ||||||
|  |             return True | ||||||
|  |     if text.lower() in _num_words: | ||||||
|  |         return True | ||||||
|  |     if text.lower() in _ordinal_words: | ||||||
|  |         return True | ||||||
|  |     return False | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | LEX_ATTRS = { | ||||||
|  |     LIKE_NUM: like_num | ||||||
|  | } | ||||||
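Note on this file: `like_num` strips thousands and decimal separators first, then accepts plain digits, simple fractions, and any word found in the cardinal or ordinal sets. The sketch below reproduces that logic with abbreviated word sets (an assumption made for brevity; the full lists are in the diff above) to show which inputs pass.

```python
# Abbreviated word sets for illustration; the real file lists many more.
_num_words = set("zero unu doi două trei sută mie milion".split())
_ordinal_words = set("primul doilea treilea prima doua treia".split())


def like_num(text):
    # Drop separators so that "10.000" or "1,5" reduce to digits only.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept exactly one slash with digits on both sides, e.g. "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


for sample in ['10.000', '3/4', 'Două', 'treilea', '1/2/3', 'casă', '.']:
    print(sample, like_num(sample))
```

A lone `.` is not number-like: after the separators are removed the string is empty, and `''.isdigit()` is `False`.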
|  | @ -9,8 +9,9 @@ _exc = {} | ||||||
| 
 | 
 | ||||||
| # Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations | # Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations | ||||||
| for orth in [ | for orth in [ | ||||||
|     "1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea", |     "1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a", | ||||||
|     "d-voastră", "dvs.", "Rom.", "str."]: |     "1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea", | ||||||
|  |     "d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]: | ||||||
|     _exc[orth] = [{ORTH: orth}] |     _exc[orth] = [{ORTH: orth}] | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -15,7 +15,7 @@ from .. import util | ||||||
| # here if it's using spaCy's tokenizer (not a different library) | # here if it's using spaCy's tokenizer (not a different library) | ||||||
| # TODO: re-implement generic tokenizer tests | # TODO: re-implement generic tokenizer tests | ||||||
| _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id', | _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id', | ||||||
|               'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx'] |               'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx'] | ||||||
| 
 | 
 | ||||||
| _models = {'en': ['en_core_web_sm'], | _models = {'en': ['en_core_web_sm'], | ||||||
|            'de': ['de_core_news_md'], |            'de': ['de_core_news_md'], | ||||||
|  |  | ||||||
							
								
								
									
spacy/tests/lang/ro/test_tokenizer.py (Normal file, 25 lines)
							|  | @ -0,0 +1,25 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import pytest | ||||||
|  | 
 | ||||||
|  | DEFAULT_TESTS = [ | ||||||
|  |     ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), | ||||||
|  |     ('Teste, etc.', ['Teste', ',', 'etc.']), | ||||||
|  |     ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), | ||||||
|  |     ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']) | ||||||
|  | ] | ||||||
|  | 
 | ||||||
|  | NUMBER_TESTS = [ | ||||||
|  |     ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), | ||||||
|  |     ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) | ||||||
|  | ] | ||||||
|  | 
 | ||||||
|  | TESTCASES = DEFAULT_TESTS + NUMBER_TESTS | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.mark.parametrize('text,expected_tokens', TESTCASES) | ||||||
|  | def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens): | ||||||
|  |     tokens = ro_tokenizer(text) | ||||||
|  |     token_list = [token.text for token in tokens if not token.is_space] | ||||||
|  |     assert expected_tokens == token_list | ||||||
|  | @ -53,7 +53,7 @@ p | ||||||
|     +tag-new(2) |     +tag-new(2) | ||||||
| 
 | 
 | ||||||
| p | p | ||||||
|     |  The populate a model's vocabulary, you can use the |     |  To populate a model's vocabulary, you can use the | ||||||
|     |  #[+api("cli#vocab") #[code spacy vocab]] command and load in a |     |  #[+api("cli#vocab") #[code spacy vocab]] command and load in a | ||||||
|     |  #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON] |     |  #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON] | ||||||
|     |  (JSONL) file containing one lexical entry per line. The first line |     |  (JSONL) file containing one lexical entry per line. The first line | ||||||
|  |  | ||||||
|  | @ -16,7 +16,9 @@ | ||||||
| 
 | 
 | ||||||
|     +qs({package: 'source'}) git clone https://github.com/explosion/spaCy |     +qs({package: 'source'}) git clone https://github.com/explosion/spaCy | ||||||
|     +qs({package: 'source'}) cd spaCy |     +qs({package: 'source'}) cd spaCy | ||||||
|     +qs({package: 'source'}) export PYTHONPATH=`pwd` |     +qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd` | ||||||
|  |     +qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd` | ||||||
|  |     +qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy | ||||||
|     +qs({package: 'source'}) pip install -r requirements.txt |     +qs({package: 'source'}) pip install -r requirements.txt | ||||||
|     +qs({package: 'source'}) python setup.py build_ext --inplace |     +qs({package: 'source'}) python setup.py build_ext --inplace | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -184,7 +184,7 @@ p | ||||||
| 
 | 
 | ||||||
| p | p | ||||||
|     |  In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators |     |  In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators | ||||||
|     |  behave inconsistently. They were usually interpretted |     |  behave inconsistently. They were usually interpreted | ||||||
|     |  "greedily", i.e. longer matches are returned where possible. However, if |     |  "greedily", i.e. longer matches are returned where possible. However, if | ||||||
|     |  you specify two #[code +] and #[code *] patterns in a row and their |     |  you specify two #[code +] and #[code *] patterns in a row and their | ||||||
|     |  matches overlap, the first operator will behave non-greedily. This quirk |     |  matches overlap, the first operator will behave non-greedily. This quirk | ||||||
|  | @ -260,41 +260,6 @@ p | ||||||
|     doc = nlp(u"This is a text about Google I/O 2015.") |     doc = nlp(u"This is a text about Google I/O 2015.") | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
| 
 | 
 | ||||||
| p |  | ||||||
|     |  In addition to mentions of "Google I/O", your data also contains some |  | ||||||
|     |  annoying pre-processing artefacts, like leftover HTML line breaks |  | ||||||
|     |  (e.g. #[code <br>] or #[code <BR/>]). While you're at it, |  | ||||||
|     |  you want to merge those into one token and flag them, to make sure you |  | ||||||
|     |  can easily ignore them later. So you add a second pattern and pass in a |  | ||||||
|     |  function #[code merge_and_flag]: |  | ||||||
| 
 |  | ||||||
| +code-exec. |  | ||||||
|     import spacy |  | ||||||
|     from spacy.matcher import Matcher |  | ||||||
|     from spacy.tokens import Token |  | ||||||
| 
 |  | ||||||
|     nlp = spacy.load('en_core_web_sm') |  | ||||||
|     matcher = Matcher(nlp.vocab) |  | ||||||
|     # register a new token extension to flag bad HTML |  | ||||||
|     Token.set_extension('bad_html', default=False) |  | ||||||
| 
 |  | ||||||
|     def merge_and_flag(matcher, doc, i, matches): |  | ||||||
|         match_id, start, end = matches[i] |  | ||||||
|         span = doc[start : end] |  | ||||||
|         span.merge(is_stop=True) # merge (and mark it as a stop word, just in case) |  | ||||||
|         for token in span: |  | ||||||
|             token._.bad_html = True  # mark token as bad HTML |  | ||||||
|         print(span.text) |  | ||||||
| 
 |  | ||||||
|     matcher.add('BAD_HTML', merge_and_flag, |  | ||||||
|                 [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], |  | ||||||
|                 [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) |  | ||||||
| 
 |  | ||||||
|     doc = nlp(u"Hello<br>world!") |  | ||||||
|     matches = matcher(doc) |  | ||||||
|     for token in doc: |  | ||||||
|         print(token.text, token._.bad_html) |  | ||||||
| 
 |  | ||||||
| +aside("Tip: Visualizing matches") | +aside("Tip: Visualizing matches") | ||||||
|     |  When working with entities, you can use #[+api("top-level#displacy") displaCy] |     |  When working with entities, you can use #[+api("top-level#displacy") displaCy] | ||||||
|     |  to quickly generate a NER visualization from your updated #[code Doc], |     |  to quickly generate a NER visualization from your updated #[code Doc], | ||||||
|  | @ -315,7 +280,7 @@ p | ||||||
|     |  that was matched, and invoke it. |     |  that was matched, and invoke it. | ||||||
| 
 | 
 | ||||||
| +code. | +code. | ||||||
|     doc = nlp(LOTS_OF_TEXT) |     doc = nlp(YOUR_TEXT_HERE) | ||||||
|     matcher(doc) |     matcher(doc) | ||||||
| 
 | 
 | ||||||
| p | p | ||||||
|  | @ -348,6 +313,69 @@ p | ||||||
|             |  A list of #[code (match_id, start, end)] tuples, describing the |             |  A list of #[code (match_id, start, end)] tuples, describing the | ||||||
|             |  matches. A match tuple describes a span #[code doc[start:end]]. |             |  matches. A match tuple describes a span #[code doc[start:end]]. | ||||||
| 
 | 
 | ||||||
|  | +h(3, "matcher-pipeline") Using custom pipeline components | ||||||
|  | 
 | ||||||
|  | p | ||||||
|  |     |  Let's say your data also contains some annoying pre-processing artefacts, | ||||||
|  |     |  like leftover HTML line breaks (e.g. #[code <br>] or | ||||||
|  |     |  #[code <BR/>]). To make your text easier to analyse, you want to | ||||||
|  |     |  merge those into one token and flag them, to make sure you | ||||||
|  |     |  can ignore them later. Ideally, this should all be done automatically | ||||||
|  |     |  as you process the text. You can achieve this by adding a | ||||||
|  |     |  #[+a("/usage/processing-pipelines#custom-components") custom pipeline component] | ||||||
|  |     |  that's called on each #[code Doc] object, merges the leftover HTML spans | ||||||
|  |     |  and sets an attribute #[code bad_html] on the token. | ||||||
|  | 
 | ||||||
|  | +code-exec. | ||||||
|  |     import spacy | ||||||
|  |     from spacy.matcher import Matcher | ||||||
|  |     from spacy.tokens import Token | ||||||
|  | 
 | ||||||
|  |     # we're using a class because the component needs to be initialised with | ||||||
|  |     # the shared vocab via the nlp object | ||||||
|  |     class BadHTMLMerger(object): | ||||||
|  |         def __init__(self, nlp): | ||||||
|  |             # register a new token extension to flag bad HTML | ||||||
|  |             Token.set_extension('bad_html', default=False) | ||||||
|  |             self.matcher = Matcher(nlp.vocab) | ||||||
|  |             self.matcher.add('BAD_HTML', None, | ||||||
|  |                 [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], | ||||||
|  |                 [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) | ||||||
|  | 
 | ||||||
|  |         def __call__(self, doc): | ||||||
|  |             # this method is invoked when the component is called on a Doc | ||||||
|  |             matches = self.matcher(doc) | ||||||
|  |             spans = []  # collect the matched spans here | ||||||
|  |             for match_id, start, end in matches: | ||||||
|  |                 spans.append(doc[start:end]) | ||||||
|  |             for span in spans: | ||||||
|  |                 span.merge(is_stop=True) # merge (and mark it as a stop word) | ||||||
|  |                 for token in span: | ||||||
|  |                     token._.bad_html = True  # mark token as bad HTML | ||||||
|  |             return doc | ||||||
|  | 
 | ||||||
|  |     nlp = spacy.load('en_core_web_sm') | ||||||
|  |     html_merger = BadHTMLMerger(nlp) | ||||||
|  |     nlp.add_pipe(html_merger, last=True)  # add component to the pipeline | ||||||
|  |     doc = nlp(u"Hello<br>world! <br/> This is a test.") | ||||||
|  |     for token in doc: | ||||||
|  |         print(token.text, token._.bad_html) | ||||||
|  | 
 | ||||||
|  | p | ||||||
|  |     |  Instead of hard-coding the patterns into the component, you could also | ||||||
|  |     |  make it take a path to a JSON file containing the patterns. This lets | ||||||
|  |     |  you reuse the component with different patterns, depending on your | ||||||
|  |     |  application: | ||||||
|  | 
 | ||||||
|  | +code. | ||||||
|  |     html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json') | ||||||
|  | 
 | ||||||
|  | +infobox | ||||||
|  |     |  For more details and examples of how to | ||||||
|  |     |  #[strong create custom pipeline components] and | ||||||
|  |     |  #[strong extension attributes], see the | ||||||
|  |     |  #[+a("/usage/processing-pipelines") usage guide]. | ||||||
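Note on this hunk: the docs mention a `path` argument for `BadHTMLMerger` but do not show how the patterns would be read. One possible approach is sketched below, under two assumptions: the JSON file holds a list of patterns in the same shape as those passed to `matcher.add` above, and `load_patterns` is a hypothetical helper, not part of spaCy.

```python
import json
import os
import tempfile


def load_patterns(path):
    """Hypothetical helper: read a list of token patterns from a JSON file."""
    with open(path, encoding='utf8') as f:
        patterns = json.load(f)
    if not isinstance(patterns, list):
        raise ValueError("expected a JSON list of patterns")
    return patterns


# The two patterns from the docs example, written to a temporary file so the
# sketch is runnable without any fixture on disk.
patterns = [
    [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
    [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}],
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf8') as f:
    json.dump(patterns, f)
    tmp_path = f.name

loaded = load_patterns(tmp_path)
os.remove(tmp_path)
print(len(loaded))

# Inside a BadHTMLMerger.__init__(self, nlp, path), the loaded patterns could
# then be registered with: self.matcher.add('BAD_HTML', None, *loaded)
```

Keeping the file format identical to the in-code patterns means the component body stays unchanged; only the source of the patterns differs.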
|  | 
 | ||||||
| +h(3, "regex") Using regular expressions | +h(3, "regex") Using regular expressions | ||||||
| 
 | 
 | ||||||
| p | p | ||||||
|  |  | ||||||
|  | @ -52,7 +52,7 @@ p | ||||||
| 
 | 
 | ||||||
| +code(false, "bash"). | +code(false, "bash"). | ||||||
|     wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz |     wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz | ||||||
|     python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz |     python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz | ||||||
| 
 | 
 | ||||||
| p | p | ||||||
|     |  This will output a spaCy model in the directory |     |  This will output a spaCy model in the directory | ||||||
|  |  | ||||||