Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-24 20:51:30 +03:00)

Merge branch 'master' of github.com:GregDubbin/spaCy

Commit 441f490c1c
.github/contributors/fucking-signup.md (vendored, normal file, 106 lines added)
							|  | @ -0,0 +1,106 @@ | |||
| # spaCy contributor agreement | ||||
| 
 | ||||
| This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||
| [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||
| The SCA applies to any contribution that you make to any product or project | ||||
| managed by us (the **"project"**), and sets out the intellectual property rights | ||||
| you grant to us in the contributed materials. The term **"us"** shall mean | ||||
| [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||
| **"you"** shall mean the person or entity identified below. | ||||
| 
 | ||||
| If you agree to be bound by these terms, fill in the information requested | ||||
| below and include the filled-in version with your first pull request, under the | ||||
| folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||
| should be your GitHub username, with the extension `.md`. For example, the user | ||||
| example_user would create the file `.github/contributors/example_user.md`. | ||||
| 
 | ||||
| Read this agreement carefully before signing. These terms and conditions | ||||
| constitute a binding legal agreement. | ||||
| 
 | ||||
| ## Contributor Agreement | ||||
| 
 | ||||
| 1. The term "contribution" or "contributed materials" means any source code, | ||||
| object code, patch, tool, sample, graphic, specification, manual, | ||||
| documentation, or any other material posted or submitted by you to the project. | ||||
| 
 | ||||
| 2. With respect to any worldwide copyrights, or copyright applications and | ||||
| registrations, in your contribution: | ||||
| 
 | ||||
|     * you hereby assign to us joint ownership, and to the extent that such | ||||
|     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||
|     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||
|     royalty-free, unrestricted license to exercise all rights under those | ||||
|     copyrights. This includes, at our option, the right to sublicense these same | ||||
|     rights to third parties through multiple levels of sublicensees or other | ||||
|     licensing arrangements; | ||||
| 
 | ||||
|     * you agree that each of us can do all things in relation to your | ||||
|     contribution as if each of us were the sole owners, and if one of us makes | ||||
|     a derivative work of your contribution, the one who makes the derivative | ||||
|     work (or has it made will be the sole owner of that derivative work; | ||||
| 
 | ||||
|     * you agree that you will not assert any moral rights in your contribution | ||||
|     against us, our licensees or transferees; | ||||
| 
 | ||||
|     * you agree that we may register a copyright in your contribution and | ||||
|     exercise all ownership rights associated with it; and | ||||
| 
 | ||||
|     * you agree that neither of us has any duty to consult with, obtain the | ||||
|     consent of, pay or render an accounting to the other for any use or | ||||
|     distribution of your contribution. | ||||
| 
 | ||||
| 3. With respect to any patents you own, or that you can license without payment | ||||
| to any third party, you hereby grant to us a perpetual, irrevocable, | ||||
| non-exclusive, worldwide, no-charge, royalty-free license to: | ||||
| 
 | ||||
|     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||
|     your contribution in whole or in part, alone or in combination with or | ||||
|     included in any product, work or materials arising out of the project to | ||||
|     which your contribution was submitted, and | ||||
| 
 | ||||
|     * at our option, to sublicense these same rights to third parties through | ||||
|     multiple levels of sublicensees or other licensing arrangements. | ||||
| 
 | ||||
| 4. Except as set out above, you keep all right, title, and interest in your | ||||
| contribution. The rights that you grant to us under these terms are effective | ||||
| on the date you first submitted a contribution to us, even if your submission | ||||
| took place before the date you sign these terms. | ||||
| 
 | ||||
| 5. You covenant, represent, warrant and agree that: | ||||
| 
 | ||||
|     * Each contribution that you submit is and shall be an original work of | ||||
|     authorship and you can legally grant the rights set out in this SCA; | ||||
| 
 | ||||
|     * to the best of your knowledge, each contribution will not violate any | ||||
|     third party's copyrights, trademarks, patents, or other intellectual | ||||
|     property rights; and | ||||
| 
 | ||||
|     * each contribution shall be in compliance with U.S. export control laws and | ||||
|     other applicable export and import laws. You agree to notify us if you | ||||
|     become aware of any circumstance which would make any of the foregoing | ||||
|     representations inaccurate in any respect. We may publicly disclose your | ||||
|     participation in the project, including the fact that you have signed the SCA. | ||||
| 
 | ||||
| 6. This SCA is governed by the laws of the State of California and applicable | ||||
| U.S. Federal law. Any choice of law rules will not apply. | ||||
| 
 | ||||
| 7. Please place an “x” on one of the applicable statement below. Please do NOT | ||||
| mark both statements: | ||||
| 
 | ||||
|     * [x] I am signing on behalf of myself as an individual and no other person | ||||
|     or entity, including my employer, has or will have rights with respect to my | ||||
|     contributions. | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||
|     actual authority to contractually bind that entity. | ||||
| 
 | ||||
| ## Contributor Details | ||||
| 
 | ||||
| | Field                          | Entry                | | ||||
| |------------------------------- | -------------------- | | ||||
| | Name                           | Kit                  | | ||||
| | Company name (if applicable)   | -                    | | ||||
| | Title or role (if applicable)  | -                    | | ||||
| | Date                           | 2018/01/08           | | ||||
| | GitHub username                | fucking-signup       | | ||||
| | Website (optional)             | -                    | | ||||
.github/contributors/pbnsilva.md (vendored, normal file, 106 lines added)
							|  | @ -0,0 +1,106 @@ | |||
| # spaCy contributor agreement | ||||
| 
 | ||||
| This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||
| [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||
| The SCA applies to any contribution that you make to any product or project | ||||
| managed by us (the **"project"**), and sets out the intellectual property rights | ||||
| you grant to us in the contributed materials. The term **"us"** shall mean | ||||
| [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||
| **"you"** shall mean the person or entity identified below. | ||||
| 
 | ||||
| If you agree to be bound by these terms, fill in the information requested | ||||
| below and include the filled-in version with your first pull request, under the | ||||
| folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||
| should be your GitHub username, with the extension `.md`. For example, the user | ||||
| example_user would create the file `.github/contributors/example_user.md`. | ||||
| 
 | ||||
| Read this agreement carefully before signing. These terms and conditions | ||||
| constitute a binding legal agreement. | ||||
| 
 | ||||
| ## Contributor Agreement | ||||
| 
 | ||||
| 1. The term "contribution" or "contributed materials" means any source code, | ||||
| object code, patch, tool, sample, graphic, specification, manual, | ||||
| documentation, or any other material posted or submitted by you to the project. | ||||
| 
 | ||||
| 2. With respect to any worldwide copyrights, or copyright applications and | ||||
| registrations, in your contribution: | ||||
| 
 | ||||
|     * you hereby assign to us joint ownership, and to the extent that such | ||||
|     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||
|     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||
|     royalty-free, unrestricted license to exercise all rights under those | ||||
|     copyrights. This includes, at our option, the right to sublicense these same | ||||
|     rights to third parties through multiple levels of sublicensees or other | ||||
|     licensing arrangements; | ||||
| 
 | ||||
|     * you agree that each of us can do all things in relation to your | ||||
|     contribution as if each of us were the sole owners, and if one of us makes | ||||
|     a derivative work of your contribution, the one who makes the derivative | ||||
|     work (or has it made will be the sole owner of that derivative work; | ||||
| 
 | ||||
|     * you agree that you will not assert any moral rights in your contribution | ||||
|     against us, our licensees or transferees; | ||||
| 
 | ||||
|     * you agree that we may register a copyright in your contribution and | ||||
|     exercise all ownership rights associated with it; and | ||||
| 
 | ||||
|     * you agree that neither of us has any duty to consult with, obtain the | ||||
|     consent of, pay or render an accounting to the other for any use or | ||||
|     distribution of your contribution. | ||||
| 
 | ||||
| 3. With respect to any patents you own, or that you can license without payment | ||||
| to any third party, you hereby grant to us a perpetual, irrevocable, | ||||
| non-exclusive, worldwide, no-charge, royalty-free license to: | ||||
| 
 | ||||
|     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||
|     your contribution in whole or in part, alone or in combination with or | ||||
|     included in any product, work or materials arising out of the project to | ||||
|     which your contribution was submitted, and | ||||
| 
 | ||||
|     * at our option, to sublicense these same rights to third parties through | ||||
|     multiple levels of sublicensees or other licensing arrangements. | ||||
| 
 | ||||
| 4. Except as set out above, you keep all right, title, and interest in your | ||||
| contribution. The rights that you grant to us under these terms are effective | ||||
| on the date you first submitted a contribution to us, even if your submission | ||||
| took place before the date you sign these terms. | ||||
| 
 | ||||
| 5. You covenant, represent, warrant and agree that: | ||||
| 
 | ||||
|     * Each contribution that you submit is and shall be an original work of | ||||
|     authorship and you can legally grant the rights set out in this SCA; | ||||
| 
 | ||||
|     * to the best of your knowledge, each contribution will not violate any | ||||
|     third party's copyrights, trademarks, patents, or other intellectual | ||||
|     property rights; and | ||||
| 
 | ||||
|     * each contribution shall be in compliance with U.S. export control laws and | ||||
|     other applicable export and import laws. You agree to notify us if you | ||||
|     become aware of any circumstance which would make any of the foregoing | ||||
|     representations inaccurate in any respect. We may publicly disclose your  | ||||
|     participation in the project, including the fact that you have signed the SCA. | ||||
| 
 | ||||
| 6. This SCA is governed by the laws of the State of California and applicable | ||||
| U.S. Federal law. Any choice of law rules will not apply. | ||||
| 
 | ||||
| 7. Please place an “x” on one of the applicable statement below. Please do NOT | ||||
| mark both statements: | ||||
| 
 | ||||
|     * [x] I am signing on behalf of myself as an individual and no other person | ||||
|     or entity, including my employer, has or will have rights with respect to my | ||||
|     contributions. | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||
|     actual authority to contractually bind that entity. | ||||
| 
 | ||||
| ## Contributor Details | ||||
| 
 | ||||
| | Field                          | Entry                | | ||||
| |------------------------------- | -------------------- | | ||||
| | Name                           | Pedro Silva          | | ||||
| | Company name (if applicable)   |                      | | ||||
| | Title or role (if applicable)  |                      | | ||||
| | Date                           | 2018-01-11           | | ||||
| | GitHub username                | pbnsilva             | | ||||
| | Website (optional)             |                      | | ||||
.github/contributors/savkov.md (vendored, normal file, 106 lines added)
							|  | @ -0,0 +1,106 @@ | |||
| # spaCy contributor agreement | ||||
| 
 | ||||
| This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||
| [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||
| The SCA applies to any contribution that you make to any product or project | ||||
| managed by us (the **"project"**), and sets out the intellectual property rights | ||||
| you grant to us in the contributed materials. The term **"us"** shall mean | ||||
| [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||
| **"you"** shall mean the person or entity identified below. | ||||
| 
 | ||||
| If you agree to be bound by these terms, fill in the information requested | ||||
| below and include the filled-in version with your first pull request, under the | ||||
| folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||
| should be your GitHub username, with the extension `.md`. For example, the user | ||||
| example_user would create the file `.github/contributors/example_user.md`. | ||||
| 
 | ||||
| Read this agreement carefully before signing. These terms and conditions | ||||
| constitute a binding legal agreement. | ||||
| 
 | ||||
| ## Contributor Agreement | ||||
| 
 | ||||
| 1. The term "contribution" or "contributed materials" means any source code, | ||||
| object code, patch, tool, sample, graphic, specification, manual, | ||||
| documentation, or any other material posted or submitted by you to the project. | ||||
| 
 | ||||
| 2. With respect to any worldwide copyrights, or copyright applications and | ||||
| registrations, in your contribution: | ||||
| 
 | ||||
|     * you hereby assign to us joint ownership, and to the extent that such | ||||
|     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||
|     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||
|     royalty-free, unrestricted license to exercise all rights under those | ||||
|     copyrights. This includes, at our option, the right to sublicense these same | ||||
|     rights to third parties through multiple levels of sublicensees or other | ||||
|     licensing arrangements; | ||||
| 
 | ||||
|     * you agree that each of us can do all things in relation to your | ||||
|     contribution as if each of us were the sole owners, and if one of us makes | ||||
|     a derivative work of your contribution, the one who makes the derivative | ||||
|     work (or has it made will be the sole owner of that derivative work; | ||||
| 
 | ||||
|     * you agree that you will not assert any moral rights in your contribution | ||||
|     against us, our licensees or transferees; | ||||
| 
 | ||||
|     * you agree that we may register a copyright in your contribution and | ||||
|     exercise all ownership rights associated with it; and | ||||
| 
 | ||||
|     * you agree that neither of us has any duty to consult with, obtain the | ||||
|     consent of, pay or render an accounting to the other for any use or | ||||
|     distribution of your contribution. | ||||
| 
 | ||||
| 3. With respect to any patents you own, or that you can license without payment | ||||
| to any third party, you hereby grant to us a perpetual, irrevocable, | ||||
| non-exclusive, worldwide, no-charge, royalty-free license to: | ||||
| 
 | ||||
|     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||
|     your contribution in whole or in part, alone or in combination with or | ||||
|     included in any product, work or materials arising out of the project to | ||||
|     which your contribution was submitted, and | ||||
| 
 | ||||
|     * at our option, to sublicense these same rights to third parties through | ||||
|     multiple levels of sublicensees or other licensing arrangements. | ||||
| 
 | ||||
| 4. Except as set out above, you keep all right, title, and interest in your | ||||
| contribution. The rights that you grant to us under these terms are effective | ||||
| on the date you first submitted a contribution to us, even if your submission | ||||
| took place before the date you sign these terms. | ||||
| 
 | ||||
| 5. You covenant, represent, warrant and agree that: | ||||
| 
 | ||||
|     * Each contribution that you submit is and shall be an original work of | ||||
|     authorship and you can legally grant the rights set out in this SCA; | ||||
| 
 | ||||
|     * to the best of your knowledge, each contribution will not violate any | ||||
|     third party's copyrights, trademarks, patents, or other intellectual | ||||
|     property rights; and | ||||
| 
 | ||||
|     * each contribution shall be in compliance with U.S. export control laws and | ||||
|     other applicable export and import laws. You agree to notify us if you | ||||
|     become aware of any circumstance which would make any of the foregoing | ||||
|     representations inaccurate in any respect. We may publicly disclose your  | ||||
|     participation in the project, including the fact that you have signed the SCA. | ||||
| 
 | ||||
| 6. This SCA is governed by the laws of the State of California and applicable | ||||
| U.S. Federal law. Any choice of law rules will not apply. | ||||
| 
 | ||||
| 7. Please place an “x” on one of the applicable statement below. Please do NOT | ||||
| mark both statements: | ||||
| 
 | ||||
|     * [x] I am signing on behalf of myself as an individual and no other person | ||||
|     or entity, including my employer, has or will have rights with respect to my | ||||
|     contributions. | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||
|     actual authority to contractually bind that entity. | ||||
| 
 | ||||
| ## Contributor Details | ||||
| 
 | ||||
| | Field                          | Entry                | | ||||
| |------------------------------- | -------------------- | | ||||
| | Name                           | Aleksandar Savkov    | | ||||
| | Company name (if applicable)   |                      | | ||||
| | Title or role (if applicable)  |                      | | ||||
| | Date                           | 11.01.2018           | | ||||
| | GitHub username                | savkov               | | ||||
| | Website (optional)             | sasho.io             | | ||||
setup.py (5 lines changed)
							|  | @ -46,9 +46,8 @@ MOD_NAMES = [ | |||
| 
 | ||||
| COMPILE_OPTIONS =  { | ||||
|     'msvc': ['/Ox', '/EHsc'], | ||||
| -    'mingw32' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'], | ||||
| -    'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function', | ||||
| -               '-march=native'] | ||||
| +    'mingw32' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function'], | ||||
| +    'other' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function'] | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
|  |  | |||
|  | @ -31,24 +31,28 @@ def download(model, direct=False): | |||
|         version = get_version(model_name, compatibility) | ||||
|         dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, | ||||
|                                                             v=version)) | ||||
| -        if dl == 0: | ||||
| -            try: | ||||
| -                # Get package path here because link uses | ||||
| -                # pip.get_installed_distributions() to check if model is a | ||||
| -                # package, which fails if model was just installed via | ||||
| -                # subprocess | ||||
| -                package_path = get_package_path(model_name) | ||||
| -                link(model_name, model, force=True, model_path=package_path) | ||||
| -            except: | ||||
| -                # Dirty, but since spacy.download and the auto-linking is | ||||
| -                # mostly a convenience wrapper, it's best to show a success | ||||
| -                # message and loading instructions, even if linking fails. | ||||
| -                prints( | ||||
| -                    "Creating a shortcut link for 'en' didn't work (maybe " | ||||
| -                    "you don't have admin permissions?), but you can still " | ||||
| -                    "load the model via its full package name:", | ||||
| -                    "nlp = spacy.load('%s')" % model_name, | ||||
| -                    title="Download successful") | ||||
| +        if dl != 0: | ||||
| +            # if download subprocess doesn't return 0, exit with the respective | ||||
| +            # exit code before doing anything else | ||||
| +            sys.exit(dl) | ||||
| +        try: | ||||
| +            # Get package path here because link uses | ||||
| +            # pip.get_installed_distributions() to check if model is a | ||||
| +            # package, which fails if model was just installed via | ||||
| +            # subprocess | ||||
| +            package_path = get_package_path(model_name) | ||||
| +            link(None, model_name, model, force=True, | ||||
| +                    model_path=package_path) | ||||
| +        except: | ||||
| +            # Dirty, but since spacy.download and the auto-linking is | ||||
| +            # mostly a convenience wrapper, it's best to show a success | ||||
| +            # message and loading instructions, even if linking fails. | ||||
| +            prints( | ||||
| +                "Creating a shortcut link for 'en' didn't work (maybe " | ||||
| +                "you don't have admin permissions?), but you can still " | ||||
| +                "load the model via its full package name:", | ||||
| +                "nlp = spacy.load('%s')" % model_name, | ||||
| +                title="Download successful but linking failed") | ||||
| 
 | ||||
| 
 | ||||
| def get_json(url, desc): | ||||
|  | @ -84,5 +88,5 @@ def get_version(model, comp): | |||
| def download_model(filename): | ||||
|     download_url = about.__download_url__ + '/' + filename | ||||
|     return subprocess.call( | ||||
| -        [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', | ||||
| +        [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--no-deps', | ||||
|          download_url], env=os.environ.copy()) | ||||
|  |  | |||
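The change above makes the download command stop with the subprocess's own exit code when `pip` fails, instead of silently skipping the linking step. A minimal sketch of that pattern, assuming a hypothetical helper `run_and_propagate` that is not part of spaCy:

```python
import subprocess
import sys


def run_and_propagate(cmd):
    """Run `cmd` and return 0 on success.

    Hypothetical helper for illustration only. On a non-zero exit code it
    raises SystemExit with that same code, so a calling shell or CI job
    sees the failure.
    """
    code = subprocess.call(cmd)
    if code != 0:
        # Exit with the child's exit code before doing anything else.
        sys.exit(code)
    return code


if __name__ == '__main__':
    # A child process that succeeds, so the script continues normally.
    run_and_propagate([sys.executable, '-c', 'pass'])
    print('child succeeded')
```

Because `sys.exit` raises `SystemExit`, callers that need to clean up can still catch it; left uncaught, it ends the process with the propagated code.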
|  | @ -34,11 +34,18 @@ def link(origin, link_name, force=False, model_path=None): | |||
|                "located here:", path2str(spacy_loc), exits=1, | ||||
|                title="Can't find the spaCy data path to create model symlink") | ||||
|     link_path = util.get_data_path() / link_name | ||||
| -    if link_path.exists() and not force: | ||||
| +    if link_path.is_symlink() and not force: | ||||
|         prints("To overwrite an existing link, use the --force flag.", | ||||
|                title="Link %s already exists" % link_name, exits=1) | ||||
| -    elif link_path.exists(): | ||||
| +    elif link_path.is_symlink():  # does a symlink exist? | ||||
| +        # NB: It's important to check for is_symlink here and not for exists, | ||||
| +        # because invalid/outdated symlinks would return False otherwise. | ||||
|         link_path.unlink() | ||||
| +    elif link_path.exists(): # does it exist otherwise? | ||||
| +        # NB: Check this last because valid symlinks also "exist". | ||||
| +        prints("This can happen if your data directory contains a directory " | ||||
| +               "or file of the same name.", link_path, | ||||
| +               title="Can't overwrite symlink %s" % link_name, exits=1) | ||||
|     try: | ||||
|         symlink_to(link_path, model_path) | ||||
|     except: | ||||
|  |  | |||
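The ordering of the checks in the hunk above matters because `Path.exists()` follows symlinks, so a dangling link reports `False`, while `Path.is_symlink()` inspects the link itself. A small sketch of that behaviour, assuming a platform where unprivileged symlink creation works (for example Linux or macOS); `classify_link_path` and `demo_dangling_symlink` are hypothetical names used only for illustration:

```python
import os
import tempfile
from pathlib import Path


def classify_link_path(link_path):
    """Return 'symlink', 'other' or 'missing' for a candidate link path."""
    # Check is_symlink() first: a dangling symlink has exists() == False.
    if link_path.is_symlink():
        return 'symlink'
    # Check exists() last, because valid symlinks also "exist".
    if link_path.exists():
        return 'other'
    return 'missing'


def demo_dangling_symlink():
    """Create a link, remove its target, and report what pathlib sees."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / 'model_dir'
        target.mkdir()
        link_path = Path(tmp) / 'en'
        os.symlink(str(target), str(link_path))
        valid = classify_link_path(link_path)
        target.rmdir()  # the symlink now points at nothing
        dangling_exists = link_path.exists()
        dangling = classify_link_path(link_path)
        return valid, dangling_exists, dangling


if __name__ == '__main__':
    print(demo_dangling_symlink())
```

Checking `exists()` first would classify the dangling link as missing, and a later attempt to create a link of the same name would then fail.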
|  | @ -4,6 +4,7 @@ from __future__ import unicode_literals, print_function | |||
| import requests | ||||
| import pkg_resources | ||||
| from pathlib import Path | ||||
| +import sys | ||||
| 
 | ||||
| from ..compat import path2str, locale_escape | ||||
| from ..util import prints, get_data_path, read_json | ||||
|  | @ -62,6 +63,9 @@ def validate(): | |||
|                "them from the data directory. Data path: {}" | ||||
|                .format(path2str(get_data_path()))) | ||||
| 
 | ||||
| +    if incompat_models or incompat_links: | ||||
| +        sys.exit(1) | ||||
| 
 | ||||
| 
 | ||||
| def get_model_links(compat): | ||||
|     links = {} | ||||
|  |  | |||
|  | @ -41,9 +41,9 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
| -    if text in _ordinal_words: | ||||
| +    if text.lower() in _ordinal_words: | ||||
|         return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
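The `like_num` hunks in this commit all make the same change: the token text is lowercased before it is looked up in the word lists, so capitalised forms such as a sentence-initial number word still match. A self-contained sketch of the idea follows; the word lists are a small English subset chosen for illustration, and the leading `isdigit()` check is an assumption, since the hunks show only the fraction and word-list branches:

```python
# Illustrative subsets only; the real per-language lists are much longer.
_num_words = ['zero', 'one', 'two', 'three', 'ten', 'hundred']
_ordinal_words = ['first', 'second', 'third']


def like_num(text):
    """Return True if `text` looks like a number (sketch, not spaCy's code)."""
    if text.isdigit():  # assumed branch, not visible in the diff
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Lowercase before the lookup so 'Two' and 'THIRD' match as well.
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


if __name__ == '__main__':
    for token in ['Two', 'THIRD', '1/2', 'dog']:
        print(token, like_num(token))
```

Without the `lower()` call, `'Two'` would fall through to `return False`, because the lists contain lowercase entries only.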
|  | @ -20,7 +20,7 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -31,7 +31,9 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
| +    if text.lower() in _ordinal_words: | ||||
| +        return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -27,7 +27,7 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
|     if text.count('-') == 1: | ||||
|         _, num = text.split('-') | ||||
|  |  | |||
|  | @ -30,7 +30,9 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
| +    if text.lower() in _ordinal_words: | ||||
| +        return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -11,13 +11,13 @@ _num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete', | |||
|               'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião', | ||||
|               'quadrilião'] | ||||
| 
 | ||||
| -_ord_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto', | ||||
| -              'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo', | ||||
| -              'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo', | ||||
| -              'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo', | ||||
| -              'trecentésimo', 'quadringentésimo', 'quingentésimo', 'sexcentésimo', | ||||
| -              'septingentésimo', 'octingentésimo', 'nongentésimo', 'milésimo', | ||||
| -              'milionésimo', 'bilionésimo'] | ||||
| +_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto', | ||||
| +                  'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo', | ||||
| +                  'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo', | ||||
| +                  'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo', | ||||
| +                  'trecentésimo', 'quadringentésimo', 'quingentésimo', 'sexcentésimo', | ||||
| +                  'septingentésimo', 'octingentésimo', 'nongentésimo', 'milésimo', | ||||
| +                  'milionésimo', 'bilionésimo'] | ||||
| 
 | ||||
| 
 | ||||
| def like_num(text): | ||||
|  | @ -28,7 +28,9 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
| +    if text.lower() in _ordinal_words: | ||||
| +        return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -25,7 +25,7 @@ def like_num(text): | |||
|         num, denom = text.split('/') | ||||
|         if num.isdigit() and denom.isdigit(): | ||||
|             return True | ||||
| -    if text in _num_words: | ||||
| +    if text.lower() in _num_words: | ||||
|         return True | ||||
|     return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -40,6 +40,11 @@ cdef class Lexeme: | |||
|         assert self.c.orth == orth | ||||
| 
 | ||||
|     def __richcmp__(self, other, int op): | ||||
| +        if other is None: | ||||
| +            if op == 0 or op == 1 or op == 2: | ||||
| +                return False | ||||
| +            else: | ||||
| +                return True | ||||
|         if isinstance(other, Lexeme): | ||||
|             a = self.orth | ||||
|             b = other.orth | ||||
|  | @ -107,6 +112,14 @@ cdef class Lexeme: | |||
|             `Span`, `Token` and `Lexeme` objects. | ||||
|         RETURNS (float): A scalar similarity score. Higher is more similar. | ||||
|         """ | ||||
| +        # Return 1.0 similarity for matches | ||||
| +        if hasattr(other, 'orth'): | ||||
| +            if self.c.orth == other.orth: | ||||
| +                return 1.0 | ||||
| +        elif hasattr(other, '__len__') and len(other) == 1 \ | ||||
| +        and hasattr(other[0], 'orth'): | ||||
| +            if self.c.orth == other[0].orth: | ||||
| +                return 1.0 | ||||
|         if self.vector_norm == 0 or other.vector_norm == 0: | ||||
|             return 0.0 | ||||
|         return (numpy.dot(self.vector, other.vector) / | ||||
|  |  | |||
|  | @ -217,6 +217,16 @@ def test_doc_api_has_vector(): | |||
|     doc = Doc(vocab, words=['kitten']) | ||||
|     assert doc.has_vector | ||||
| 
 | ||||
| 
 | ||||
| def test_doc_api_similarity_match(): | ||||
|     doc = Doc(Vocab(), words=['a']) | ||||
|     assert doc.similarity(doc[0]) == 1.0 | ||||
|     assert doc.similarity(doc.vocab['a']) == 1.0 | ||||
|     doc2 = Doc(doc.vocab, words=['a', 'b', 'c']) | ||||
|     assert doc.similarity(doc2[:1]) == 1.0 | ||||
|     assert doc.similarity(doc2) == 0.0 | ||||
| 
 | ||||
| 
 | ||||
| def test_lowest_common_ancestor(en_tokenizer): | ||||
|     tokens = en_tokenizer('the lazy dog slept') | ||||
|     doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0]) | ||||
|  | @ -225,6 +235,7 @@ def test_lowest_common_ancestor(en_tokenizer): | |||
|     assert(lca[0, 1] == 2) | ||||
|     assert(lca[1, 2] == 2) | ||||
| 
 | ||||
| 
 | ||||
| def test_parse_tree(en_tokenizer): | ||||
|     """Tests doc.print_tree() method.""" | ||||
|     text = 'I like New York in Autumn.' | ||||
|  |  | |||
|  | @ -3,6 +3,8 @@ from __future__ import unicode_literals | |||
| 
 | ||||
| from ..util import get_doc | ||||
| from ...attrs import ORTH, LENGTH | ||||
| from ...tokens import Doc | ||||
| from ...vocab import Vocab | ||||
| 
 | ||||
| import pytest | ||||
| 
 | ||||
|  | @ -66,6 +68,15 @@ def test_spans_lca_matrix(en_tokenizer): | |||
|     assert(lca[1, 1] == 1) | ||||
| 
 | ||||
| 
 | ||||
| def test_span_similarity_match(): | ||||
|     doc = Doc(Vocab(), words=['a', 'b', 'a', 'b']) | ||||
|     span1 = doc[:2] | ||||
|     span2 = doc[2:] | ||||
|     assert span1.similarity(span2) == 1.0 | ||||
|     assert span1.similarity(doc) == 0.0 | ||||
|     assert span1[:1].similarity(doc.vocab['a']) == 1.0 | ||||
| 
 | ||||
| 
 | ||||
| def test_spans_default_sentiment(en_tokenizer): | ||||
|     """Test span.sentiment property's default averaging behaviour""" | ||||
|     text = "good stuff bad stuff" | ||||
|  |  | |||
|  | @ -160,8 +160,5 @@ def test_is_sent_start(en_tokenizer): | |||
|     assert doc[5].is_sent_start is None | ||||
|     doc[5].is_sent_start = True | ||||
|     assert doc[5].is_sent_start is True | ||||
|     # Backwards compatibility | ||||
|     with pytest.warns(DeprecationWarning): | ||||
|         assert doc[0].sent_start is False | ||||
|     doc.is_parsed = True | ||||
|     assert len(list(doc.sents)) == 2 | ||||
|  |  | |||
							
								
								
									
										31
									
								
								spacy/tests/regression/test_issue1537.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,31 @@ | |||
| '''Test Span.as_doc() doesn't segfault''' | ||||
| from __future__ import unicode_literals | ||||
| from ...tokens import Doc  | ||||
| from ...vocab import Vocab | ||||
| from ... import load as load_spacy | ||||
| 
 | ||||
| 
 | ||||
| def test_issue1537(): | ||||
|     string = 'The sky is blue . The man is pink . The dog is purple .' | ||||
|     doc = Doc(Vocab(), words=string.split()) | ||||
|     doc[0].sent_start = True | ||||
|     for word in doc[1:]: | ||||
|         if word.nbor(-1).text == '.': | ||||
|             word.sent_start = True | ||||
|         else: | ||||
|             word.sent_start = False | ||||
| 
 | ||||
|     sents = list(doc.sents) | ||||
|     sent0 = sents[0].as_doc() | ||||
|     sent1 = sents[1].as_doc() | ||||
|     assert isinstance(sent0, Doc) | ||||
|     assert isinstance(sent1, Doc) | ||||
| 
 | ||||
| 
 | ||||
| # Currently segfaulting, due to l_edge and r_edge misalignment | ||||
| #def test_issue1537_model(): | ||||
| #    nlp = load_spacy('en') | ||||
| #    doc = nlp(u'The sky is blue. The man is pink. The dog is purple.') | ||||
| #    sents = [s.as_doc() for s in doc.sents] | ||||
| #    print(list(sents[0].noun_chunks)) | ||||
| #    print(list(sents[1].noun_chunks)) | ||||
							
								
								
									
										10
									
								
								spacy/tests/regression/test_issue1539.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,10 @@ | |||
| '''Ensure vectors.resize() doesn't try to modify dictionary during iteration.''' | ||||
| from __future__ import unicode_literals | ||||
| 
 | ||||
| from ...vectors import Vectors | ||||
| 
 | ||||
| 
 | ||||
| def test_issue1539(): | ||||
|     v = Vectors(shape=(10, 10), keys=[5,3,98,100]) | ||||
|     v.resize((100,100)) | ||||
| 
 | ||||
							
								
								
									
										18
									
								
								spacy/tests/regression/test_issue1757.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,18 @@ | |||
| '''Test comparison against None doesn't cause segfault''' | ||||
| from __future__ import unicode_literals | ||||
| 
 | ||||
| from ...tokens import Doc | ||||
| from ...vocab import Vocab | ||||
| 
 | ||||
| def test_issue1757(): | ||||
|     doc = Doc(Vocab(), words=['a', 'b', 'c']) | ||||
|     assert not doc[0] < None | ||||
|     assert not doc[0] == None | ||||
|     assert doc[0] >= None | ||||
|     span = doc[:2] | ||||
|     assert not span < None | ||||
|     assert not span == None | ||||
|     assert span >= None | ||||
|     lex = doc.vocab['a'] | ||||
|     assert not lex == None | ||||
|     assert not lex < None | ||||
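The `None` guard added to the `__richcmp__` methods relies on the integer operator codes that Cython passes in (0 for `<`, 1 for `<=`, 2 for `==`, 3 for `!=`, 4 for `>`, 5 for `>=`). The rule can be sketched in plain Python as below; `compare_with_none` is a hypothetical helper name used only for illustration, not part of spaCy.

```python
# Operator codes passed to __richcmp__ (these mirror CPython's Py_LT .. Py_GE).
LT, LE, EQ, NE, GT, GE = 0, 1, 2, 3, 4, 5


def compare_with_none(op):
    """Result of comparing an object against None under the rule in the diff.

    `<`, `<=` and `==` are False; `!=`, `>` and `>=` are True.
    """
    return op not in (LT, LE, EQ)
```

This is consistent with the assertions in `test_issue1757`: `doc[0] < None` and `doc[0] == None` are falsy, while `doc[0] >= None` is truthy.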
							
								
								
									
										61
									
								
								spacy/tests/regression/test_issue1769.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,61 @@ | |||
| # coding: utf-8 | ||||
| from __future__ import unicode_literals | ||||
| from ...util import get_lang_class | ||||
| from ...attrs import LIKE_NUM | ||||
| 
 | ||||
| import pytest | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.parametrize('word', ['eleven']) | ||||
| def test_en_lex_attrs(word): | ||||
|     lang = get_lang_class('en') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['elleve', 'første']) | ||||
| def test_da_lex_attrs(word): | ||||
|     lang = get_lang_class('da') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['onze', 'onzième']) | ||||
| def test_fr_lex_attrs(word): | ||||
|     lang = get_lang_class('fr') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['sebelas']) | ||||
| def test_id_lex_attrs(word): | ||||
|     lang = get_lang_class('id') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['elf', 'elfde']) | ||||
| def test_nl_lex_attrs(word): | ||||
|     lang = get_lang_class('nl') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['onze', 'quadragésimo']) | ||||
| def test_pt_lex_attrs(word): | ||||
|     lang = get_lang_class('pt') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
| 
 | ||||
| 
 | ||||
| @pytest.mark.slow | ||||
| @pytest.mark.parametrize('word', ['одиннадцать']) | ||||
| def test_ru_lex_attrs(word): | ||||
|     lang = get_lang_class('ru') | ||||
|     like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] | ||||
|     assert like_num(word) == like_num(word.upper()) | ||||
							
								
								
									
										14
									
								
								spacy/tests/regression/test_issue1807.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,14 @@ | |||
| '''Test vocab.set_vector also adds the word to the vocab.''' | ||||
| from __future__ import unicode_literals | ||||
| from ...vocab import Vocab | ||||
| 
 | ||||
| import numpy  | ||||
| 
 | ||||
| 
 | ||||
| def test_issue1807(): | ||||
|     vocab = Vocab() | ||||
|     arr = numpy.ones((50,), dtype='f') | ||||
|     assert 'hello' not in vocab | ||||
|     vocab.set_vector('hello', arr) | ||||
|     assert 'hello' in vocab | ||||
| 
 | ||||
|  | @ -295,6 +295,17 @@ cdef class Doc: | |||
|         """ | ||||
|         if 'similarity' in self.user_hooks: | ||||
|             return self.user_hooks['similarity'](self, other) | ||||
|         if isinstance(other, (Lexeme, Token)) and self.length == 1: | ||||
|             if self.c[0].lex.orth == other.orth: | ||||
|                 return 1.0 | ||||
|         elif isinstance(other, (Span, Doc)): | ||||
|             if len(self) == len(other): | ||||
|                 for i in range(self.length): | ||||
|                     if self[i].orth != other[i].orth: | ||||
|                         break | ||||
|                 else: | ||||
|                     return 1.0 | ||||
|   | ||||
|         if self.vector_norm == 0 or other.vector_norm == 0: | ||||
|             return 0.0 | ||||
|         return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) | ||||
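The short-circuit added to `Doc.similarity` can be illustrated without spaCy. The sketch below is a simplification under stated assumptions: each object is reduced to a list of orth IDs plus a plain vector, and the function name and signature are hypothetical.

```python
import math


def similarity(orths_a, vector_a, orths_b, vector_b):
    """Cosine similarity with an exact-match shortcut.

    orths_*: lists of integer orth IDs, one per token.
    vector_*: lists of floats of equal length.
    """
    # Identical token sequences count as a perfect match, even without vectors.
    if len(orths_a) == len(orths_b) and all(a == b for a, b in zip(orths_a, orths_b)):
        return 1.0
    norm_a = math.sqrt(sum(x * x for x in vector_a))
    norm_b = math.sqrt(sum(x * x for x in vector_b))
    # Without usable vectors there is nothing to compare.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    return dot / (norm_a * norm_b)
```

This mirrors why `test_doc_api_similarity_match` expects `1.0` for matching tokens and `0.0` for a longer document in a vocab that has no vectors.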
|  | @ -508,13 +519,18 @@ cdef class Doc: | |||
|                 yield from self.user_hooks['sents'](self) | ||||
|                 return | ||||
| 
 | ||||
|             if not self.is_parsed: | ||||
|                 raise ValueError( | ||||
|                     "Sentence boundary detection requires the dependency " | ||||
|                     "parse, which requires a statistical model to be " | ||||
|                     "installed and loaded. For more info, see the " | ||||
|                     "documentation: \n%s\n" % about.__docs_models__) | ||||
|             cdef int i | ||||
|             if not self.is_parsed: | ||||
|                 for i in range(1, self.length): | ||||
|                     if self.c[i].sent_start != 0: | ||||
|                         break | ||||
|                 else: | ||||
|                     raise ValueError( | ||||
|                         "Sentence boundaries unset. You can add the 'sentencizer' " | ||||
|                         "component to the pipeline with: " | ||||
|                         "nlp.add_pipe(nlp.create_pipe('sentencizer')) " | ||||
|                         "Alternatively, add the dependency parser, or set " | ||||
|                         "sentence boundaries by setting doc[i].sent_start") | ||||
|             start = 0 | ||||
|             for i in range(1, self.length): | ||||
|                 if self.c[i].sent_start == 1: | ||||
|  |  | |||
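The revised `Doc.sents` logic now accepts manually set boundaries and raises only when every flag after the first token is unset. A rough pure-Python sketch follows. It assumes the flag encoding `1` = starts a sentence, `0` = unset, `-1` = does not start a sentence (only the `!= 0` and `== 1` checks are visible in the diff), it ignores the `is_parsed` branch, and `iter_sentence_bounds` is a hypothetical name.

```python
def iter_sentence_bounds(sent_starts):
    """Yield (start, end) token offsets for each sentence.

    sent_starts: one int flag per token (assumed: 1 start, 0 unset, -1 no start).
    """
    # Mirror the for/else check in the diff: at least one flag after the
    # first token must be set, otherwise boundaries are considered unset.
    if not any(flag != 0 for flag in sent_starts[1:]):
        raise ValueError("Sentence boundaries unset.")
    start = 0
    for i in range(1, len(sent_starts)):
        if sent_starts[i] == 1:
            yield (start, i)
            start = i
    # Emit the final sentence, which runs to the end of the document.
    if start != len(sent_starts):
        yield (start, len(sent_starts))
```

For example, flags `[0, -1, -1, 1, -1]` describe two sentences covering tokens 0 to 2 and 3 to 4.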
|  | @ -64,6 +64,11 @@ cdef class Span: | |||
|         self._vector_norm = vector_norm | ||||
| 
 | ||||
|     def __richcmp__(self, Span other, int op): | ||||
|         if other is None: | ||||
|             if op == 0 or op == 1 or op == 2: | ||||
|                 return False | ||||
|             else: | ||||
|                 return True | ||||
|         # Eq | ||||
|         if op == 0: | ||||
|             return self.start_char < other.start_char | ||||
|  | @ -179,6 +184,15 @@ cdef class Span: | |||
|         """ | ||||
|         if 'similarity' in self.doc.user_span_hooks: | ||||
|             self.doc.user_span_hooks['similarity'](self, other) | ||||
|         if len(self) == 1 and hasattr(other, 'orth'): | ||||
|             if self[0].orth == other.orth: | ||||
|                 return 1.0 | ||||
|         elif hasattr(other, '__len__') and len(self) == len(other): | ||||
|             for i in range(len(self)): | ||||
|                 if self[i].orth != getattr(other[i], 'orth', None): | ||||
|                     break | ||||
|             else: | ||||
|                 return 1.0 | ||||
|         if self.vector_norm == 0.0 or other.vector_norm == 0.0: | ||||
|             return 0.0 | ||||
|         return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) | ||||
|  | @ -261,6 +275,11 @@ cdef class Span: | |||
|             self.start = start | ||||
|             self.end = end + 1 | ||||
| 
 | ||||
|     property vocab: | ||||
|         """RETURNS (Vocab): The Span's Doc's vocab.""" | ||||
|         def __get__(self): | ||||
|             return self.doc.vocab | ||||
| 
 | ||||
|     property sent: | ||||
|         """RETURNS (Span): The sentence span that the span is a part of.""" | ||||
|         def __get__(self): | ||||
|  |  | |||
|  | @ -78,10 +78,15 @@ cdef class Token: | |||
| 
 | ||||
|     def __richcmp__(self, Token other, int op): | ||||
|         # http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html | ||||
|         if other is None: | ||||
|             if op in (0, 1, 2): | ||||
|                 return False | ||||
|             else: | ||||
|                 return True | ||||
|         cdef Doc my_doc = self.doc | ||||
|         cdef Doc other_doc = other.doc | ||||
|         my = self.idx | ||||
|         their = other.idx if other is not None else None | ||||
|         their = other.idx | ||||
|         if op == 0: | ||||
|             return my < their | ||||
|         elif op == 2: | ||||
|  | @ -144,6 +149,12 @@ cdef class Token: | |||
|         """ | ||||
|         if 'similarity' in self.doc.user_token_hooks: | ||||
|             return self.doc.user_token_hooks['similarity'](self) | ||||
|         if hasattr(other, '__len__') and len(other) == 1: | ||||
|             if self.c.lex.orth == getattr(other[0], 'orth', None): | ||||
|                 return 1.0 | ||||
|         elif hasattr(other, 'orth'): | ||||
|             if self.c.lex.orth == other.orth: | ||||
|                 return 1.0 | ||||
|         if self.vector_norm == 0 or other.vector_norm == 0: | ||||
|             return 0.0 | ||||
|         return (numpy.dot(self.vector, other.vector) / | ||||
|  | @ -341,19 +352,20 @@ cdef class Token: | |||
| 
 | ||||
|     property sent_start: | ||||
|         def __get__(self): | ||||
|             util.deprecated( | ||||
|                 "Token.sent_start is now deprecated. Use Token.is_sent_start " | ||||
|                 "instead, which returns a boolean value or None if the answer " | ||||
|                 "is unknown – instead of a misleading 0 for False and 1 for " | ||||
|                 "True. It also fixes a quirk in the old logic that would " | ||||
|                 "always set the property to 0 for the first word of the " | ||||
|                 "document.") | ||||
|             # Raising a deprecation warning causes errors for autocomplete | ||||
|             #util.deprecated( | ||||
|             #    "Token.sent_start is now deprecated. Use Token.is_sent_start " | ||||
|             #    "instead, which returns a boolean value or None if the answer " | ||||
|             #    "is unknown – instead of a misleading 0 for False and 1 for " | ||||
|             #    "True. It also fixes a quirk in the old logic that would " | ||||
|             #    "always set the property to 0 for the first word of the " | ||||
|             #    "document.") | ||||
|             # Handle broken backwards compatibility case: doc[0].sent_start | ||||
|             # was False. | ||||
|             if self.i == 0: | ||||
|                 return False | ||||
|             else: | ||||
|                 return self.sent_start | ||||
|                 return self.c.sent_start | ||||
| 
 | ||||
|         def __set__(self, value): | ||||
|             self.is_sent_start = value | ||||
|  |  | |||
|  | @ -151,7 +151,7 @@ cdef class Vectors: | |||
|         filled = {row for row in self.key2row.values()} | ||||
|         self._unset = {row for row in range(shape[0]) if row not in filled} | ||||
|         removed_items = [] | ||||
|         for key, row in self.key2row.items(): | ||||
|         for key, row in list(self.key2row.items()): | ||||
|             if row >= shape[0]: | ||||
|                 self.key2row.pop(key) | ||||
|                 removed_items.append((key, row)) | ||||
|  |  | |||
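The one-line change in `Vectors.resize` wraps the dictionary view in `list(...)` so that entries can be popped while looping. The sketch below reproduces the problem and the fix on a plain dict under Python 3; the function names are hypothetical and the dict stands in for `key2row`.

```python
def prune_rows_unsafe(key2row, n_rows):
    """Pops entries while iterating over the live view; fails on Python 3."""
    for key, row in key2row.items():
        if row >= n_rows:
            key2row.pop(key)  # the next loop step raises RuntimeError


def prune_rows(key2row, n_rows):
    """Iterates over a snapshot of the items, so popping is safe."""
    removed = []
    for key, row in list(key2row.items()):
        if row >= n_rows:
            key2row.pop(key)
            removed.append((key, row))
    return removed
```

Copying the items costs one extra list, which is negligible next to reallocating the vectors table itself.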
|  | @ -335,6 +335,7 @@ cdef class Vocab: | |||
|             else: | ||||
|                 width = self.vectors.shape[1] | ||||
|             self.vectors.resize((new_rows, width)) | ||||
|             lex = self[orth] # Adds word to vocab | ||||
|             self.vectors.add(orth, vector=vector) | ||||
|         self.vectors.add(orth, vector=vector) | ||||
| 
 | ||||
|  |  | |||
|  | @ -11,5 +11,6 @@ form.o-grid#mc-embedded-subscribe-form(action="//#{MAILCHIMP.user}.list-manage.c | |||
|         input(type="text" name="b_#{MAILCHIMP.id}_#{MAILCHIMP.list}" tabindex="-1" value="") | ||||
| 
 | ||||
|     .o-grid-col.o-grid.o-grid--nowrap.o-field.u-padding-small | ||||
|         input#mce-EMAIL.o-field__input.u-text(type="email" name="EMAIL" placeholder="Your email" aria-label="Your email") | ||||
|         div | ||||
|             input#mce-EMAIL.o-field__input.u-text(type="email" name="EMAIL" placeholder="Your email" aria-label="Your email") | ||||
|         button#mc-embedded-subscribe.o-field__button.u-text-label.u-color-theme.u-nowrap(type="submit" name="subscribe") Sign up | ||||
|  |  | |||
|  | @ -46,7 +46,7 @@ p | |||
| 
 | ||||
|     +table(["Tag", "POS", "Morphology", "Description"]) | ||||
|         +pos-row("-LRB-", "PUNCT", "PunctType=brck PunctSide=ini", "left round bracket") | ||||
|         +pos-row("-PRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket") | ||||
|         +pos-row("-RRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket") | ||||
|         +pos-row(",", "PUNCT", "PunctType=comm", "punctuation mark, comma") | ||||
|         +pos-row(":", "PUNCT", "", "punctuation mark, colon or ellipsis") | ||||
|         +pos-row(".", "PUNCT", "PunctType=peri", "punctuation mark, sentence closer") | ||||
|  | @ -86,7 +86,7 @@ p | |||
|         +pos-row("RBR", "ADV", "Degree=comp", "adverb, comparative") | ||||
|         +pos-row("RBS", "ADV", "Degree=sup", "adverb, superlative") | ||||
|         +pos-row("RP", "PART", "", "adverb, particle") | ||||
|         +pos-row("SP", "SPACE", "", "space") | ||||
|         +pos-row("_SP", "SPACE", "", "space") | ||||
|         +pos-row("SYM", "SYM", "", "symbol") | ||||
|         +pos-row("TO", "PART", "PartType=inf VerbForm=inf", "infinitival to") | ||||
|         +pos-row("UH", "INTJ", "", "interjection") | ||||
|  |  | |||
|  | @ -17,6 +17,17 @@ p | |||
|     |  Direct downloads don't perform any compatibility checks and require the | ||||
|     |  model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]). | ||||
| 
 | ||||
| +aside("Downloading best practices") | ||||
|     |  The #[code download] command is mostly intended as a convenient, | ||||
|     |  interactive wrapper – it performs compatibility checks and prints | ||||
|     |  detailed messages in case things go wrong. It's #[strong not recommended] | ||||
|     |  to use this command as part of an automated process. If you know which | ||||
|     |  model your project needs, you should consider a | ||||
|     |  #[+a("/usage/models#download-pip") direct download via pip], or | ||||
|     |  uploading the model to a local PyPI installation and fetching it straight | ||||
|     |  from there. This will also allow you to add it as a versioned package | ||||
|     |  dependency to your project. | ||||
| 
 | ||||
| +code(false, "bash", "$"). | ||||
|     python -m spacy download [model] [--direct] | ||||
| 
 | ||||
|  | @ -43,17 +54,6 @@ p | |||
|             |  The installed model package in your #[code site-packages] | ||||
|             |  directory and a shortcut link as a symlink in #[code spacy/data]. | ||||
| 
 | ||||
| +aside("Downloading best practices") | ||||
|     |  The #[code download] command is mostly intended as a convenient, | ||||
|     |  interactive wrapper – it performs compatibility checks and prints | ||||
|     |  detailed messages in case things go wrong. It's #[strong not recommended] | ||||
|     |  to use this command as part of an automated process. If you know which | ||||
|     |  model your project needs, you should consider a | ||||
|     |  #[+a("/usage/models#download-pip") direct download via pip], or | ||||
|     |  uploading the model to a local PyPI installation and fetching it straight | ||||
|     |  from there. This will also allow you to add it as a versioned package | ||||
|     |  dependency to your project. | ||||
| 
 | ||||
| +h(3, "link") Link | ||||
| 
 | ||||
| p | ||||
|  | @ -144,8 +144,14 @@ p | |||
|     |  #[code pip install -U spacy] to ensure that all installed models | ||||
|     |  can be used with the new version. The command is also useful to detect | ||||
|     |  out-of-sync model links resulting from links created in different virtual | ||||
|     |  environments. Prints a list of models, the installed versions, the latest | ||||
|     |  compatible version (if out of date) and the commands for updating. | ||||
|     |  environments. It will print a list of models, the installed versions, the | ||||
|     |  latest compatible version (if out of date) and the commands for updating. | ||||
| 
 | ||||
| +aside("Automated validation") | ||||
|     |  You can also use the #[code validate] command as part of your build | ||||
|     |  process or test suite, to ensure all models are up to date before | ||||
|     |  proceeding. If incompatible models or shortcut links are found, it will | ||||
|     |  return #[code 1]. | ||||
| 
 | ||||
| +code(false, "bash", "$"). | ||||
|     python -m spacy validate | ||||
|  | @ -335,8 +341,12 @@ p | |||
|     |  for your custom #[code train] command while still being able to easily | ||||
|     |  tweak the hyperparameters. For example: | ||||
| 
 | ||||
| +code(false, "bash"). | ||||
|     parser_hidden_depth=2 parser_maxout_pieces=1 train-parser | ||||
| +code(false, "bash", "$"). | ||||
|     parser_hidden_depth=2 parser_maxout_pieces=1 spacy train [...] | ||||
| 
 | ||||
| +code("Usage with alias", "bash", "$"). | ||||
|     alias train-parser="spacy train en /output /data /train /dev -n 1000" | ||||
|     parser_maxout_pieces=1 train-parser | ||||
| 
 | ||||
| +table(["Name", "Description", "Default"]) | ||||
|     +row | ||||
|  |  | |||
|  | @ -28,7 +28,7 @@ p Create the rule-based #[code PhraseMatcher]. | |||
|     +row | ||||
|         +cell #[code max_length] | ||||
|         +cell int | ||||
|         +cell Mamimum length of a phrase pattern to add. | ||||
|         +cell Maximum length of a phrase pattern to add. | ||||
| 
 | ||||
|     +row("foot") | ||||
|         +cell returns | ||||
|  |  | |||
|  | @ -394,7 +394,7 @@ p | |||
|             num, denom = text.split('/') | ||||
|             if num.isdigit() and denom.isdigit(): | ||||
|                 return True | ||||
|         if text in _num_words: | ||||
|         if text.lower() in _num_words: | ||||
|             return True | ||||
|         return False | ||||
| 
 | ||||
|  |  | |||
|  | @ -148,7 +148,7 @@ p | |||
|         +cell Negate the pattern, by requiring it to match exactly 0 times. | ||||
| 
 | ||||
|     +row | ||||
|         +cell #[code *] | ||||
|         +cell #[code ?] | ||||
|         +cell Make the pattern optional, by allowing it to match 0 or 1 times. | ||||
| 
 | ||||
|     +row | ||||
|  | @ -156,8 +156,8 @@ p | |||
|         +cell Require the pattern to match 1 or more times. | ||||
| 
 | ||||
|     +row | ||||
|         +cell #[code ?] | ||||
|         +cell Allow the pattern to zero or more times. | ||||
|         +cell #[code *] | ||||
|         +cell Allow the pattern to match zero or more times. | ||||
| 
 | ||||
| p | ||||
|     |  The #[code +] and #[code *] operators are usually interpreted | ||||
|  | @ -305,6 +305,54 @@ p | |||
|             |  A list of #[code (match_id, start, end)] tuples, describing the | ||||
|             |  matches. A match tuple describes a span #[code doc[start:end]]. | ||||
| 
 | ||||
| +h(3, "regex") Using regular expressions | ||||
| 
 | ||||
| p | ||||
|     |  In some cases, only matching tokens and token attributes isn't enough – | ||||
|     |  for example, you might want to match different spellings of a word, | ||||
|     |  without having to add a new pattern for each spelling. A simple solution | ||||
|     |  is to match a regular expression on the #[code Doc]'s #[code text] and | ||||
|     |  use the #[+api("doc#char_span") #[code Doc.char_span]] method to | ||||
|     |  create a #[code Span] from the character indices of the match: | ||||
| 
 | ||||
| +code. | ||||
|     import spacy | ||||
|     import re | ||||
| 
 | ||||
|     nlp = spacy.load('en') | ||||
|     doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".') | ||||
| 
 | ||||
|     DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely') | ||||
| 
 | ||||
|     for match in re.finditer(DEFINITELY_PATTERN, doc.text): | ||||
|         start, end = match.span()         # get matched indices | ||||
|         span = doc.char_span(start, end)  # create Span from indices | ||||
| 
 | ||||
| p | ||||
|     |  You can also use the regular expression with spaCy's #[code Matcher] by | ||||
|     |  converting it to a token flag. To ensure efficiency, the | ||||
|     |  #[code Matcher] can only access the C-level data. This means that it can | ||||
|     |  either use built-in token attributes or #[strong binary flags]. | ||||
|     |  #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which | ||||
|     |  you can use as a key of a token match pattern. Tokens that match the | ||||
|     |  regular expression will return #[code True] for the #[code IS_DEFINITELY] | ||||
|     |  flag. | ||||
| 
 | ||||
| +code. | ||||
|     IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match) | ||||
| 
 | ||||
|     matcher = Matcher(nlp.vocab) | ||||
|     matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}]) | ||||
| 
 | ||||
| p | ||||
|     |  Providing the regular expressions as binary flags also lets you use them | ||||
|     |  in combination with other token patterns – for example, to match the | ||||
|     |  word "definitely" in various spellings, followed by a case-insensitive | ||||
|     |  "not" and an adjective: | ||||
| 
 | ||||
| +code. | ||||
|     [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}] | ||||
| 
 | ||||
| +h(3, "example1") Example: Using linguistic annotations | ||||
| 
 | ||||
| p | ||||
|  | @ -354,7 +402,7 @@ p | |||
|         # append mock entity for match in displaCy style to matched_sents | ||||
|         # get the match span by offsetting the start and end of the span with the | ||||
|         # start and end of the sentence in the doc | ||||
|         match_ents = [{'start': span.start_char - sent.start_char,  | ||||
|         match_ents = [{'start': span.start_char - sent.start_char, | ||||
|                        'end': span.end_char - sent.start_char, | ||||
|                        'label': 'MATCH'}] | ||||
|         matched_sents.append({'text': sent.text, 'ents': match_ents }) | ||||
|  |  | |||
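The pattern from the regular-expression section added in this file can be checked with the standard library alone. The sketch below only computes character offsets; passing them to `Doc.char_span` is left out so the example runs without a model, and `find_spellings` is a hypothetical helper name.

```python
import re

# Same pattern as in the documentation hunk above.
DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')


def find_spellings(text):
    """Return (start, end) character offsets for every spelling variant found."""
    return [match.span() for match in DEFINITELY_PATTERN.finditer(text)]


text = 'The spelling is "definitely", not "definately" or "deffinitely".'
offsets = find_spellings(text)
variants = [text[start:end] for start, end in offsets]
```

Here `variants` contains the three spellings in order of appearance, and each `(start, end)` pair is what would be handed to `doc.char_span(start, end)` on a processed `Doc`.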
|  | @ -48,9 +48,9 @@ p | |||
|     |  those IDs back to strings. | ||||
| 
 | ||||
| +code. | ||||
|     moby_dick = open('moby_dick.txt', 'r') # open a large document | ||||
|     doc = nlp(moby_dick) # process it | ||||
|     doc.to_disk('/moby_dick.bin') # save the processed Doc | ||||
|     text = open('customer_feedback_627.txt', 'r').read() # open a document | ||||
|     doc = nlp(text) # process it | ||||
|     doc.to_disk('/customer_feedback_627.bin') # save the processed Doc | ||||
| 
 | ||||
| p | ||||
|     |  If you need it again later, you can load it back into an empty #[code Doc] | ||||
|  | @ -61,4 +61,4 @@ p | |||
|     from spacy.tokens import Doc # to create empty Doc | ||||
|     from spacy.vocab import Vocab # to create empty Vocab | ||||
| 
 | ||||
|     doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc | ||||
|     doc = Doc(Vocab()).from_disk('/customer_feedback_627.bin') # load processed Doc | ||||
|  |  | |||
|  | @ -8,7 +8,7 @@ p | |||
|     |  Collecting training data may sound incredibly painful – and it can be, | ||||
|     |  if you're planning a large-scale annotation project. However, if your main | ||||
|     |  goal is to update an existing model's predictions – for example, spaCy's | ||||
|     |  named entity recognition – the hard is part usually not creating the | ||||
|     |  named entity recognition – the hard part is usually not creating the | ||||
|     |  actual annotations. It's finding representative examples and | ||||
|     |  #[strong extracting potential candidates]. The good news is, if you've | ||||
|     |  been noticing bad performance on your data, you likely | ||||
|  |  | |||
|  | @ -106,6 +106,10 @@ p | |||
|             |  #[+api("tagger#from_disk") #[code Tagger.from_disk]] | ||||
|             |  #[+api("tagger#from_bytes") #[code Tagger.from_bytes]] | ||||
| 
 | ||||
|     +row | ||||
|         +cell #[code Tagger.tag_names] | ||||
|         +cell #[code Tagger.labels] | ||||
| 
 | ||||
|     +row | ||||
|         +cell #[code DependencyParser.load] | ||||
|         +cell | ||||
|  |  | |||
|  | @ -37,6 +37,9 @@ include ../_includes/_mixins | |||
|         +card("spacy-api-docker", "https://github.com/jgontrum/spacy-api-docker", "Johannes Gontrum", "github") | ||||
|             |  spaCy accessed by a REST API, wrapped in a Docker container. | ||||
| 
 | ||||
|         +card("languagecrunch", "https://github.com/artpar/languagecrunch", "Parth Mudgal", "github") | ||||
|             |  NLP server for spaCy, WordNet and NeuralCoref as a Docker image. | ||||
| 
 | ||||
|         +card("spacy-nlp-zeromq", "https://github.com/pasupulaphani/spacy-nlp-docker", "Phaninder Pasupula", "github") | ||||
|             |  Docker image exposing spaCy with ZeroMQ bindings. | ||||
| 
 | ||||
|  | @ -69,6 +72,10 @@ include ../_includes/_mixins | |||
|             |  Add language detection to your spaCy pipeline using Compact | ||||
|             |  Language Detector 2 via PYCLD2. | ||||
| 
 | ||||
|         +card("spacy-lookup", "https://github.com/mpuig/spacy-lookup", "Marc Puig", "github") | ||||
|             |  A powerful entity matcher for very large dictionaries, using the | ||||
|             |  FlashText module. | ||||
| 
 | ||||
|     .u-text-right | ||||
|         +button("https://github.com/topics/spacy-extension?o=desc&s=stars", false, "primary", "small") See more extensions on GitHub | ||||
| 
 | ||||
|  |  | |||