mirror of https://github.com/explosion/spaCy.git (synced 2025-10-26 21:51:24 +03:00)
Merge branch 'master' into spacy.io
commit 07ba9b4aa2
							
								
								
									
106  .github/contributors/GiorgioPorgio.md  vendored  Normal file
							|  | @ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI GmbH](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                | | ||||||
|  | |------------------------------- | -------------------- | | ||||||
|  | | Name                           |   George Ketsopoulos | | ||||||
|  | | Company name (if applicable)   |                      | | ||||||
|  | | Title or role (if applicable)  |                      | | ||||||
|  | | Date                           |   23 October 2019    | | ||||||
|  | | GitHub username                |   GiorgioPorgio      | | ||||||
|  | | Website (optional)             |                      | | ||||||
							
								
								
									
106  .github/contributors/zhuorulin.md  vendored  Normal file
							|  | @ -0,0 +1,106 @@ | ||||||
|  | # spaCy contributor agreement | ||||||
|  | 
 | ||||||
|  | This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||||
|  | [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||||
|  | The SCA applies to any contribution that you make to any product or project | ||||||
|  | managed by us (the **"project"**), and sets out the intellectual property rights | ||||||
|  | you grant to us in the contributed materials. The term **"us"** shall mean | ||||||
|  | [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term | ||||||
|  | **"you"** shall mean the person or entity identified below. | ||||||
|  | 
 | ||||||
|  | If you agree to be bound by these terms, fill in the information requested | ||||||
|  | below and include the filled-in version with your first pull request, under the | ||||||
|  | folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||||
|  | should be your GitHub username, with the extension `.md`. For example, the user | ||||||
|  | example_user would create the file `.github/contributors/example_user.md`. | ||||||
|  | 
 | ||||||
|  | Read this agreement carefully before signing. These terms and conditions | ||||||
|  | constitute a binding legal agreement. | ||||||
|  | 
 | ||||||
|  | ## Contributor Agreement | ||||||
|  | 
 | ||||||
|  | 1. The term "contribution" or "contributed materials" means any source code, | ||||||
|  | object code, patch, tool, sample, graphic, specification, manual, | ||||||
|  | documentation, or any other material posted or submitted by you to the project. | ||||||
|  | 
 | ||||||
|  | 2. With respect to any worldwide copyrights, or copyright applications and | ||||||
|  | registrations, in your contribution: | ||||||
|  | 
 | ||||||
|  |     * you hereby assign to us joint ownership, and to the extent that such | ||||||
|  |     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||||
|  |     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||||
|  |     royalty-free, unrestricted license to exercise all rights under those | ||||||
|  |     copyrights. This includes, at our option, the right to sublicense these same | ||||||
|  |     rights to third parties through multiple levels of sublicensees or other | ||||||
|  |     licensing arrangements; | ||||||
|  | 
 | ||||||
|  |     * you agree that each of us can do all things in relation to your | ||||||
|  |     contribution as if each of us were the sole owners, and if one of us makes | ||||||
|  |     a derivative work of your contribution, the one who makes the derivative | ||||||
|  |     work (or has it made) will be the sole owner of that derivative work; | ||||||
|  | 
 | ||||||
|  |     * you agree that you will not assert any moral rights in your contribution | ||||||
|  |     against us, our licensees or transferees; | ||||||
|  | 
 | ||||||
|  |     * you agree that we may register a copyright in your contribution and | ||||||
|  |     exercise all ownership rights associated with it; and | ||||||
|  | 
 | ||||||
|  |     * you agree that neither of us has any duty to consult with, obtain the | ||||||
|  |     consent of, pay or render an accounting to the other for any use or | ||||||
|  |     distribution of your contribution. | ||||||
|  | 
 | ||||||
|  | 3. With respect to any patents you own, or that you can license without payment | ||||||
|  | to any third party, you hereby grant to us a perpetual, irrevocable, | ||||||
|  | non-exclusive, worldwide, no-charge, royalty-free license to: | ||||||
|  | 
 | ||||||
|  |     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||||
|  |     your contribution in whole or in part, alone or in combination with or | ||||||
|  |     included in any product, work or materials arising out of the project to | ||||||
|  |     which your contribution was submitted, and | ||||||
|  | 
 | ||||||
|  |     * at our option, to sublicense these same rights to third parties through | ||||||
|  |     multiple levels of sublicensees or other licensing arrangements. | ||||||
|  | 
 | ||||||
|  | 4. Except as set out above, you keep all right, title, and interest in your | ||||||
|  | contribution. The rights that you grant to us under these terms are effective | ||||||
|  | on the date you first submitted a contribution to us, even if your submission | ||||||
|  | took place before the date you sign these terms. | ||||||
|  | 
 | ||||||
|  | 5. You covenant, represent, warrant and agree that: | ||||||
|  | 
 | ||||||
|  |     * Each contribution that you submit is and shall be an original work of | ||||||
|  |     authorship and you can legally grant the rights set out in this SCA; | ||||||
|  | 
 | ||||||
|  |     * to the best of your knowledge, each contribution will not violate any | ||||||
|  |     third party's copyrights, trademarks, patents, or other intellectual | ||||||
|  |     property rights; and | ||||||
|  | 
 | ||||||
|  |     * each contribution shall be in compliance with U.S. export control laws and | ||||||
|  |     other applicable export and import laws. You agree to notify us if you | ||||||
|  |     become aware of any circumstance which would make any of the foregoing | ||||||
|  |     representations inaccurate in any respect. We may publicly disclose your  | ||||||
|  |     participation in the project, including the fact that you have signed the SCA. | ||||||
|  | 
 | ||||||
|  | 6. This SCA is governed by the laws of the State of California and applicable | ||||||
|  | U.S. Federal law. Any choice of law rules will not apply. | ||||||
|  | 
 | ||||||
|  | 7. Please place an “x” on one of the applicable statements below. Please do NOT | ||||||
|  | mark both statements: | ||||||
|  | 
 | ||||||
|  |     * [x] I am signing on behalf of myself as an individual and no other person | ||||||
|  |     or entity, including my employer, has or will have rights with respect to my | ||||||
|  |     contributions. | ||||||
|  | 
 | ||||||
|  |     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||||
|  |     actual authority to contractually bind that entity. | ||||||
|  | 
 | ||||||
|  | ## Contributor Details | ||||||
|  | 
 | ||||||
|  | | Field                          | Entry                    | | ||||||
|  | |------------------------------- | ------------------------ | | ||||||
|  | | Name                           | Zhuoru Lin               | | ||||||
|  | | Company name (if applicable)   | Bombora Inc.             | | ||||||
|  | | Title or role (if applicable)  | Data Scientist           | | ||||||
|  | | Date                           | 2017-11-13               | | ||||||
|  | | GitHub username                | ZhuoruLin                | | ||||||
|  | | Website (optional)             |                          | | ||||||
							
								
								
									
2  Makefile
							|  | @ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex | ||||||
| 
 | 
 | ||||||
| dist/spacy-$(sha).pex : dist/$(wheel) | dist/spacy-$(sha).pex : dist/$(wheel) | ||||||
| 	env3.6/bin/python -m pip install pex==1.5.3 | 	env3.6/bin/python -m pip install pex==1.5.3 | ||||||
| 	env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex | 	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex | ||||||
| 
 | 
 | ||||||
| dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py* | dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py* | ||||||
| 	python3.6 -m venv env3.6 | 	python3.6 -m venv env3.6 | ||||||
|  |  | ||||||
							
								
								
									
13  README.md
							|  | @ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now | ||||||
| install spaCy via `conda-forge`: | install spaCy via `conda-forge`: | ||||||
| 
 | 
 | ||||||
| ```bash | ```bash | ||||||
| conda config --add channels conda-forge | conda install -c conda-forge spacy | ||||||
| conda install spacy |  | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| For the feedstock including the build recipe and configuration, check out | For the feedstock including the build recipe and configuration, check out | ||||||
|  | @ -214,16 +213,6 @@ doc = nlp("This is a sentence.") | ||||||
| 📖 **For more info and examples, check out the | 📖 **For more info and examples, check out the | ||||||
| [models documentation](https://spacy.io/docs/usage/models).** | [models documentation](https://spacy.io/docs/usage/models).** | ||||||
| 
 | 
 | ||||||
| ### Support for older versions |  | ||||||
| 
 |  | ||||||
| If you're using an older version (`v1.6.0` or below), you can still download and |  | ||||||
| install the old models from within spaCy using `python -m spacy.en.download all` |  | ||||||
| or `python -m spacy.de.download all`. The `.tar.gz` archives are also |  | ||||||
| [attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0). |  | ||||||
| To download and install the models manually, unpack the archive, drop the |  | ||||||
| contained directory into `spacy/data` and load the model via `spacy.load('en')` |  | ||||||
| or `spacy.load('de')`. |  | ||||||
| 
 |  | ||||||
| ## Compile from source | ## Compile from source | ||||||
| 
 | 
 | ||||||
| The other way to install spaCy is to clone its | The other way to install spaCy is to clone its | ||||||
|  |  | ||||||
|  | @ -84,7 +84,7 @@ def read_conllu(file_): | ||||||
| def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): | def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): | ||||||
|     if text_loc.parts[-1].endswith(".conllu"): |     if text_loc.parts[-1].endswith(".conllu"): | ||||||
|         docs = [] |         docs = [] | ||||||
|         with text_loc.open() as file_: |         with text_loc.open(encoding="utf8") as file_: | ||||||
|             for conllu_doc in read_conllu(file_): |             for conllu_doc in read_conllu(file_): | ||||||
|                 for conllu_sent in conllu_doc: |                 for conllu_sent in conllu_doc: | ||||||
|                     words = [line[1] for line in conllu_sent] |                     words = [line[1] for line in conllu_sent] | ||||||
|  |  | ||||||
|  | @ -203,7 +203,7 @@ def golds_to_gold_tuples(docs, golds): | ||||||
| def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): | def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): | ||||||
|     if text_loc.parts[-1].endswith(".conllu"): |     if text_loc.parts[-1].endswith(".conllu"): | ||||||
|         docs = [] |         docs = [] | ||||||
|         with text_loc.open() as file_: |         with text_loc.open(encoding="utf8") as file_: | ||||||
|             for conllu_doc in read_conllu(file_): |             for conllu_doc in read_conllu(file_): | ||||||
|                 for conllu_sent in conllu_doc: |                 for conllu_sent in conllu_doc: | ||||||
|                     words = [line[1] for line in conllu_sent] |                     words = [line[1] for line in conllu_sent] | ||||||
|  | @ -378,7 +378,7 @@ def _load_pretrained_tok2vec(nlp, loc): | ||||||
|     """Load pretrained weights for the 'token-to-vector' part of the component |     """Load pretrained weights for the 'token-to-vector' part of the component | ||||||
|     models, which is typically a CNN. See 'spacy pretrain'. Experimental. |     models, which is typically a CNN. See 'spacy pretrain'. Experimental. | ||||||
|     """ |     """ | ||||||
|     with Path(loc).open("rb") as file_: |     with Path(loc).open("rb", encoding="utf8") as file_: | ||||||
|         weights_data = file_.read() |         weights_data = file_.read() | ||||||
|     loaded = [] |     loaded = [] | ||||||
|     for name, component in nlp.pipeline: |     for name, component in nlp.pipeline: | ||||||
|  | @ -519,8 +519,8 @@ def main( | ||||||
|     for i in range(config.nr_epoch): |     for i in range(config.nr_epoch): | ||||||
|         docs, golds = read_data( |         docs, golds = read_data( | ||||||
|             nlp, |             nlp, | ||||||
|             paths.train.conllu.open(), |             paths.train.conllu.open(encoding="utf8"), | ||||||
|             paths.train.text.open(), |             paths.train.text.open(encoding="utf8"), | ||||||
|             max_doc_length=config.max_doc_length, |             max_doc_length=config.max_doc_length, | ||||||
|             limit=limit, |             limit=limit, | ||||||
|             oracle_segments=use_oracle_segments, |             oracle_segments=use_oracle_segments, | ||||||
|  | @ -560,7 +560,7 @@ def main( | ||||||
| 
 | 
 | ||||||
| def _render_parses(i, to_render): | def _render_parses(i, to_render): | ||||||
|     to_render[0].user_data["title"] = "Batch %d" % i |     to_render[0].user_data["title"] = "Batch %d" % i | ||||||
|     with Path("/tmp/parses.html").open("w") as file_: |     with Path("/tmp/parses.html").open("w", encoding="utf8") as file_: | ||||||
|         html = displacy.render(to_render[:5], style="dep", page=True) |         html = displacy.render(to_render[:5], style="dep", page=True) | ||||||
|         file_.write(html) |         file_.write(html) | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
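Note on the `encoding="utf8"` changes above: a minimal sketch, assuming only the Python standard library, of why text-mode opens benefit from an explicit encoding while binary-mode opens must not receive one. The helper names (`read_text_utf8`, `read_bytes`, `binary_open_with_encoding_fails`) are hypothetical and not part of spaCy. As a caution, the `_load_pretrained_tok2vec` hunk passes `encoding="utf8"` together with mode `"rb"`; Python's `open()` raises `ValueError` for that combination.

```python
import tempfile
from pathlib import Path


def read_text_utf8(path):
    """Read a text file with an explicit encoding, independent of the locale default."""
    with Path(path).open(encoding="utf8") as file_:
        return file_.read()


def read_bytes(path):
    """Read raw bytes; binary mode takes no encoding argument."""
    with Path(path).open("rb") as file_:
        return file_.read()


def binary_open_with_encoding_fails(path):
    """Return True if open() rejects an encoding argument in binary mode."""
    try:
        with Path(path).open("rb", encoding="utf8"):
            return False
    except ValueError:
        return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.conllu"
        # Write a non-ASCII token so a wrong default encoding would be visible.
        sample.write_text("1\tnaïve\t_\n", encoding="utf8")
        print(repr(read_text_utf8(sample)))
        print(repr(read_bytes(sample)))
        print(binary_open_with_encoding_fails(sample))
```

Without the explicit encoding, text-mode reads fall back to the platform default, which is the likely reason the CoNLL-U reads in these scripts were changed.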
|  | @ -77,6 +77,8 @@ def main( | ||||||
|     if labels_discard: |     if labels_discard: | ||||||
|         labels_discard = [x.strip() for x in labels_discard.split(",")] |         labels_discard = [x.strip() for x in labels_discard.split(",")] | ||||||
|         logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard)) |         logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard)) | ||||||
|  |     else: | ||||||
|  |         labels_discard = [] | ||||||
| 
 | 
 | ||||||
|     train_data = wikipedia_processor.read_training( |     train_data = wikipedia_processor.read_training( | ||||||
|         nlp=nlp, |         nlp=nlp, | ||||||
|  |  | ||||||
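Note on the `labels_discard` hunk above: a small sketch of the same normalization in isolation, assuming the option arrives as an optional comma-separated string. The function name `parse_labels_discard` is hypothetical; the point is that the added `else` branch guarantees a list, so later membership tests or loops do not need a `None` check.

```python
def parse_labels_discard(labels_discard=None):
    """Turn an optional comma-separated string into a list of labels."""
    if labels_discard:
        return [x.strip() for x in labels_discard.split(",")]
    # Mirrors the added else branch: always hand back a list.
    return []


if __name__ == "__main__":
    print(parse_labels_discard("PERSON, ORG ,GPE"))
    print(parse_labels_discard(None))
    # Safe even when nothing was passed on the command line.
    print("ORG" in parse_labels_discard(""))
```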
|  | @ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time. | ||||||
| The specific example here is not necessarily a good idea --- but it shows | The specific example here is not necessarily a good idea --- but it shows | ||||||
| how an arbitrary objective function for some word can be used. | how an arbitrary objective function for some word can be used. | ||||||
| 
 | 
 | ||||||
| Developed and tested for spaCy 2.0.6 | Developed and tested for spaCy 2.0.6. Updated for v2.2.2 | ||||||
| """ | """ | ||||||
| import random | import random | ||||||
| import plac | import plac | ||||||
| import spacy | import spacy | ||||||
| import os.path | import os.path | ||||||
|  | from spacy.tokens import Doc | ||||||
| from spacy.gold import read_json_file, GoldParse | from spacy.gold import read_json_file, GoldParse | ||||||
| 
 | 
 | ||||||
| random.seed(0) | random.seed(0) | ||||||
| 
 | 
 | ||||||
| PWD = os.path.dirname(__file__) | PWD = os.path.dirname(__file__) | ||||||
| 
 | 
 | ||||||
| TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json"))) | TRAIN_DATA = list(read_json_file( | ||||||
|  |     os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json"))) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def get_position_label(i, words, tags, heads, labels, ents): | def get_position_label(i, words, tags, heads, labels, ents): | ||||||
|  | @ -55,6 +57,7 @@ def main(n_iter=10): | ||||||
|     ner = nlp.create_pipe("ner") |     ner = nlp.create_pipe("ner") | ||||||
|     ner.add_multitask_objective(get_position_label) |     ner.add_multitask_objective(get_position_label) | ||||||
|     nlp.add_pipe(ner) |     nlp.add_pipe(ner) | ||||||
|  |     print(nlp.pipeline) | ||||||
| 
 | 
 | ||||||
|     print("Create data", len(TRAIN_DATA)) |     print("Create data", len(TRAIN_DATA)) | ||||||
|     optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) |     optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) | ||||||
|  | @ -62,23 +65,24 @@ def main(n_iter=10): | ||||||
|         random.shuffle(TRAIN_DATA) |         random.shuffle(TRAIN_DATA) | ||||||
|         losses = {} |         losses = {} | ||||||
|         for text, annot_brackets in TRAIN_DATA: |         for text, annot_brackets in TRAIN_DATA: | ||||||
|             annotations, _ = annot_brackets |             for annotations, _ in annot_brackets: | ||||||
|             doc = nlp.make_doc(text) |                 doc = Doc(nlp.vocab, words=annotations[1]) | ||||||
|             gold = GoldParse.from_annot_tuples(doc, annotations[0]) |                 gold = GoldParse.from_annot_tuples(doc, annotations) | ||||||
|             nlp.update( |                 nlp.update( | ||||||
|                 [doc],  # batch of texts |                     [doc],  # batch of texts | ||||||
|                 [gold],  # batch of annotations |                     [gold],  # batch of annotations | ||||||
|                 drop=0.2,  # dropout - make it harder to memorise data |                     drop=0.2,  # dropout - make it harder to memorise data | ||||||
|                 sgd=optimizer,  # callable to update weights |                     sgd=optimizer,  # callable to update weights | ||||||
|                 losses=losses, |                     losses=losses, | ||||||
|             ) |                 ) | ||||||
|         print(losses.get("nn_labeller", 0.0), losses["ner"]) |         print(losses.get("nn_labeller", 0.0), losses["ner"]) | ||||||
| 
 | 
 | ||||||
|     # test the trained model |     # test the trained model | ||||||
|     for text, _ in TRAIN_DATA: |     for text, _ in TRAIN_DATA: | ||||||
|         doc = nlp(text) |         if text is not None: | ||||||
|         print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) |             doc = nlp(text) | ||||||
|         print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) |             print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) | ||||||
|  |             print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| if __name__ == "__main__": | if __name__ == "__main__": | ||||||
|  |  | ||||||
|  | @ -1,10 +1,10 @@ | ||||||
| # Our libraries | # Our libraries | ||||||
| cymem>=2.0.2,<2.1.0 | cymem>=2.0.2,<2.1.0 | ||||||
| preshed>=3.0.2,<3.1.0 | preshed>=3.0.2,<3.1.0 | ||||||
| thinc>=7.2.0,<7.3.0 | thinc>=7.3.0,<7.4.0 | ||||||
| blis>=0.4.0,<0.5.0 | blis>=0.4.0,<0.5.0 | ||||||
| murmurhash>=0.28.0,<1.1.0 | murmurhash>=0.28.0,<1.1.0 | ||||||
| wasabi>=0.2.0,<1.1.0 | wasabi>=0.3.0,<1.1.0 | ||||||
| srsly>=0.1.0,<1.1.0 | srsly>=0.1.0,<1.1.0 | ||||||
| # Third party dependencies | # Third party dependencies | ||||||
| numpy>=1.15.0 | numpy>=1.15.0 | ||||||
|  |  | ||||||
|  | @ -38,18 +38,18 @@ setup_requires = | ||||||
|     cymem>=2.0.2,<2.1.0 |     cymem>=2.0.2,<2.1.0 | ||||||
|     preshed>=3.0.2,<3.1.0 |     preshed>=3.0.2,<3.1.0 | ||||||
|     murmurhash>=0.28.0,<1.1.0 |     murmurhash>=0.28.0,<1.1.0 | ||||||
|     thinc>=7.2.0,<7.3.0 |     thinc>=7.3.0,<7.4.0 | ||||||
| install_requires = | install_requires = | ||||||
|     setuptools |     setuptools | ||||||
|     numpy>=1.15.0 |     numpy>=1.15.0 | ||||||
|     murmurhash>=0.28.0,<1.1.0 |     murmurhash>=0.28.0,<1.1.0 | ||||||
|     cymem>=2.0.2,<2.1.0 |     cymem>=2.0.2,<2.1.0 | ||||||
|     preshed>=3.0.2,<3.1.0 |     preshed>=3.0.2,<3.1.0 | ||||||
|     thinc>=7.2.0,<7.3.0 |     thinc>=7.3.0,<7.4.0 | ||||||
|     blis>=0.4.0,<0.5.0 |     blis>=0.4.0,<0.5.0 | ||||||
|     plac>=0.9.6,<1.2.0 |     plac>=0.9.6,<1.2.0 | ||||||
|     requests>=2.13.0,<3.0.0 |     requests>=2.13.0,<3.0.0 | ||||||
|     wasabi>=0.2.0,<1.1.0 |     wasabi>=0.3.0,<1.1.0 | ||||||
|     srsly>=0.1.0,<1.1.0 |     srsly>=0.1.0,<1.1.0 | ||||||
|     pathlib==1.0.1; python_version < "3.4" |     pathlib==1.0.1; python_version < "3.4" | ||||||
|     importlib_metadata>=0.20; python_version < "3.8" |     importlib_metadata>=0.20; python_version < "3.8" | ||||||
|  |  | ||||||
|  | @ -9,12 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed") | ||||||
| # These are imported as part of the API | # These are imported as part of the API | ||||||
| from thinc.neural.util import prefer_gpu, require_gpu | from thinc.neural.util import prefer_gpu, require_gpu | ||||||
| 
 | 
 | ||||||
|  | from . import pipeline | ||||||
| from .cli.info import info as cli_info | from .cli.info import info as cli_info | ||||||
| from .glossary import explain | from .glossary import explain | ||||||
| from .about import __version__ | from .about import __version__ | ||||||
| from .errors import Errors, Warnings, deprecation_warning | from .errors import Errors, Warnings, deprecation_warning | ||||||
| from . import util | from . import util | ||||||
| from .util import register_architecture, get_architecture | from .util import register_architecture, get_architecture | ||||||
|  | from .language import component | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| if sys.maxunicode == 65535: | if sys.maxunicode == 65535: | ||||||
|  |  | ||||||
							
								
								
									
161  spacy/_ml.py
							|  | @ -3,16 +3,14 @@ from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| import numpy | import numpy | ||||||
| from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu | from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu | ||||||
| from thinc.i2v import HashEmbed, StaticVectors |  | ||||||
| from thinc.t2t import ExtractWindow, ParametricAttention | from thinc.t2t import ExtractWindow, ParametricAttention | ||||||
| from thinc.t2v import Pooling, sum_pool, mean_pool | from thinc.t2v import Pooling, sum_pool, mean_pool | ||||||
| from thinc.misc import Residual | from thinc.i2v import HashEmbed | ||||||
|  | from thinc.misc import Residual, FeatureExtracter | ||||||
| from thinc.misc import LayerNorm as LN | from thinc.misc import LayerNorm as LN | ||||||
| from thinc.misc import FeatureExtracter |  | ||||||
| from thinc.api import add, layerize, chain, clone, concatenate, with_flatten | from thinc.api import add, layerize, chain, clone, concatenate, with_flatten | ||||||
| from thinc.api import with_getitem, flatten_add_lengths | from thinc.api import with_getitem, flatten_add_lengths | ||||||
| from thinc.api import uniqued, wrap, noop | from thinc.api import uniqued, wrap, noop | ||||||
| from thinc.api import with_square_sequences |  | ||||||
| from thinc.linear.linear import LinearModel | from thinc.linear.linear import LinearModel | ||||||
| from thinc.neural.ops import NumpyOps, CupyOps | from thinc.neural.ops import NumpyOps, CupyOps | ||||||
| from thinc.neural.util import get_array_module, copy_array | from thinc.neural.util import get_array_module, copy_array | ||||||
|  | @ -26,14 +24,13 @@ import thinc.extra.load_nlp | ||||||
| from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE | from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE | ||||||
| from .errors import Errors, user_warning, Warnings | from .errors import Errors, user_warning, Warnings | ||||||
| from . import util | from . import util | ||||||
|  | from . import ml as new_ml | ||||||
|  | from .ml import _legacy_tok2vec | ||||||
| 
 | 
 | ||||||
| try: |  | ||||||
|     import torch.nn |  | ||||||
|     from thinc.extra.wrappers import PyTorchWrapperRNN |  | ||||||
| except ImportError: |  | ||||||
|     torch = None |  | ||||||
| 
 | 
 | ||||||
| VECTORS_KEY = "spacy_pretrained_vectors" | VECTORS_KEY = "spacy_pretrained_vectors" | ||||||
|  | # Backwards compatibility with <2.2.2 | ||||||
|  | USE_MODEL_REGISTRY_TOK2VEC = False | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def cosine(vec1, vec2): | def cosine(vec1, vec2): | ||||||
|  | @ -310,6 +307,10 @@ def link_vectors_to_models(vocab): | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): | def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): | ||||||
|  |     import torch.nn | ||||||
|  |     from thinc.api import with_square_sequences | ||||||
|  |     from thinc.extra.wrappers import PyTorchWrapperRNN | ||||||
|  | 
 | ||||||
|     if depth == 0: |     if depth == 0: | ||||||
|         return layerize(noop()) |         return layerize(noop()) | ||||||
|     model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) |     model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) | ||||||
|  | @ -317,81 +318,89 @@ def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def Tok2Vec(width, embed_size, **kwargs): | def Tok2Vec(width, embed_size, **kwargs): | ||||||
|  |     if not USE_MODEL_REGISTRY_TOK2VEC: | ||||||
|  |         # Preserve prior tok2vec for backwards compat, in v2.2.2 | ||||||
|  |         return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs) | ||||||
|     pretrained_vectors = kwargs.get("pretrained_vectors", None) |     pretrained_vectors = kwargs.get("pretrained_vectors", None) | ||||||
|     cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) |     cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) | ||||||
|     subword_features = kwargs.get("subword_features", True) |     subword_features = kwargs.get("subword_features", True) | ||||||
|     char_embed = kwargs.get("char_embed", False) |     char_embed = kwargs.get("char_embed", False) | ||||||
|     if char_embed: |  | ||||||
|         subword_features = False |  | ||||||
|     conv_depth = kwargs.get("conv_depth", 4) |     conv_depth = kwargs.get("conv_depth", 4) | ||||||
|     bilstm_depth = kwargs.get("bilstm_depth", 0) |     bilstm_depth = kwargs.get("bilstm_depth", 0) | ||||||
|     cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] |     conv_window = kwargs.get("conv_window", 1) | ||||||
|     with Model.define_operators( |  | ||||||
|         {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply} |  | ||||||
|     ): |  | ||||||
|         norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm") |  | ||||||
|         if subword_features: |  | ||||||
|             prefix = HashEmbed( |  | ||||||
|                 width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix" |  | ||||||
|             ) |  | ||||||
|             suffix = HashEmbed( |  | ||||||
|                 width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix" |  | ||||||
|             ) |  | ||||||
|             shape = HashEmbed( |  | ||||||
|                 width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape" |  | ||||||
|             ) |  | ||||||
|         else: |  | ||||||
|             prefix, suffix, shape = (None, None, None) |  | ||||||
|         if pretrained_vectors is not None: |  | ||||||
|             glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) |  | ||||||
| 
 | 
 | ||||||
|             if subword_features: |     cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] | ||||||
|                 embed = uniqued( |  | ||||||
|                     (glove | norm | prefix | suffix | shape) |  | ||||||
|                     >> LN(Maxout(width, width * 5, pieces=3)), |  | ||||||
|                     column=cols.index(ORTH), |  | ||||||
|                 ) |  | ||||||
|             else: |  | ||||||
|                 embed = uniqued( |  | ||||||
|                     (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), |  | ||||||
|                     column=cols.index(ORTH), |  | ||||||
|                 ) |  | ||||||
|         elif subword_features: |  | ||||||
|             embed = uniqued( |  | ||||||
|                 (norm | prefix | suffix | shape) |  | ||||||
|                 >> LN(Maxout(width, width * 4, pieces=3)), |  | ||||||
|                 column=cols.index(ORTH), |  | ||||||
|             ) |  | ||||||
|         elif char_embed: |  | ||||||
|             embed = concatenate_lists( |  | ||||||
|                 CharacterEmbed(nM=64, nC=8), |  | ||||||
|                 FeatureExtracter(cols) >> with_flatten(norm), |  | ||||||
|             ) |  | ||||||
|             reduce_dimensions = LN( |  | ||||||
|                 Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) |  | ||||||
|             ) |  | ||||||
|         else: |  | ||||||
|             embed = norm |  | ||||||
| 
 | 
 | ||||||
|         convolution = Residual( |     doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}} | ||||||
|             ExtractWindow(nW=1) |     if char_embed: | ||||||
|             >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) |         embed_cfg = { | ||||||
|         ) |             "arch": "spacy.CharacterEmbed.v1", | ||||||
|         if char_embed: |             "config": { | ||||||
|             tok2vec = embed >> with_flatten( |                 "width": 64, | ||||||
|                 reduce_dimensions >> convolution ** conv_depth, pad=conv_depth |                 "chars": 6, | ||||||
|             ) |                 "@mix": { | ||||||
|         else: |                     "arch": "spacy.LayerNormalizedMaxout.v1", | ||||||
|             tok2vec = FeatureExtracter(cols) >> with_flatten( |                     "config": {"width": width, "pieces": 3}, | ||||||
|                 embed >> convolution ** conv_depth, pad=conv_depth |                 }, | ||||||
|             ) |                 "@embed_features": None, | ||||||
| 
 |             }, | ||||||
|         if bilstm_depth >= 1: |         } | ||||||
|             tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) |     else: | ||||||
|         # Work around thinc API limitations :(. TODO: Revise in Thinc 7 |         embed_cfg = { | ||||||
|         tok2vec.nO = width |             "arch": "spacy.MultiHashEmbed.v1", | ||||||
|         tok2vec.embed = embed |             "config": { | ||||||
|     return tok2vec |                 "width": width, | ||||||
|  |                 "rows": embed_size, | ||||||
|  |                 "columns": cols, | ||||||
|  |                 "use_subwords": subword_features, | ||||||
|  |                 "@pretrained_vectors": None, | ||||||
|  |                 "@mix": { | ||||||
|  |                     "arch": "spacy.LayerNormalizedMaxout.v1", | ||||||
|  |                     "config": {"width": width, "pieces": 3}, | ||||||
|  |                 }, | ||||||
|  |             }, | ||||||
|  |         } | ||||||
|  |         if pretrained_vectors: | ||||||
|  |             embed_cfg["config"]["@pretrained_vectors"] = { | ||||||
|  |                 "arch": "spacy.PretrainedVectors.v1", | ||||||
|  |                 "config": { | ||||||
|  |                     "vectors_name": pretrained_vectors, | ||||||
|  |                     "width": width, | ||||||
|  |                     "column": cols.index("ID"), | ||||||
|  |                 }, | ||||||
|  |             } | ||||||
|  |     if cnn_maxout_pieces >= 2: | ||||||
|  |         cnn_cfg = { | ||||||
|  |             "arch": "spacy.MaxoutWindowEncoder.v1", | ||||||
|  |             "config": { | ||||||
|  |                 "width": width, | ||||||
|  |                 "window_size": conv_window, | ||||||
|  |                 "pieces": cnn_maxout_pieces, | ||||||
|  |                 "depth": conv_depth, | ||||||
|  |             }, | ||||||
|  |         } | ||||||
|  |     else: | ||||||
|  |         cnn_cfg = { | ||||||
|  |             "arch": "spacy.MishWindowEncoder.v1", | ||||||
|  |             "config": {"width": width, "window_size": conv_window, "depth": conv_depth}, | ||||||
|  |         } | ||||||
|  |     bilstm_cfg = { | ||||||
|  |         "arch": "spacy.TorchBiLSTMEncoder.v1", | ||||||
|  |         "config": {"width": width, "depth": bilstm_depth}, | ||||||
|  |     } | ||||||
|  |     if conv_depth == 0 and bilstm_depth == 0: | ||||||
|  |         encode_cfg = {} | ||||||
|  |     elif conv_depth >= 1 and bilstm_depth >= 1: | ||||||
|  |         encode_cfg = { | ||||||
|  |             "arch": "thinc.FeedForward.v1", | ||||||
|  |             "config": {"children": [cnn_cfg, bilstm_cfg]}, | ||||||
|  |         } | ||||||
|  |     elif conv_depth >= 1: | ||||||
|  |         encode_cfg = cnn_cfg | ||||||
|  |     else: | ||||||
|  |         encode_cfg = bilstm_cfg | ||||||
|  |     config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg} | ||||||
|  |     return new_ml.Tok2Vec(config) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def reapply(layer, n_times): | def reapply(layer, n_times): | ||||||
|  |  | ||||||
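The encoder selection in the rewritten `Tok2Vec` body can be read in isolation. What follows is a minimal sketch, not spaCy's implementation: `build_encode_cfg` is a hypothetical helper that mirrors only the branching shown in the diff above (Maxout or Mish depending on `cnn_maxout_pieces`, then CNN, BiLSTM, both, or neither depending on the two depths). It returns plain dicts and never touches the model registry.

```python
def build_encode_cfg(
    width, conv_depth=4, bilstm_depth=0, conv_window=1, cnn_maxout_pieces=3
):
    """Hypothetical helper: reproduce the config branching from the diff."""
    # Two or more pieces selects the Maxout encoder; fewer selects Mish.
    if cnn_maxout_pieces >= 2:
        cnn_cfg = {
            "arch": "spacy.MaxoutWindowEncoder.v1",
            "config": {
                "width": width,
                "window_size": conv_window,
                "pieces": cnn_maxout_pieces,
                "depth": conv_depth,
            },
        }
    else:
        cnn_cfg = {
            "arch": "spacy.MishWindowEncoder.v1",
            "config": {"width": width, "window_size": conv_window, "depth": conv_depth},
        }
    bilstm_cfg = {
        "arch": "spacy.TorchBiLSTMEncoder.v1",
        "config": {"width": width, "depth": bilstm_depth},
    }
    # Pick the encoder stack from the two depth settings.
    if conv_depth == 0 and bilstm_depth == 0:
        return {}
    if conv_depth >= 1 and bilstm_depth >= 1:
        return {
            "arch": "thinc.FeedForward.v1",
            "config": {"children": [cnn_cfg, bilstm_cfg]},
        }
    if conv_depth >= 1:
        return cnn_cfg
    return bilstm_cfg


print(build_encode_cfg(96)["arch"])
print(build_encode_cfg(96, cnn_maxout_pieces=1)["arch"])
```

With the defaults, the CNN encoder alone is selected; a BiLSTM is chained after it only when `bilstm_depth` is at least 1.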
|  | @ -1,6 +1,6 @@ | ||||||
| # fmt: off | # fmt: off | ||||||
| __title__ = "spacy" | __title__ = "spacy" | ||||||
| __version__ = "2.2.2.dev1" | __version__ = "2.2.2" | ||||||
| __release__ = True | __release__ = True | ||||||
| __download_url__ = "https://github.com/explosion/spacy-models/releases/download" | __download_url__ = "https://github.com/explosion/spacy-models/releases/download" | ||||||
| __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" | __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" | ||||||
|  |  | ||||||
							
								
								
									
										179
									
								
								spacy/analysis.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,179 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from collections import OrderedDict | ||||||
|  | from wasabi import Printer | ||||||
|  | 
 | ||||||
|  | from .tokens import Doc, Token, Span | ||||||
|  | from .errors import Errors, Warnings, user_warning | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def analyze_pipes(pipeline, name, pipe, index, warn=True): | ||||||
|  |     """Analyze a pipeline component with respect to its position in the current | ||||||
|  |     pipeline and the other components. Will check whether requirements are | ||||||
|  |     fulfilled (e.g. if previous components assign the attributes). | ||||||
|  | 
 | ||||||
|  |     pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. | ||||||
|  |     name (unicode): The name of the pipeline component to analyze. | ||||||
|  |     pipe (callable): The pipeline component function to analyze. | ||||||
|  |     index (int): The index of the component in the pipeline. | ||||||
|  |     warn (bool): Show user warning if problem is found. | ||||||
|  |     RETURNS (list): The problems found for the given pipeline component. | ||||||
|  |     """ | ||||||
|  |     assert pipeline[index][0] == name | ||||||
|  |     prev_pipes = pipeline[:index] | ||||||
|  |     pipe_requires = getattr(pipe, "requires", []) | ||||||
|  |     requires = OrderedDict([(annot, False) for annot in pipe_requires]) | ||||||
|  |     if requires: | ||||||
|  |         for prev_name, prev_pipe in prev_pipes: | ||||||
|  |             prev_assigns = getattr(prev_pipe, "assigns", []) | ||||||
|  |             for annot in prev_assigns: | ||||||
|  |                 requires[annot] = True | ||||||
|  |     problems = [] | ||||||
|  |     for annot, fulfilled in requires.items(): | ||||||
|  |         if not fulfilled: | ||||||
|  |             problems.append(annot) | ||||||
|  |             if warn: | ||||||
|  |                 user_warning(Warnings.W025.format(name=name, attr=annot)) | ||||||
|  |     return problems | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def analyze_all_pipes(pipeline, warn=True): | ||||||
|  |     """Analyze all pipes in the pipeline in order. | ||||||
|  | 
 | ||||||
|  |     pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. | ||||||
|  |     warn (bool): Show user warning if problem is found. | ||||||
|  |     RETURNS (dict): The problems found, keyed by component name. | ||||||
|  |     """ | ||||||
|  |     problems = {} | ||||||
|  |     for i, (name, pipe) in enumerate(pipeline): | ||||||
|  |         problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn) | ||||||
|  |     return problems | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def dot_to_dict(values): | ||||||
|  |     """Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"] | ||||||
|  |     becomes {"token": {"pos": True, "_": {"xyz": True}}}. | ||||||
|  | 
 | ||||||
|  |     values (iterable): The values to convert. | ||||||
|  |     RETURNS (dict): The converted values. | ||||||
|  |     """ | ||||||
|  |     result = {} | ||||||
|  |     for value in values: | ||||||
|  |         path = result | ||||||
|  |         parts = value.lower().split(".") | ||||||
|  |         for i, item in enumerate(parts): | ||||||
|  |             is_last = i == len(parts) - 1 | ||||||
|  |             path = path.setdefault(item, True if is_last else {}) | ||||||
|  |     return result | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def validate_attrs(values): | ||||||
|  |     """Validate component attributes provided to "assigns", "requires" etc. | ||||||
|  |     Raises error for invalid attributes and formatting. Doesn't check if | ||||||
|  |     custom extension attributes are registered, since this is something the | ||||||
|  |     user might want to do themselves later in the component. | ||||||
|  | 
 | ||||||
|  |     values (iterable): The string attributes to check, e.g. `["token.pos"]`. | ||||||
|  |     RETURNS (iterable): The checked attributes. | ||||||
|  |     """ | ||||||
|  |     data = dot_to_dict(values) | ||||||
|  |     objs = {"doc": Doc, "token": Token, "span": Span} | ||||||
|  |     for obj_key, attrs in data.items(): | ||||||
|  |         if obj_key == "span": | ||||||
|  |             # Support Span only for custom extension attributes | ||||||
|  |             span_attrs = [attr for attr in values if attr.startswith("span.")] | ||||||
|  |             span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")] | ||||||
|  |             if span_attrs: | ||||||
|  |                 raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs))) | ||||||
|  |         if obj_key not in objs:  # first element is not doc/token/span | ||||||
|  |             invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key)) | ||||||
|  |             raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs)) | ||||||
|  |         if not isinstance(attrs, dict):  # attr is something like "doc" | ||||||
|  |             raise ValueError(Errors.E182.format(attr=obj_key)) | ||||||
|  |         for attr, value in attrs.items(): | ||||||
|  |             if attr == "_": | ||||||
|  |                 if value is True:  # attr is something like "doc._" | ||||||
|  |                     raise ValueError(Errors.E182.format(attr="{}._".format(obj_key))) | ||||||
|  |                 for ext_attr, ext_value in value.items(): | ||||||
|  |                     # We don't check whether the attribute actually exists | ||||||
|  |                     if ext_value is not True:  # attr is something like doc._.x.y | ||||||
|  |                         good = "{}._.{}".format(obj_key, ext_attr) | ||||||
|  |                         bad = "{}.{}".format(good, ".".join(ext_value)) | ||||||
|  |                         raise ValueError(Errors.E183.format(attr=bad, solution=good)) | ||||||
|  |                 continue  # we can't validate those further | ||||||
|  |             if attr.endswith("_"):  # attr is something like "token.pos_" | ||||||
|  |                 raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1])) | ||||||
|  |             if value is not True:  # attr is something like doc.x.y | ||||||
|  |                 good = "{}.{}".format(obj_key, attr) | ||||||
|  |                 bad = "{}.{}".format(good, ".".join(value)) | ||||||
|  |                 raise ValueError(Errors.E183.format(attr=bad, solution=good)) | ||||||
|  |             obj = objs[obj_key] | ||||||
|  |             if not hasattr(obj, attr): | ||||||
|  |                 raise ValueError(Errors.E185.format(obj=obj_key, attr=attr)) | ||||||
|  |     return values | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def _get_feature_for_attr(pipeline, attr, feature): | ||||||
|  |     assert feature in ["assigns", "requires"] | ||||||
|  |     result = [] | ||||||
|  |     for pipe_name, pipe in pipeline: | ||||||
|  |         pipe_attrs = getattr(pipe, feature, []) | ||||||
|  |         if attr in pipe_attrs: | ||||||
|  |             result.append((pipe_name, pipe)) | ||||||
|  |     return result | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def get_assigns_for_attr(pipeline, attr): | ||||||
|  |     """Get all pipeline components that assign an attr, e.g. "doc.tensor". | ||||||
|  | 
 | ||||||
|  |     pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. | ||||||
|  |     attr (unicode): The attribute to check. | ||||||
|  |     RETURNS (list): (name, pipe) tuples of components that assign the attr. | ||||||
|  |     """ | ||||||
|  |     return _get_feature_for_attr(pipeline, attr, "assigns") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def get_requires_for_attr(pipeline, attr): | ||||||
|  |     """Get all pipeline components that require an attr, e.g. "doc.tensor". | ||||||
|  | 
 | ||||||
|  |     pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. | ||||||
|  |     attr (unicode): The attribute to check. | ||||||
|  |     RETURNS (list): (name, pipe) tuples of components that require the attr. | ||||||
|  |     """ | ||||||
|  |     return _get_feature_for_attr(pipeline, attr, "requires") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def print_summary(nlp, pretty=True, no_print=False): | ||||||
|  |     """Print a formatted summary for the current nlp object's pipeline. Shows | ||||||
|  |     a table with the pipeline components and what they assign and require, as | ||||||
|  |     well as any problems if available. | ||||||
|  | 
 | ||||||
|  |     nlp (Language): The nlp object. | ||||||
|  |     pretty (bool): Pretty-print the results (color etc). | ||||||
|  |     no_print (bool): Don't print anything, just return the data. | ||||||
|  |     RETURNS (dict): A dict with "overview" and "problems" if no_print is True, else None. | ||||||
|  |     """ | ||||||
|  |     msg = Printer(pretty=pretty, no_print=no_print) | ||||||
|  |     overview = [] | ||||||
|  |     problems = {} | ||||||
|  |     for i, (name, pipe) in enumerate(nlp.pipeline): | ||||||
|  |         requires = getattr(pipe, "requires", []) | ||||||
|  |         assigns = getattr(pipe, "assigns", []) | ||||||
|  |         retok = getattr(pipe, "retokenizes", False) | ||||||
|  |         overview.append((i, name, requires, assigns, retok)) | ||||||
|  |         problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False) | ||||||
|  |     msg.divider("Pipeline Overview") | ||||||
|  |     header = ("#", "Component", "Requires", "Assigns", "Retokenizes") | ||||||
|  |     msg.table(overview, header=header, divider=True, multiline=True) | ||||||
|  |     n_problems = sum(len(p) for p in problems.values()) | ||||||
|  |     if any(p for p in problems.values()): | ||||||
|  |         msg.divider("Problems ({})".format(n_problems)) | ||||||
|  |         for name, problem in problems.items(): | ||||||
|  |             if problem: | ||||||
|  |                 problem = ", ".join(problem) | ||||||
|  |                 msg.warn("'{}' requirements not met: {}".format(name, problem)) | ||||||
|  |     else: | ||||||
|  |         msg.good("No problems found.") | ||||||
|  |     if no_print: | ||||||
|  |         return {"overview": overview, "problems": problems} | ||||||
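The requirement check in `analyze_pipes` amounts to: collect what a component declares in `requires`, then mark each entry as fulfilled if any earlier component lists it in `assigns`. Below is a minimal, spaCy-free sketch of that idea. `find_unmet_requirements`, `FakeTagger` and `FakeLemmatizer` are hypothetical names introduced for illustration. Unlike the original, the sketch only marks attributes that were actually required; the list of problems it returns is the same.

```python
from collections import OrderedDict


def find_unmet_requirements(pipeline, index):
    """Return attributes required by pipeline[index] that no earlier pipe assigns."""
    _, pipe = pipeline[index]
    requires = OrderedDict((annot, False) for annot in getattr(pipe, "requires", []))
    # Only components that run before `index` can satisfy a requirement.
    for _, prev_pipe in pipeline[:index]:
        for annot in getattr(prev_pipe, "assigns", []):
            if annot in requires:
                requires[annot] = True
    return [annot for annot, fulfilled in requires.items() if not fulfilled]


class FakeTagger(object):
    # Hypothetical component: declares what it assigns, requires nothing.
    assigns = ["token.tag", "token.pos"]


class FakeLemmatizer(object):
    # Hypothetical component: needs part-of-speech tags to be set first.
    requires = ["token.pos"]
    assigns = ["token.lemma"]


pipeline_ok = [("tagger", FakeTagger()), ("lemmatizer", FakeLemmatizer())]
pipeline_bad = [("lemmatizer", FakeLemmatizer()), ("tagger", FakeTagger())]

print(find_unmet_requirements(pipeline_ok, 1))   # []
print(find_unmet_requirements(pipeline_bad, 0))  # ['token.pos']
```

Because only the slice `pipeline[:index]` is inspected, reordering components is enough to turn a clean pipeline into one with an unmet requirement, which is the case the W025 warning reports.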
|  | @ -57,7 +57,7 @@ def convert( | ||||||
|     is written to stdout, so you can pipe them forward to a JSON file: |     is written to stdout, so you can pipe them forward to a JSON file: | ||||||
|     $ spacy convert some_file.conllu > some_file.json |     $ spacy convert some_file.conllu > some_file.json | ||||||
|     """ |     """ | ||||||
|     no_print = (output_dir == "-") |     no_print = output_dir == "-" | ||||||
|     msg = Printer(no_print=no_print) |     msg = Printer(no_print=no_print) | ||||||
|     input_path = Path(input_file) |     input_path = Path(input_file) | ||||||
|     if file_type not in FILE_TYPES: |     if file_type not in FILE_TYPES: | ||||||
|  |  | ||||||
|  | @ -9,7 +9,9 @@ from ...tokens.doc import Doc | ||||||
| from ...util import load_model | from ...util import load_model | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs): | def conll_ner2json( | ||||||
|  |     input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs | ||||||
|  | ): | ||||||
|     """ |     """ | ||||||
|     Convert files in the CoNLL-2003 NER format and similar |     Convert files in the CoNLL-2003 NER format and similar | ||||||
|     whitespace-separated columns into JSON format for use with train cli. |     whitespace-separated columns into JSON format for use with train cli. | ||||||
|  |  | ||||||
|  | @ -35,6 +35,10 @@ from .train import _load_pretrained_tok2vec | ||||||
|     output_dir=("Directory to write models to on each epoch", "positional", None, str), |     output_dir=("Directory to write models to on each epoch", "positional", None, str), | ||||||
|     width=("Width of CNN layers", "option", "cw", int), |     width=("Width of CNN layers", "option", "cw", int), | ||||||
|     depth=("Depth of CNN layers", "option", "cd", int), |     depth=("Depth of CNN layers", "option", "cd", int), | ||||||
|  |     cnn_window=("Window size for CNN layers", "option", "cW", int), | ||||||
|  |     cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int), | ||||||
|  |     use_chars=("Whether to use character-based embedding", "flag", "chr", bool), | ||||||
|  |     sa_depth=("Depth of self-attention layers", "option", "sa", int), | ||||||
|     bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), |     bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), | ||||||
|     embed_rows=("Number of embedding rows", "option", "er", int), |     embed_rows=("Number of embedding rows", "option", "er", int), | ||||||
|     loss_func=( |     loss_func=( | ||||||
|  | @ -81,7 +85,11 @@ def pretrain( | ||||||
|     output_dir, |     output_dir, | ||||||
|     width=96, |     width=96, | ||||||
|     depth=4, |     depth=4, | ||||||
|     bilstm_depth=2, |     bilstm_depth=0, | ||||||
|  |     cnn_pieces=3, | ||||||
|  |     sa_depth=0, | ||||||
|  |     use_chars=False, | ||||||
|  |     cnn_window=1, | ||||||
|     embed_rows=2000, |     embed_rows=2000, | ||||||
|     loss_func="cosine", |     loss_func="cosine", | ||||||
|     use_vectors=False, |     use_vectors=False, | ||||||
|  | @ -158,8 +166,8 @@ def pretrain( | ||||||
|             conv_depth=depth, |             conv_depth=depth, | ||||||
|             pretrained_vectors=pretrained_vectors, |             pretrained_vectors=pretrained_vectors, | ||||||
|             bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental. |             bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental. | ||||||
|             cnn_maxout_pieces=3,  # You can try setting this higher |             subword_features=not use_chars,  # Set to False for Chinese etc | ||||||
|             subword_features=True,  # Set to False for Chinese etc |             cnn_maxout_pieces=cnn_pieces,  # If set to 1, use Mish activation. | ||||||
|         ), |         ), | ||||||
|     ) |     ) | ||||||
|     # Load in pretrained weights |     # Load in pretrained weights | ||||||
|  |  | ||||||
|  | @ -156,8 +156,7 @@ def train( | ||||||
|                 "`lang` argument ('{}') ".format(nlp.lang, lang), |                 "`lang` argument ('{}') ".format(nlp.lang, lang), | ||||||
|                 exits=1, |                 exits=1, | ||||||
|             ) |             ) | ||||||
|         other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline] |         nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline]) | ||||||
|         nlp.disable_pipes(*other_pipes) |  | ||||||
|         for pipe in pipeline: |         for pipe in pipeline: | ||||||
|             if pipe not in nlp.pipe_names: |             if pipe not in nlp.pipe_names: | ||||||
|                 if pipe == "parser": |                 if pipe == "parser": | ||||||
|  | @ -263,7 +262,11 @@ def train( | ||||||
|                 exits=1, |                 exits=1, | ||||||
|             ) |             ) | ||||||
|         train_docs = corpus.train_docs( |         train_docs = corpus.train_docs( | ||||||
|             nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0 |             nlp, | ||||||
|  |             noise_level=noise_level, | ||||||
|  |             gold_preproc=gold_preproc, | ||||||
|  |             max_length=0, | ||||||
|  |             ignore_misaligned=True, | ||||||
|         ) |         ) | ||||||
|         train_labels = set() |         train_labels = set() | ||||||
|         if textcat_multilabel: |         if textcat_multilabel: | ||||||
|  | @ -344,6 +347,7 @@ def train( | ||||||
|                 orth_variant_level=orth_variant_level, |                 orth_variant_level=orth_variant_level, | ||||||
|                 gold_preproc=gold_preproc, |                 gold_preproc=gold_preproc, | ||||||
|                 max_length=0, |                 max_length=0, | ||||||
|  |                 ignore_misaligned=True, | ||||||
|             ) |             ) | ||||||
|             if raw_text: |             if raw_text: | ||||||
|                 random.shuffle(raw_text) |                 random.shuffle(raw_text) | ||||||
|  | @ -382,7 +386,11 @@ def train( | ||||||
|                         if hasattr(component, "cfg"): |                         if hasattr(component, "cfg"): | ||||||
|                             component.cfg["beam_width"] = beam_width |                             component.cfg["beam_width"] = beam_width | ||||||
|                     dev_docs = list( |                     dev_docs = list( | ||||||
|                         corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc) |                         corpus.dev_docs( | ||||||
|  |                             nlp_loaded, | ||||||
|  |                             gold_preproc=gold_preproc, | ||||||
|  |                             ignore_misaligned=True, | ||||||
|  |                         ) | ||||||
|                     ) |                     ) | ||||||
|                     nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) |                     nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) | ||||||
|                     start_time = timer() |                     start_time = timer() | ||||||
|  | @ -399,7 +407,11 @@ def train( | ||||||
|                                 if hasattr(component, "cfg"): |                                 if hasattr(component, "cfg"): | ||||||
|                                     component.cfg["beam_width"] = beam_width |                                     component.cfg["beam_width"] = beam_width | ||||||
|                             dev_docs = list( |                             dev_docs = list( | ||||||
|                                 corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc) |                                 corpus.dev_docs( | ||||||
|  |                                     nlp_loaded, | ||||||
|  |                                     gold_preproc=gold_preproc, | ||||||
|  |                                     ignore_misaligned=True, | ||||||
|  |                                 ) | ||||||
|                             ) |                             ) | ||||||
|                             start_time = timer() |                             start_time = timer() | ||||||
|                             scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) |                             scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) | ||||||
|  |  | ||||||
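The `train` change above passes one list to `nlp.disable_pipes` instead of unpacking it into positional arguments. A function can accept both call styles by normalizing its arguments first. The sketch below shows one way to do that; `normalize_names` is a hypothetical helper, and the approach is an assumption for illustration rather than a copy of spaCy's code.

```python
def normalize_names(*names):
    """Accept either f("a", "b") or f(["a", "b"]) and return a flat list."""
    # A single list or tuple argument is treated as the full set of names.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    return list(names)


print(normalize_names("tagger", "parser"))
print(normalize_names(["tagger", "parser"]))
```

Both calls yield the same list, so existing callers that unpack with `*` keep working alongside callers that pass a list.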
|  | @ -12,6 +12,7 @@ import os | ||||||
| import sys | import sys | ||||||
| import itertools | import itertools | ||||||
| import ast | import ast | ||||||
|  | import types | ||||||
| 
 | 
 | ||||||
| from thinc.neural.util import copy_array | from thinc.neural.util import copy_array | ||||||
| 
 | 
 | ||||||
|  | @ -67,6 +68,7 @@ if is_python2: | ||||||
|     basestring_ = basestring  # noqa: F821 |     basestring_ = basestring  # noqa: F821 | ||||||
|     input_ = raw_input  # noqa: F821 |     input_ = raw_input  # noqa: F821 | ||||||
|     path2str = lambda path: str(path).decode("utf8") |     path2str = lambda path: str(path).decode("utf8") | ||||||
|  |     class_types = (type, types.ClassType) | ||||||
| 
 | 
 | ||||||
| elif is_python3: | elif is_python3: | ||||||
|     bytes_ = bytes |     bytes_ = bytes | ||||||
|  | @ -74,6 +76,7 @@ elif is_python3: | ||||||
|     basestring_ = str |     basestring_ = str | ||||||
|     input_ = input |     input_ = input | ||||||
|     path2str = lambda path: str(path) |     path2str = lambda path: str(path) | ||||||
|  |     class_types = (type, types.ClassType) if is_python_pre_3_5 else type | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def b_to_str(b_str): | def b_to_str(b_str): | ||||||
|  |  | ||||||
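A tuple such as `class_types` is typically consumed by `isinstance` to ask whether an object is a class rather than an instance. The sketch below is a simplified, standalone version of the idea. The version check via `sys.version_info` and the `is_class` helper are assumptions made for illustration and are not taken from the diff.

```python
import sys
import types

if sys.version_info[0] == 2:
    # Old-style classes exist only on Python 2 and are not instances of `type`.
    class_types = (type, types.ClassType)
else:
    class_types = type


def is_class(obj):
    """Hypothetical helper: True if obj is a class, False for instances."""
    return isinstance(obj, class_types)


class Example(object):
    pass


print(is_class(Example))    # True
print(is_class(Example()))  # False
```

`isinstance` accepts either a single type or a tuple of types, so the same call works for both branches.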
|  | @ -44,14 +44,14 @@ TPL_ENTS = """ | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| TPL_ENT = """ | TPL_ENT = """ | ||||||
| <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | ||||||
|     {text} |     {text} | ||||||
|     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span> |     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span> | ||||||
| </mark> | </mark> | ||||||
| """ | """ | ||||||
| 
 | 
 | ||||||
| TPL_ENT_RTL = """ | TPL_ENT_RTL = """ | ||||||
| <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"> | ||||||
|     {text} |     {text} | ||||||
|     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span> |     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span> | ||||||
| </mark> | </mark> | ||||||
|  |  | ||||||
|  | @ -99,6 +99,8 @@ class Warnings(object): | ||||||
|             "'n_process' will be set to 1.") |             "'n_process' will be set to 1.") | ||||||
|     W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " |     W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " | ||||||
|             "the Knowledge Base.") |             "the Knowledge Base.") | ||||||
|  |     W025 = ("'{name}' requires '{attr}' to be assigned, but none of the " | ||||||
|  |             "previous components in the pipeline declare that they assign it.") | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| @add_codes | @add_codes | ||||||
|  | @ -504,6 +506,29 @@ class Errors(object): | ||||||
|     E175 = ("Can't remove rule for unknown match pattern ID: {key}") |     E175 = ("Can't remove rule for unknown match pattern ID: {key}") | ||||||
|     E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") |     E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") | ||||||
|     E177 = ("Ill-formed IOB input detected: {tag}") |     E177 = ("Ill-formed IOB input detected: {tag}") | ||||||
|  |     E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you " | ||||||
|  |             "accidentally passed a single pattern to Matcher.add instead of a " | ||||||
|  |             "list of patterns? If you only want to add one pattern, make sure " | ||||||
|  |             "to wrap it in a list. For example: matcher.add('{key}', [pattern])") | ||||||
|  |     E179 = ("Invalid pattern. Expected a list of Doc objects but got a single " | ||||||
|  |             "Doc. If you only want to add one pattern, make sure to wrap it " | ||||||
|  |             "in a list. For example: matcher.add('{key}', [doc])") | ||||||
|  |     E180 = ("Span attributes can't be declared as required or assigned by " | ||||||
|  |             "components, since spans are only views of the Doc. Use Doc and " | ||||||
|  |             "Token attributes (or custom extension attributes) only and remove " | ||||||
|  |             "the following: {attrs}") | ||||||
|  |     E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. " | ||||||
|  |             "Only Doc and Token attributes are supported.") | ||||||
|  |     E182 = ("Received invalid attribute declaration: {attr}\nDid you forget " | ||||||
|  |             "to define the attribute? For example: {attr}.???") | ||||||
|  |     E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level " | ||||||
|  |             "attributes are supported, for example: {solution}") | ||||||
|  |     E184 = ("Only attributes without underscores are supported in component " | ||||||
|  |             "attribute declarations (because underscore and non-underscore " | ||||||
|  |             "attributes are connected anyway): {attr} -> {solution}") | ||||||
|  |     E185 = ("Received invalid attribute in component attribute declaration: " | ||||||
|  |             "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") | ||||||
|  |     E186 = ("'{tok_a}' and '{tok_b}' are different texts.") | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| @add_codes | @add_codes | ||||||
|  | @ -536,6 +561,10 @@ class MatchPatternError(ValueError): | ||||||
|         ValueError.__init__(self, msg) |         ValueError.__init__(self, msg) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | class AlignmentError(ValueError): | ||||||
|  |     pass | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| class ModelsWarning(UserWarning): | class ModelsWarning(UserWarning): | ||||||
|     pass |     pass | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -80,7 +80,7 @@ GLOSSARY = { | ||||||
|     "RBR": "adverb, comparative", |     "RBR": "adverb, comparative", | ||||||
|     "RBS": "adverb, superlative", |     "RBS": "adverb, superlative", | ||||||
|     "RP": "adverb, particle", |     "RP": "adverb, particle", | ||||||
|     "TO": "infinitival to", |     "TO": 'infinitival "to"', | ||||||
|     "UH": "interjection", |     "UH": "interjection", | ||||||
|     "VB": "verb, base form", |     "VB": "verb, base form", | ||||||
|     "VBD": "verb, past tense", |     "VBD": "verb, past tense", | ||||||
|  | @ -279,6 +279,12 @@ GLOSSARY = { | ||||||
|     "re": "repeated element", |     "re": "repeated element", | ||||||
|     "rs": "reported speech", |     "rs": "reported speech", | ||||||
|     "sb": "subject", |     "sb": "subject", | ||||||
|  |     "sb": "subject", | ||||||
|  |     "sbp": "passivized subject (PP)", | ||||||
|  |     "sp": "subject or predicate", | ||||||
|  |     "svp": "separable verb prefix", | ||||||
|  |     "uc": "unit component", | ||||||
|  |     "vo": "vocative", | ||||||
|     # Named Entity Recognition |     # Named Entity Recognition | ||||||
|     # OntoNotes 5 |     # OntoNotes 5 | ||||||
|     # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf |     # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf | ||||||
|  |  | ||||||
							
								
								
									
178 spacy/gold.pyx
|  | @ -11,10 +11,9 @@ import itertools | ||||||
| from pathlib import Path | from pathlib import Path | ||||||
| import srsly | import srsly | ||||||
| 
 | 
 | ||||||
| from . import _align |  | ||||||
| from .syntax import nonproj | from .syntax import nonproj | ||||||
| from .tokens import Doc, Span | from .tokens import Doc, Span | ||||||
| from .errors import Errors | from .errors import Errors, AlignmentError | ||||||
| from .compat import path2str | from .compat import path2str | ||||||
| from . import util | from . import util | ||||||
| from .util import minibatch, itershuffle | from .util import minibatch, itershuffle | ||||||
|  | @ -22,6 +21,7 @@ from .util import minibatch, itershuffle | ||||||
| from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek | from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | USE_NEW_ALIGN = False | ||||||
| punct_re = re.compile(r"\W") | punct_re = re.compile(r"\W") | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -56,10 +56,10 @@ def tags_to_entities(tags): | ||||||
| 
 | 
 | ||||||
| def merge_sents(sents): | def merge_sents(sents): | ||||||
|     m_deps = [[], [], [], [], [], []] |     m_deps = [[], [], [], [], [], []] | ||||||
|  |     m_cats = {} | ||||||
|     m_brackets = [] |     m_brackets = [] | ||||||
|     m_cats = sents.pop() |  | ||||||
|     i = 0 |     i = 0 | ||||||
|     for (ids, words, tags, heads, labels, ner), brackets in sents: |     for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents: | ||||||
|         m_deps[0].extend(id_ + i for id_ in ids) |         m_deps[0].extend(id_ + i for id_ in ids) | ||||||
|         m_deps[1].extend(words) |         m_deps[1].extend(words) | ||||||
|         m_deps[2].extend(tags) |         m_deps[2].extend(tags) | ||||||
|  | @ -68,12 +68,26 @@ def merge_sents(sents): | ||||||
|         m_deps[5].extend(ner) |         m_deps[5].extend(ner) | ||||||
|         m_brackets.extend((b["first"] + i, b["last"] + i, b["label"]) |         m_brackets.extend((b["first"] + i, b["last"] + i, b["label"]) | ||||||
|                           for b in brackets) |                           for b in brackets) | ||||||
|  |         m_cats.update(cats) | ||||||
|         i += len(ids) |         i += len(ids) | ||||||
|     m_deps.append(m_cats) |     return [(m_deps, (m_cats, m_brackets))] | ||||||
|     return [(m_deps, m_brackets)] |  | ||||||
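The new `merge_sents` expects each sentence entry to carry its categories next to the brackets, as `(sent_tuples, (cats, brackets))`, rather than as a trailing element of the list. A minimal standalone sketch of that layout follows. The two lines that extend the heads and labels fall outside the hunk shown here and are an assumption, and the sample data is made up for illustration.

```python
def merge_sents(sents):
    """Merge per-sentence annotations into one entry.

    Each item of `sents` is assumed to look like
    ((ids, words, tags, heads, labels, ner), (cats, brackets)).
    """
    m_deps = [[], [], [], [], [], []]
    m_cats = {}
    m_brackets = []
    i = 0  # token offset of the current sentence within the merged entry
    for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents:
        m_deps[0].extend(id_ + i for id_ in ids)
        m_deps[1].extend(words)
        m_deps[2].extend(tags)
        # Assumed: these two lines are not visible in the hunk above.
        m_deps[3].extend(head + i for head in heads)
        m_deps[4].extend(labels)
        m_deps[5].extend(ner)
        m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
                          for b in brackets)
        m_cats.update(cats)
        i += len(ids)
    return [(m_deps, (m_cats, m_brackets))]


# Hypothetical sample input: two sentences, each with its own cats.
sents = [
    (([0, 1], ["Hello", "world"], ["UH", "NN"], [1, 1], ["intj", "ROOT"], ["O", "O"]),
     ({"greeting": 1.0}, [])),
    (([0], ["Bye"], ["UH"], [0], ["ROOT"], ["O"]),
     ({"farewell": 1.0}, [{"first": 0, "last": 0, "label": "X"}])),
]
merged = merge_sents(sents)
(m_deps, (m_cats, m_brackets)) = merged[0]
```

Since the categories are merged with `dict.update`, a label that appears in more than one sentence keeps the value from the last sentence.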
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def align(tokens_a, tokens_b): | _ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")] | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def _normalize_for_alignment(tokens): | ||||||
|  |     tokens = [w.replace(" ", "").lower() for w in tokens] | ||||||
|  |     output = [] | ||||||
|  |     for token in tokens: | ||||||
|  |         token = token.replace(" ", "").lower() | ||||||
|  |         for before, after in _ALIGNMENT_NORM_MAP: | ||||||
|  |             token = token.replace(before, after) | ||||||
|  |         output.append(token) | ||||||
|  |     return output | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def _align_before_v2_2_2(tokens_a, tokens_b): | ||||||
|     """Calculate alignment tables between two tokenizations, using the Levenshtein |     """Calculate alignment tables between two tokenizations, using the Levenshtein | ||||||
|     algorithm. The alignment is case-insensitive. |     algorithm. The alignment is case-insensitive. | ||||||
| 
 | 
 | ||||||
|  | @ -92,6 +106,7 @@ def align(tokens_a, tokens_b): | ||||||
|       * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other |       * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other | ||||||
|             direction. |             direction. | ||||||
|     """ |     """ | ||||||
|  |     from . import _align | ||||||
|     if tokens_a == tokens_b: |     if tokens_a == tokens_b: | ||||||
|         alignment = numpy.arange(len(tokens_a)) |         alignment = numpy.arange(len(tokens_a)) | ||||||
|         return 0, alignment, alignment, {}, {} |         return 0, alignment, alignment, {}, {} | ||||||
|  | @ -111,6 +126,82 @@ def align(tokens_a, tokens_b): | ||||||
|     return cost, i2j, j2i, i2j_multi, j2i_multi |     return cost, i2j, j2i, i2j_multi, j2i_multi | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def align(tokens_a, tokens_b): | ||||||
|  |     """Calculate alignment tables between two tokenizations. | ||||||
|  | 
 | ||||||
|  |     tokens_a (List[str]): The candidate tokenization. | ||||||
|  |     tokens_b (List[str]): The reference tokenization. | ||||||
|  |     RETURNS: (tuple): A 5-tuple consisting of the following information: | ||||||
|  |       * cost (int): The number of misaligned tokens. | ||||||
|  |       * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`. | ||||||
|  |         For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns | ||||||
|  |         to `tokens_b[6]`. If there's no one-to-one alignment for a token, | ||||||
|  |         it has the value -1. | ||||||
|  |       * b2a (List[int]): The same as `a2b`, but mapping the other direction. | ||||||
|  |       * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a` | ||||||
|  |         to indices in `tokens_b`, where multiple tokens of `tokens_a` align to | ||||||
|  |         the same token of `tokens_b`. | ||||||
|  |       * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other | ||||||
|  |         direction. | ||||||
|  |     """ | ||||||
|  |     if not USE_NEW_ALIGN: | ||||||
|  |         return _align_before_v2_2_2(tokens_a, tokens_b) | ||||||
|  |     tokens_a = _normalize_for_alignment(tokens_a) | ||||||
|  |     tokens_b = _normalize_for_alignment(tokens_b) | ||||||
|  |     cost = 0 | ||||||
|  |     a2b = numpy.empty(len(tokens_a), dtype="i") | ||||||
|  |     b2a = numpy.empty(len(tokens_b), dtype="i") | ||||||
|  |     a2b_multi = {} | ||||||
|  |     b2a_multi = {} | ||||||
|  |     i = 0 | ||||||
|  |     j = 0 | ||||||
|  |     offset_a = 0 | ||||||
|  |     offset_b = 0 | ||||||
|  |     while i < len(tokens_a) and j < len(tokens_b): | ||||||
|  |         a = tokens_a[i][offset_a:] | ||||||
|  |         b = tokens_b[j][offset_b:] | ||||||
|  |         a2b[i] = b2a[j] = -1 | ||||||
|  |         if a == b: | ||||||
|  |             if offset_a == offset_b == 0: | ||||||
|  |                 a2b[i] = j | ||||||
|  |                 b2a[j] = i | ||||||
|  |             elif offset_a == 0: | ||||||
|  |                 cost += 2 | ||||||
|  |                 a2b_multi[i] = j | ||||||
|  |             elif offset_b == 0: | ||||||
|  |                 cost += 2 | ||||||
|  |                 b2a_multi[j] = i | ||||||
|  |             offset_a = offset_b = 0 | ||||||
|  |             i += 1 | ||||||
|  |             j += 1 | ||||||
|  |         elif a == "": | ||||||
|  |             assert offset_a == 0 | ||||||
|  |             cost += 1 | ||||||
|  |             i += 1 | ||||||
|  |         elif b == "": | ||||||
|  |             assert offset_b == 0 | ||||||
|  |             cost += 1 | ||||||
|  |             j += 1 | ||||||
|  |         elif b.startswith(a): | ||||||
|  |             cost += 1 | ||||||
|  |             if offset_a == 0: | ||||||
|  |                 a2b_multi[i] = j | ||||||
|  |             i += 1 | ||||||
|  |             offset_a = 0 | ||||||
|  |             offset_b += len(a) | ||||||
|  |         elif a.startswith(b): | ||||||
|  |             cost += 1 | ||||||
|  |             if offset_b == 0: | ||||||
|  |                 b2a_multi[j] = i | ||||||
|  |             j += 1 | ||||||
|  |             offset_b = 0 | ||||||
|  |             offset_a += len(b) | ||||||
|  |         else: | ||||||
|  |             assert "".join(tokens_a) != "".join(tokens_b) | ||||||
|  |             raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b)) | ||||||
|  |     return cost, a2b, b2a, a2b_multi, b2a_multi | ||||||
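The new `align` walks both tokenizations in parallel and compares the unconsumed remainder of the current token on each side. A pure-Python sketch of the same loop follows, assuming plain lists pre-filled with -1 in place of the `numpy.empty` arrays, and a local `AlignmentError` with a simplified message in place of `Errors.E186`. It is meant to illustrate the control flow, not to replace the Cython module.

```python
class AlignmentError(ValueError):
    pass


_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]


def normalize(tokens):
    """Lowercase, strip spaces and unify quote characters."""
    output = []
    for token in tokens:
        token = token.replace(" ", "").lower()
        for before, after in _NORM_MAP:
            token = token.replace(before, after)
        output.append(token)
    return output


def align(tokens_a, tokens_b):
    tokens_a = normalize(tokens_a)
    tokens_b = normalize(tokens_b)
    cost = 0
    a2b = [-1] * len(tokens_a)  # -1 means "no one-to-one alignment"
    b2a = [-1] * len(tokens_b)
    a2b_multi = {}
    b2a_multi = {}
    i = j = 0
    offset_a = offset_b = 0  # characters already consumed in the current tokens
    while i < len(tokens_a) and j < len(tokens_b):
        a = tokens_a[i][offset_a:]
        b = tokens_b[j][offset_b:]
        if a == b:
            if offset_a == offset_b == 0:
                a2b[i] = j
                b2a[j] = i
            elif offset_a == 0:
                cost += 2
                a2b_multi[i] = j
            elif offset_b == 0:
                cost += 2
                b2a_multi[j] = i
            offset_a = offset_b = 0
            i += 1
            j += 1
        elif a == "":
            cost += 1
            i += 1
        elif b == "":
            cost += 1
            j += 1
        elif b.startswith(a):
            # Token a is a prefix of what is left of token b.
            cost += 1
            if offset_a == 0:
                a2b_multi[i] = j
            i += 1
            offset_a = 0
            offset_b += len(a)
        elif a.startswith(b):
            # Token b is a prefix of what is left of token a.
            cost += 1
            if offset_b == 0:
                b2a_multi[j] = i
            j += 1
            offset_b = 0
            offset_a += len(b)
        else:
            raise AlignmentError("%r and %r are different texts." % (tokens_a, tokens_b))
    return cost, a2b, b2a, a2b_multi, b2a_multi


# "a" and "b" on one side both map onto "ab" on the other side.
result = align(["a", "b", "c"], ["ab", "c"])
# result == (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})
```

Unlike the Levenshtein-based version, this approach requires both tokenizations to spell the same text after normalization; any other mismatch raises `AlignmentError` instead of producing a costly alignment.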
|  | 
 | ||||||
|  | 
 | ||||||
| class GoldCorpus(object): | class GoldCorpus(object): | ||||||
|     """An annotated corpus, using the JSON file format. Manages |     """An annotated corpus, using the JSON file format. Manages | ||||||
|     annotations for tagging, dependency parsing and NER. |     annotations for tagging, dependency parsing and NER. | ||||||
|  | @ -176,6 +267,11 @@ class GoldCorpus(object): | ||||||
|                 gold_tuples = read_json_file(loc) |                 gold_tuples = read_json_file(loc) | ||||||
|             elif loc.parts[-1].endswith("jsonl"): |             elif loc.parts[-1].endswith("jsonl"): | ||||||
|                 gold_tuples = srsly.read_jsonl(loc) |                 gold_tuples = srsly.read_jsonl(loc) | ||||||
|  |                 first_gold_tuple = next(gold_tuples) | ||||||
|  |                 gold_tuples = itertools.chain([first_gold_tuple], gold_tuples) | ||||||
|  |                 # TODO: proper format checks with schemas | ||||||
|  |                 if isinstance(first_gold_tuple, dict): | ||||||
|  |                     gold_tuples = read_json_object(gold_tuples) | ||||||
|             elif loc.parts[-1].endswith("msg"): |             elif loc.parts[-1].endswith("msg"): | ||||||
|                 gold_tuples = srsly.read_msgpack(loc) |                 gold_tuples = srsly.read_msgpack(loc) | ||||||
|             else: |             else: | ||||||
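The JSONL branch above inspects the first record to decide how to parse the rest, then puts that record back with `itertools.chain`. The same peek pattern in isolation might look like the sketch below; `peek` and `read_records` are hypothetical names used only for this illustration.

```python
import itertools


def peek(iterable):
    """Return the first item and an iterator that still yields every item.

    Raises StopIteration if the input is empty, as a bare next() does.
    """
    iterator = iter(iterable)
    first = next(iterator)
    return first, itertools.chain([first], iterator)


def read_records():
    # Stand-in for a lazy reader such as a JSONL generator.
    yield {"text": "one"}
    yield {"text": "two"}


first, records = peek(read_records())
is_dict_format = isinstance(first, dict)
all_records = list(records)
```

Because the reader is a generator, calling `next` on it consumes the first record; chaining it back in front keeps the downstream code unaware that anything was inspected.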
|  | @ -201,7 +297,6 @@ class GoldCorpus(object): | ||||||
|         n = 0 |         n = 0 | ||||||
|         i = 0 |         i = 0 | ||||||
|         for raw_text, paragraph_tuples in self.train_tuples: |         for raw_text, paragraph_tuples in self.train_tuples: | ||||||
|             cats = paragraph_tuples.pop() |  | ||||||
|             for sent_tuples, brackets in paragraph_tuples: |             for sent_tuples, brackets in paragraph_tuples: | ||||||
|                 n += len(sent_tuples[1]) |                 n += len(sent_tuples[1]) | ||||||
|                 if self.limit and i >= self.limit: |                 if self.limit and i >= self.limit: | ||||||
|  | @ -210,7 +305,8 @@ class GoldCorpus(object): | ||||||
|         return n |         return n | ||||||
| 
 | 
 | ||||||
|     def train_docs(self, nlp, gold_preproc=False, max_length=None, |     def train_docs(self, nlp, gold_preproc=False, max_length=None, | ||||||
|                     noise_level=0.0, orth_variant_level=0.0): |                     noise_level=0.0, orth_variant_level=0.0, | ||||||
|  |                     ignore_misaligned=False): | ||||||
|         locs = list((self.tmp_dir / 'train').iterdir()) |         locs = list((self.tmp_dir / 'train').iterdir()) | ||||||
|         random.shuffle(locs) |         random.shuffle(locs) | ||||||
|         train_tuples = self.read_tuples(locs, limit=self.limit) |         train_tuples = self.read_tuples(locs, limit=self.limit) | ||||||
|  | @ -218,20 +314,23 @@ class GoldCorpus(object): | ||||||
|                                         max_length=max_length, |                                         max_length=max_length, | ||||||
|                                         noise_level=noise_level, |                                         noise_level=noise_level, | ||||||
|                                         orth_variant_level=orth_variant_level, |                                         orth_variant_level=orth_variant_level, | ||||||
|                                         make_projective=True) |                                         make_projective=True, | ||||||
|  |                                         ignore_misaligned=ignore_misaligned) | ||||||
|         yield from gold_docs |         yield from gold_docs | ||||||
| 
 | 
 | ||||||
|     def train_docs_without_preprocessing(self, nlp, gold_preproc=False): |     def train_docs_without_preprocessing(self, nlp, gold_preproc=False): | ||||||
|         gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc) |         gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc) | ||||||
|         yield from gold_docs |         yield from gold_docs | ||||||
| 
 | 
 | ||||||
|     def dev_docs(self, nlp, gold_preproc=False): |     def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False): | ||||||
|         gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc) |         gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc, | ||||||
|  |                                         ignore_misaligned=ignore_misaligned) | ||||||
|         yield from gold_docs |         yield from gold_docs | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, |     def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, | ||||||
|                        noise_level=0.0, orth_variant_level=0.0, make_projective=False): |                        noise_level=0.0, orth_variant_level=0.0, make_projective=False, | ||||||
|  |                        ignore_misaligned=False): | ||||||
|         for raw_text, paragraph_tuples in tuples: |         for raw_text, paragraph_tuples in tuples: | ||||||
|             if gold_preproc: |             if gold_preproc: | ||||||
|                 raw_text = None |                 raw_text = None | ||||||
|  | @ -240,10 +339,12 @@ class GoldCorpus(object): | ||||||
|             docs, paragraph_tuples = cls._make_docs(nlp, raw_text, |             docs, paragraph_tuples = cls._make_docs(nlp, raw_text, | ||||||
|                     paragraph_tuples, gold_preproc, noise_level=noise_level, |                     paragraph_tuples, gold_preproc, noise_level=noise_level, | ||||||
|                     orth_variant_level=orth_variant_level) |                     orth_variant_level=orth_variant_level) | ||||||
|             golds = cls._make_golds(docs, paragraph_tuples, make_projective) |             golds = cls._make_golds(docs, paragraph_tuples, make_projective, | ||||||
|  |                                     ignore_misaligned=ignore_misaligned) | ||||||
|             for doc, gold in zip(docs, golds): |             for doc, gold in zip(docs, golds): | ||||||
|                 if (not max_length) or len(doc) < max_length: |                 if gold is not None: | ||||||
|                     yield doc, gold |                     if (not max_length) or len(doc) < max_length: | ||||||
|  |                         yield doc, gold | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): |     def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): | ||||||
|  | @ -259,14 +360,22 @@ class GoldCorpus(object): | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def _make_golds(cls, docs, paragraph_tuples, make_projective): |     def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False): | ||||||
|         if len(docs) != len(paragraph_tuples): |         if len(docs) != len(paragraph_tuples): | ||||||
|             n_annots = len(paragraph_tuples) |             n_annots = len(paragraph_tuples) | ||||||
|             raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) |             raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) | ||||||
|         return [GoldParse.from_annot_tuples(doc, sent_tuples, |         golds = [] | ||||||
|                                                 make_projective=make_projective) |         for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples): | ||||||
|                     for doc, (sent_tuples, brackets) |             try: | ||||||
|                     in zip(docs, paragraph_tuples)] |                 gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats, | ||||||
|  |                     make_projective=make_projective) | ||||||
|  |             except AlignmentError: | ||||||
|  |                 if ignore_misaligned: | ||||||
|  |                     gold = None | ||||||
|  |                 else: | ||||||
|  |                     raise | ||||||
|  |             golds.append(gold) | ||||||
|  |         return golds | ||||||
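With `ignore_misaligned=True`, `_make_golds` records `None` for a document whose annotations cannot be aligned, and `iter_gold_docs` then skips those entries. A reduced sketch of that contract is shown below. `build_gold` is a hypothetical stand-in for `GoldParse.from_annot_tuples`, and the token lists are made-up data.

```python
class AlignmentError(ValueError):
    pass


def make_golds(docs, annotations, build_gold, ignore_misaligned=False):
    """Build one gold object per doc, or None where alignment fails."""
    golds = []
    for doc, annot in zip(docs, annotations):
        try:
            gold = build_gold(doc, annot)
        except AlignmentError:
            if not ignore_misaligned:
                raise
            gold = None
        golds.append(gold)
    return golds


def iter_pairs(docs, golds):
    """Yield only the (doc, gold) pairs that could be aligned."""
    for doc, gold in zip(docs, golds):
        if gold is not None:
            yield doc, gold


def build_gold(doc, annot):
    # Hypothetical check: both tokenizations must spell the same text.
    if "".join(doc) != "".join(annot):
        raise AlignmentError("different texts")
    return {"words": annot}


docs = [["a", "b"], ["c", "d"]]
annotations = [["ab"], ["x"]]
golds = make_golds(docs, annotations, build_gold, ignore_misaligned=True)
pairs = list(iter_pairs(docs, golds))
```

Keeping the `None` placeholder in the list, rather than dropping the entry, keeps `golds` the same length as `docs` so the later `zip` stays in step.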
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): | def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): | ||||||
|  | @ -281,7 +390,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): | ||||||
|     # modify words in paragraph_tuples |     # modify words in paragraph_tuples | ||||||
|     variant_paragraph_tuples = [] |     variant_paragraph_tuples = [] | ||||||
|     for sent_tuples, brackets in paragraph_tuples: |     for sent_tuples, brackets in paragraph_tuples: | ||||||
|         ids, words, tags, heads, labels, ner, cats = sent_tuples |         ids, words, tags, heads, labels, ner = sent_tuples | ||||||
|         if lower: |         if lower: | ||||||
|             words = [w.lower() for w in words] |             words = [w.lower() for w in words] | ||||||
|         # single variants |         # single variants | ||||||
|  | @ -310,7 +419,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): | ||||||
|                                 pair_idx = pair.index(words[word_idx]) |                                 pair_idx = pair.index(words[word_idx]) | ||||||
|                     words[word_idx] = punct_choices[punct_idx][pair_idx] |                     words[word_idx] = punct_choices[punct_idx][pair_idx] | ||||||
| 
 | 
 | ||||||
|         variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets)) |         variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets)) | ||||||
|     # modify raw to match variant_paragraph_tuples |     # modify raw to match variant_paragraph_tuples | ||||||
|     if raw is not None: |     if raw is not None: | ||||||
|         variants = [] |         variants = [] | ||||||
|  | @ -329,7 +438,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): | ||||||
|             variant_raw += raw[raw_idx] |             variant_raw += raw[raw_idx] | ||||||
|             raw_idx += 1 |             raw_idx += 1 | ||||||
|         for sent_tuples, brackets in variant_paragraph_tuples: |         for sent_tuples, brackets in variant_paragraph_tuples: | ||||||
|             ids, words, tags, heads, labels, ner, cats = sent_tuples |             ids, words, tags, heads, labels, ner = sent_tuples | ||||||
|             for word in words: |             for word in words: | ||||||
|                 match_found = False |                 match_found = False | ||||||
|                 # add identical word |                 # add identical word | ||||||
|  | @ -400,6 +509,9 @@ def json_to_tuple(doc): | ||||||
|     paragraphs = [] |     paragraphs = [] | ||||||
|     for paragraph in doc["paragraphs"]: |     for paragraph in doc["paragraphs"]: | ||||||
|         sents = [] |         sents = [] | ||||||
|  |         cats = {} | ||||||
|  |         for cat in paragraph.get("cats", {}): | ||||||
|  |             cats[cat["label"]] = cat["value"] | ||||||
|         for sent in paragraph["sentences"]: |         for sent in paragraph["sentences"]: | ||||||
|             words = [] |             words = [] | ||||||
|             ids = [] |             ids = [] | ||||||
|  | @ -419,11 +531,7 @@ def json_to_tuple(doc): | ||||||
|                 ner.append(token.get("ner", "-")) |                 ner.append(token.get("ner", "-")) | ||||||
|             sents.append([ |             sents.append([ | ||||||
|                 [ids, words, tags, heads, labels, ner], |                 [ids, words, tags, heads, labels, ner], | ||||||
|                 sent.get("brackets", [])]) |                 [cats, sent.get("brackets", [])]]) | ||||||
|         cats = {} |  | ||||||
|         for cat in paragraph.get("cats", {}): |  | ||||||
|             cats[cat["label"]] = cat["value"] |  | ||||||
|         sents.append(cats) |  | ||||||
|         if sents: |         if sents: | ||||||
|             yield [paragraph.get("raw", None), sents] |             yield [paragraph.get("raw", None), sents] | ||||||
| 
 | 
 | ||||||
|  | @ -537,8 +645,8 @@ cdef class GoldParse: | ||||||
|     DOCS: https://spacy.io/api/goldparse |     DOCS: https://spacy.io/api/goldparse | ||||||
|     """ |     """ | ||||||
|     @classmethod |     @classmethod | ||||||
|     def from_annot_tuples(cls, doc, annot_tuples, make_projective=False): |     def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False): | ||||||
|         _, words, tags, heads, deps, entities, cats = annot_tuples |         _, words, tags, heads, deps, entities = annot_tuples | ||||||
|         return cls(doc, words=words, tags=tags, heads=heads, deps=deps, |         return cls(doc, words=words, tags=tags, heads=heads, deps=deps, | ||||||
|                    entities=entities, cats=cats, |                    entities=entities, cats=cats, | ||||||
|                    make_projective=make_projective) |                    make_projective=make_projective) | ||||||
|  | @ -595,9 +703,9 @@ cdef class GoldParse: | ||||||
|             if morphology is None: |             if morphology is None: | ||||||
|                 morphology = [None for _ in words] |                 morphology = [None for _ in words] | ||||||
|             if entities is None: |             if entities is None: | ||||||
|                 entities = ["-" for _ in doc] |                 entities = ["-" for _ in words] | ||||||
|             elif len(entities) == 0: |             elif len(entities) == 0: | ||||||
|                 entities = ["O" for _ in doc] |                 entities = ["O" for _ in words] | ||||||
|             else: |             else: | ||||||
|                 # Translate the None values to '-', to make processing easier. |                 # Translate the None values to '-', to make processing easier. | ||||||
|                 # See Issue #2603 |                 # See Issue #2603 | ||||||
|  | @ -660,7 +768,9 @@ cdef class GoldParse: | ||||||
|                             self.heads[i] = i+1 |                             self.heads[i] = i+1 | ||||||
|                             self.labels[i] = "subtok" |                             self.labels[i] = "subtok" | ||||||
|                         else: |                         else: | ||||||
|                             self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]] |                             head_i = heads[i2j_multi[i]] | ||||||
|  |                             if head_i: | ||||||
|  |                                 self.heads[i] = self.gold_to_cand[head_i] | ||||||
|                             self.labels[i] = deps[i2j_multi[i]] |                             self.labels[i] = deps[i2j_multi[i]] | ||||||
|                         # Now set NER... This is annoying because if we've got |                         # Now set NER... This is annoying because if we've got | ||||||
|                         # an entity word split into two, we need to adjust the |                         # an entity word split into two, we need to adjust the | ||||||
|  |  | ||||||
|  | @ -1,8 +1,8 @@ | ||||||
| # coding: utf8 | # coding: utf8 | ||||||
| from __future__ import unicode_literals | from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB | from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X | ||||||
| from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX | from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| TAG_MAP = { | TAG_MAP = { | ||||||
|  | @ -20,8 +20,8 @@ TAG_MAP = { | ||||||
|     "CARD": {POS: NUM, "NumType": "card"}, |     "CARD": {POS: NUM, "NumType": "card"}, | ||||||
|     "FM": {POS: X, "Foreign": "yes"}, |     "FM": {POS: X, "Foreign": "yes"}, | ||||||
|     "ITJ": {POS: INTJ}, |     "ITJ": {POS: INTJ}, | ||||||
|     "KOKOM": {POS: CONJ, "ConjType": "comp"}, |     "KOKOM": {POS: CCONJ, "ConjType": "comp"}, | ||||||
|     "KON": {POS: CONJ}, |     "KON": {POS: CCONJ}, | ||||||
|     "KOUI": {POS: SCONJ}, |     "KOUI": {POS: SCONJ}, | ||||||
|     "KOUS": {POS: SCONJ}, |     "KOUS": {POS: SCONJ}, | ||||||
|     "NE": {POS: PROPN}, |     "NE": {POS: PROPN}, | ||||||
|  | @ -43,7 +43,7 @@ TAG_MAP = { | ||||||
|     "PTKA": {POS: PART}, |     "PTKA": {POS: PART}, | ||||||
|     "PTKANT": {POS: PART, "PartType": "res"}, |     "PTKANT": {POS: PART, "PartType": "res"}, | ||||||
|     "PTKNEG": {POS: PART, "Polarity": "neg"}, |     "PTKNEG": {POS: PART, "Polarity": "neg"}, | ||||||
|     "PTKVZ": {POS: PART, "PartType": "vbp"}, |     "PTKVZ": {POS: ADP, "PartType": "vbp"}, | ||||||
|     "PTKZU": {POS: PART, "PartType": "inf"}, |     "PTKZU": {POS: PART, "PartType": "inf"}, | ||||||
|     "PWAT": {POS: DET, "PronType": "int"}, |     "PWAT": {POS: DET, "PronType": "int"}, | ||||||
|     "PWAV": {POS: ADV, "PronType": "int"}, |     "PWAV": {POS: ADV, "PronType": "int"}, | ||||||
|  |  | ||||||
|  | @ -2,7 +2,7 @@ | ||||||
| from __future__ import unicode_literals | from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB | from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB | ||||||
| from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX | from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| TAG_MAP = { | TAG_MAP = { | ||||||
|  | @ -28,8 +28,8 @@ TAG_MAP = { | ||||||
|     "JJR": {POS: ADJ, "Degree": "comp"}, |     "JJR": {POS: ADJ, "Degree": "comp"}, | ||||||
|     "JJS": {POS: ADJ, "Degree": "sup"}, |     "JJS": {POS: ADJ, "Degree": "sup"}, | ||||||
|     "LS": {POS: X, "NumType": "ord"}, |     "LS": {POS: X, "NumType": "ord"}, | ||||||
|     "MD": {POS: AUX, "VerbType": "mod"}, |     "MD": {POS: VERB, "VerbType": "mod"}, | ||||||
|     "NIL": {POS: ""}, |     "NIL": {POS: X}, | ||||||
|     "NN": {POS: NOUN, "Number": "sing"}, |     "NN": {POS: NOUN, "Number": "sing"}, | ||||||
|     "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, |     "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, | ||||||
|     "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, |     "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, | ||||||
|  | @ -37,7 +37,7 @@ TAG_MAP = { | ||||||
|     "PDT": {POS: DET}, |     "PDT": {POS: DET}, | ||||||
|     "POS": {POS: PART, "Poss": "yes"}, |     "POS": {POS: PART, "Poss": "yes"}, | ||||||
|     "PRP": {POS: PRON, "PronType": "prs"}, |     "PRP": {POS: PRON, "PronType": "prs"}, | ||||||
|     "PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"}, |     "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"}, | ||||||
|     "RB": {POS: ADV, "Degree": "pos"}, |     "RB": {POS: ADV, "Degree": "pos"}, | ||||||
|     "RBR": {POS: ADV, "Degree": "comp"}, |     "RBR": {POS: ADV, "Degree": "comp"}, | ||||||
|     "RBS": {POS: ADV, "Degree": "sup"}, |     "RBS": {POS: ADV, "Degree": "sup"}, | ||||||
|  | @ -58,9 +58,9 @@ TAG_MAP = { | ||||||
|         "Number": "sing", |         "Number": "sing", | ||||||
|         "Person": "three", |         "Person": "three", | ||||||
|     }, |     }, | ||||||
|     "WDT": {POS: PRON}, |     "WDT": {POS: DET}, | ||||||
|     "WP": {POS: PRON}, |     "WP": {POS: PRON}, | ||||||
|     "WP$": {POS: PRON, "Poss": "yes"}, |     "WP$": {POS: DET, "Poss": "yes"}, | ||||||
|     "WRB": {POS: ADV}, |     "WRB": {POS: ADV}, | ||||||
|     "ADD": {POS: X}, |     "ADD": {POS: X}, | ||||||
|     "NFP": {POS: PUNCT}, |     "NFP": {POS: PUNCT}, | ||||||
|  |  | ||||||
|  | @ -18,13 +18,8 @@ from .tokenizer import Tokenizer | ||||||
| from .vocab import Vocab | from .vocab import Vocab | ||||||
| from .lemmatizer import Lemmatizer | from .lemmatizer import Lemmatizer | ||||||
| from .lookups import Lookups | from .lookups import Lookups | ||||||
| from .pipeline import DependencyParser, Tagger | from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs | ||||||
| from .pipeline import Tensorizer, EntityRecognizer, EntityLinker | from .compat import izip, basestring_, is_python2, class_types | ||||||
| from .pipeline import SimilarityHook, TextCategorizer, Sentencizer |  | ||||||
| from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens |  | ||||||
| from .pipeline import EntityRuler |  | ||||||
| from .pipeline import Morphologizer |  | ||||||
| from .compat import izip, basestring_, is_python2 |  | ||||||
| from .gold import GoldParse | from .gold import GoldParse | ||||||
| from .scorer import Scorer | from .scorer import Scorer | ||||||
| from ._ml import link_vectors_to_models, create_default_optimizer | from ._ml import link_vectors_to_models, create_default_optimizer | ||||||
|  | @ -40,6 +35,9 @@ from . import util | ||||||
| from . import about | from . import about | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | ENABLE_PIPELINE_ANALYSIS = False | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| class BaseDefaults(object): | class BaseDefaults(object): | ||||||
|     @classmethod |     @classmethod | ||||||
|     def create_lemmatizer(cls, nlp=None, lookups=None): |     def create_lemmatizer(cls, nlp=None, lookups=None): | ||||||
|  | @ -133,22 +131,7 @@ class Language(object): | ||||||
|     Defaults = BaseDefaults |     Defaults = BaseDefaults | ||||||
|     lang = None |     lang = None | ||||||
| 
 | 
 | ||||||
|     factories = { |     factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)} | ||||||
|         "tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp), |  | ||||||
|         "tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg), |  | ||||||
|         "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg), |  | ||||||
|         "morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg), |  | ||||||
|         "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg), |  | ||||||
|         "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg), |  | ||||||
|         "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg), |  | ||||||
|         "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg), |  | ||||||
|         "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg), |  | ||||||
|         "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg), |  | ||||||
|         "merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks, |  | ||||||
|         "merge_entities": lambda nlp, **cfg: merge_entities, |  | ||||||
|         "merge_subtokens": lambda nlp, **cfg: merge_subtokens, |  | ||||||
|         "entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg), |  | ||||||
|     } |  | ||||||
| 
 | 
 | ||||||
|     def __init__( |     def __init__( | ||||||
|         self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs |         self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs | ||||||
|  | @ -218,6 +201,7 @@ class Language(object): | ||||||
|             "name": self.vocab.vectors.name, |             "name": self.vocab.vectors.name, | ||||||
|         } |         } | ||||||
|         self._meta["pipeline"] = self.pipe_names |         self._meta["pipeline"] = self.pipe_names | ||||||
|  |         self._meta["factories"] = self.pipe_factories | ||||||
|         self._meta["labels"] = self.pipe_labels |         self._meta["labels"] = self.pipe_labels | ||||||
|         return self._meta |         return self._meta | ||||||
| 
 | 
 | ||||||
|  | @ -259,6 +243,17 @@ class Language(object): | ||||||
|         """ |         """ | ||||||
|         return [pipe_name for pipe_name, _ in self.pipeline] |         return [pipe_name for pipe_name, _ in self.pipeline] | ||||||
| 
 | 
 | ||||||
|  |     @property | ||||||
|  |     def pipe_factories(self): | ||||||
|  |         """Get the component factories for the available pipeline components. | ||||||
|  | 
 | ||||||
|  |         RETURNS (dict): Factory names, keyed by component names. | ||||||
|  |         """ | ||||||
|  |         factories = {} | ||||||
|  |         for pipe_name, pipe in self.pipeline: | ||||||
|  |             factories[pipe_name] = getattr(pipe, "factory", pipe_name) | ||||||
|  |         return factories | ||||||
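Note on the hunk above: the new `pipe_factories` property falls back to the component's own name when the component has no `factory` attribute. A minimal standalone sketch of that fallback follows, assuming a pipeline stored as a list of `(name, component)` tuples. `FakeTagger` and `custom_component` are hypothetical stand-ins, not spaCy classes.

```python
def pipe_factories(pipeline):
    """Map each component name to its factory name.

    Falls back to the component name when the component does not
    expose a `factory` attribute.
    """
    factories = {}
    for pipe_name, pipe in pipeline:
        factories[pipe_name] = getattr(pipe, "factory", pipe_name)
    return factories


class FakeTagger(object):  # hypothetical stand-in for a registered component
    factory = "tagger"

    def __call__(self, doc):
        return doc


def custom_component(doc):  # plain function without a `factory` attribute
    return doc


pipeline = [("my_tagger", FakeTagger()), ("custom", custom_component)]
result = pipe_factories(pipeline)
print(result)
```

A renamed component therefore still reports the factory it was created from, while an unregistered function reports its pipeline name.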
|  | 
 | ||||||
|     @property |     @property | ||||||
|     def pipe_labels(self): |     def pipe_labels(self): | ||||||
|         """Get the labels set by the pipeline components, if available (if |         """Get the labels set by the pipeline components, if available (if | ||||||
|  | @ -327,33 +322,30 @@ class Language(object): | ||||||
|                 msg += Errors.E004.format(component=component) |                 msg += Errors.E004.format(component=component) | ||||||
|             raise ValueError(msg) |             raise ValueError(msg) | ||||||
|         if name is None: |         if name is None: | ||||||
|             if hasattr(component, "name"): |             name = util.get_component_name(component) | ||||||
|                 name = component.name |  | ||||||
|             elif hasattr(component, "__name__"): |  | ||||||
|                 name = component.__name__ |  | ||||||
|             elif hasattr(component, "__class__") and hasattr( |  | ||||||
|                 component.__class__, "__name__" |  | ||||||
|             ): |  | ||||||
|                 name = component.__class__.__name__ |  | ||||||
|             else: |  | ||||||
|                 name = repr(component) |  | ||||||
|         if name in self.pipe_names: |         if name in self.pipe_names: | ||||||
|             raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) |             raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) | ||||||
|         if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: |         if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: | ||||||
|             raise ValueError(Errors.E006) |             raise ValueError(Errors.E006) | ||||||
|  |         pipe_index = 0 | ||||||
|         pipe = (name, component) |         pipe = (name, component) | ||||||
|         if last or not any([first, before, after]): |         if last or not any([first, before, after]): | ||||||
|  |             pipe_index = len(self.pipeline) | ||||||
|             self.pipeline.append(pipe) |             self.pipeline.append(pipe) | ||||||
|         elif first: |         elif first: | ||||||
|             self.pipeline.insert(0, pipe) |             self.pipeline.insert(0, pipe) | ||||||
|         elif before and before in self.pipe_names: |         elif before and before in self.pipe_names: | ||||||
|  |             pipe_index = self.pipe_names.index(before) | ||||||
|             self.pipeline.insert(self.pipe_names.index(before), pipe) |             self.pipeline.insert(self.pipe_names.index(before), pipe) | ||||||
|         elif after and after in self.pipe_names: |         elif after and after in self.pipe_names: | ||||||
|  |             pipe_index = self.pipe_names.index(after) + 1 | ||||||
|             self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) |             self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) | ||||||
|         else: |         else: | ||||||
|             raise ValueError( |             raise ValueError( | ||||||
|                 Errors.E001.format(name=before or after, opts=self.pipe_names) |                 Errors.E001.format(name=before or after, opts=self.pipe_names) | ||||||
|             ) |             ) | ||||||
|  |         if ENABLE_PIPELINE_ANALYSIS: | ||||||
|  |             analyze_pipes(self.pipeline, name, component, pipe_index) | ||||||
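Note on the hunk above: `add_pipe` now records the insertion index before mutating the pipeline, so it can be handed to `analyze_pipes`. The index logic can be sketched in isolation as below. `resolve_pipe_index` is a hypothetical helper name and the error messages are placeholders, not spaCy's `Errors` entries.

```python
def resolve_pipe_index(pipe_names, before=None, after=None, first=False, last=False):
    """Return the index at which a new component would be inserted."""
    if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
        raise ValueError("Only one of before/after/first/last may be set")
    if last or not any([first, before, after]):
        return len(pipe_names)  # append at the end
    if first:
        return 0
    if before and before in pipe_names:
        return pipe_names.index(before)
    if after and after in pipe_names:
        return pipe_names.index(after) + 1
    raise ValueError("Unknown component: {}".format(before or after))


names = ["tagger", "parser", "ner"]
print(resolve_pipe_index(names))                   # default: append
print(resolve_pipe_index(names, before="parser"))  # slot of "parser"
print(resolve_pipe_index(names, after="parser"))   # slot after "parser"
```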
| 
 | 
 | ||||||
|     def has_pipe(self, name): |     def has_pipe(self, name): | ||||||
|         """Check if a component name is present in the pipeline. Equivalent to |         """Check if a component name is present in the pipeline. Equivalent to | ||||||
|  | @ -382,6 +374,8 @@ class Language(object): | ||||||
|                 msg += Errors.E135.format(name=name) |                 msg += Errors.E135.format(name=name) | ||||||
|             raise ValueError(msg) |             raise ValueError(msg) | ||||||
|         self.pipeline[self.pipe_names.index(name)] = (name, component) |         self.pipeline[self.pipe_names.index(name)] = (name, component) | ||||||
|  |         if ENABLE_PIPELINE_ANALYSIS: | ||||||
|  |             analyze_all_pipes(self.pipeline) | ||||||
| 
 | 
 | ||||||
|     def rename_pipe(self, old_name, new_name): |     def rename_pipe(self, old_name, new_name): | ||||||
|         """Rename a pipeline component. |         """Rename a pipeline component. | ||||||
|  | @ -408,7 +402,10 @@ class Language(object): | ||||||
|         """ |         """ | ||||||
|         if name not in self.pipe_names: |         if name not in self.pipe_names: | ||||||
|             raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) |             raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) | ||||||
|         return self.pipeline.pop(self.pipe_names.index(name)) |         removed = self.pipeline.pop(self.pipe_names.index(name)) | ||||||
|  |         if ENABLE_PIPELINE_ANALYSIS: | ||||||
|  |             analyze_all_pipes(self.pipeline) | ||||||
|  |         return removed | ||||||
| 
 | 
 | ||||||
|     def __call__(self, text, disable=[], component_cfg=None): |     def __call__(self, text, disable=[], component_cfg=None): | ||||||
|         """Apply the pipeline to some text. The text can span multiple sentences, |         """Apply the pipeline to some text. The text can span multiple sentences, | ||||||
|  | @ -448,6 +445,8 @@ class Language(object): | ||||||
| 
 | 
 | ||||||
|         DOCS: https://spacy.io/api/language#disable_pipes |         DOCS: https://spacy.io/api/language#disable_pipes | ||||||
|         """ |         """ | ||||||
|  |         if len(names) == 1 and isinstance(names[0], (list, tuple)): | ||||||
|  |             names = names[0]  # support list of names instead of spread | ||||||
|         return DisabledPipes(self, *names) |         return DisabledPipes(self, *names) | ||||||
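Note on the hunk above: `disable_pipes` now accepts either spread arguments or a single list or tuple of names. The normalization step can be sketched on its own as follows. `normalize_names` is a hypothetical helper name used only for illustration.

```python
def normalize_names(*names):
    """Accept either normalize_names("a", "b") or normalize_names(["a", "b"])."""
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        names = names[0]  # a single list/tuple was passed instead of a spread
    return list(names)


print(normalize_names("tagger", "parser"))
print(normalize_names(["tagger", "parser"]))
```

Both calls yield the same list, so existing callers that spread their arguments keep working.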
| 
 | 
 | ||||||
|     def make_doc(self, text): |     def make_doc(self, text): | ||||||
|  | @ -999,6 +998,52 @@ class Language(object): | ||||||
|         return self |         return self | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | class component(object): | ||||||
|  |     """Decorator for pipeline components. Can decorate both function components | ||||||
|  |     and class components and will automatically register components in the | ||||||
|  |     Language.factories. If the component is a class and needs access to the | ||||||
|  |     nlp object or config parameters, it can expose a from_nlp classmethod | ||||||
|  |     that takes the nlp object and **cfg arguments and returns the initialized | ||||||
|  |     component. | ||||||
|  |     """ | ||||||
|  | 
 | ||||||
|  |     # NB: This decorator needs to live here, because it needs to write to | ||||||
|  |     # Language.factories. All other solutions would cause circular import. | ||||||
|  | 
 | ||||||
|  |     def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False): | ||||||
|  |         """Decorate a pipeline component. | ||||||
|  | 
 | ||||||
|  |         name (unicode): Default component and factory name. | ||||||
|  |         assigns (list): Attributes assigned by component, e.g. `["token.pos"]`. | ||||||
|  |         requires (list): Attributes required by component, e.g. `["token.dep"]`. | ||||||
|  |         retokenizes (bool): Whether the component changes the tokenization. | ||||||
|  |         """ | ||||||
|  |         self.name = name | ||||||
|  |         self.assigns = validate_attrs(assigns) | ||||||
|  |         self.requires = validate_attrs(requires) | ||||||
|  |         self.retokenizes = retokenizes | ||||||
|  | 
 | ||||||
|  |     def __call__(self, *args, **kwargs): | ||||||
|  |         obj = args[0] | ||||||
|  |         args = args[1:] | ||||||
|  |         factory_name = self.name or util.get_component_name(obj) | ||||||
|  |         obj.name = factory_name | ||||||
|  |         obj.factory = factory_name | ||||||
|  |         obj.assigns = self.assigns | ||||||
|  |         obj.requires = self.requires | ||||||
|  |         obj.retokenizes = self.retokenizes | ||||||
|  | 
 | ||||||
|  |         def factory(nlp, **cfg): | ||||||
|  |             if hasattr(obj, "from_nlp"): | ||||||
|  |                 return obj.from_nlp(nlp, **cfg) | ||||||
|  |             elif isinstance(obj, class_types): | ||||||
|  |                 return obj() | ||||||
|  |             return obj | ||||||
|  | 
 | ||||||
|  |         Language.factories[obj.factory] = factory | ||||||
|  |         return obj | ||||||
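Note on the `component` decorator above: it annotates the decorated object and registers a factory closure under the component name. A simplified standalone sketch of that pattern follows. `FACTORIES` is a hypothetical stand-in for `Language.factories`, attribute validation (`validate_attrs`) is omitted, and `lowercaser` and `Prefixer` are made-up components operating on plain lists of strings.

```python
FACTORIES = {}  # hypothetical stand-in for Language.factories


class component(object):
    def __init__(self, name=None, assigns=tuple(), requires=tuple()):
        self.name = name
        self.assigns = assigns
        self.requires = requires

    def __call__(self, obj):
        factory_name = self.name or getattr(obj, "__name__", repr(obj))
        obj.name = factory_name
        obj.factory = factory_name
        obj.assigns = self.assigns
        obj.requires = self.requires

        def factory(nlp, **cfg):
            if hasattr(obj, "from_nlp"):
                return obj.from_nlp(nlp, **cfg)  # class needs nlp or config
            elif isinstance(obj, type):
                return obj()  # class without special construction
            return obj  # plain function component

        FACTORIES[factory_name] = factory
        return obj


@component("lowercaser", assigns=["token.norm"])
def lowercaser(doc):
    return [token.lower() for token in doc]


@component("prefixer")
class Prefixer(object):
    def __init__(self, prefix=">"):
        self.prefix = prefix

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        return cls(**cfg)

    def __call__(self, doc):
        return [self.prefix + token for token in doc]


prefixer = FACTORIES["prefixer"](None, prefix="#")
print(prefixer(["a", "b"]))
```

Function components are returned as-is by their factory, while classes are instantiated, through `from_nlp` when they define it.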
|  | 
 | ||||||
|  | 
 | ||||||
| def _fix_pretrained_vectors_name(nlp): | def _fix_pretrained_vectors_name(nlp): | ||||||
|     # TODO: Replace this once we handle vectors consistently as static |     # TODO: Replace this once we handle vectors consistently as static | ||||||
|     # data |     # data | ||||||
|  |  | ||||||
|  | @ -102,7 +102,10 @@ cdef class DependencyMatcher: | ||||||
|                 visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True |                 visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True | ||||||
|             idx = idx + 1 |             idx = idx + 1 | ||||||
| 
 | 
 | ||||||
|     def add(self, key, on_match, *patterns): |     def add(self, key, patterns, *_patterns, on_match=None): | ||||||
|  |         if patterns is None or hasattr(patterns, "__call__"):  # old API | ||||||
|  |             on_match = patterns | ||||||
|  |             patterns = _patterns | ||||||
|         for pattern in patterns: |         for pattern in patterns: | ||||||
|             if len(pattern) == 0: |             if len(pattern) == 0: | ||||||
|                 raise ValueError(Errors.E012.format(key=key)) |                 raise ValueError(Errors.E012.format(key=key)) | ||||||
|  |  | ||||||
|  | @ -74,7 +74,7 @@ cdef class Matcher: | ||||||
|         """ |         """ | ||||||
|         return self._normalize_key(key) in self._patterns |         return self._normalize_key(key) in self._patterns | ||||||
| 
 | 
 | ||||||
|     def add(self, key, on_match, *patterns): |     def add(self, key, patterns, *_patterns, on_match=None): | ||||||
|         """Add a match-rule to the matcher. A match-rule consists of: an ID |         """Add a match-rule to the matcher. A match-rule consists of: an ID | ||||||
|         key, an on_match callback, and one or more patterns. |         key, an on_match callback, and one or more patterns. | ||||||
| 
 | 
 | ||||||
|  | @ -98,16 +98,29 @@ cdef class Matcher: | ||||||
|         operator will behave non-greedily. This quirk in the semantics makes |         operator will behave non-greedily. This quirk in the semantics makes | ||||||
|         the matcher more efficient, by avoiding the need for back-tracking. |         the matcher more efficient, by avoiding the need for back-tracking. | ||||||
| 
 | 
 | ||||||
|  |         As of spaCy v2.2.2, Matcher.add supports the future API, which makes | ||||||
|  |         the patterns the second argument and a list (instead of a variable | ||||||
|  |         number of arguments). The on_match callback becomes an optional keyword | ||||||
|  |         argument. | ||||||
|  | 
 | ||||||
|         key (unicode): The match ID. |         key (unicode): The match ID. | ||||||
|         on_match (callable): Callback executed on match. |         patterns (list): The patterns to add for the given key. | ||||||
|         *patterns (list): List of token descriptions. |         on_match (callable): Optional callback executed on match. | ||||||
|  |         *_patterns (list): For backwards compatibility: list of patterns to add | ||||||
|  |             as variable arguments. Will be ignored if a list of patterns is | ||||||
|  |             provided as the second argument. | ||||||
|         """ |         """ | ||||||
|         errors = {} |         errors = {} | ||||||
|         if on_match is not None and not hasattr(on_match, "__call__"): |         if on_match is not None and not hasattr(on_match, "__call__"): | ||||||
|             raise ValueError(Errors.E171.format(arg_type=type(on_match))) |             raise ValueError(Errors.E171.format(arg_type=type(on_match))) | ||||||
|  |         if patterns is None or hasattr(patterns, "__call__"):  # old API | ||||||
|  |             on_match = patterns | ||||||
|  |             patterns = _patterns | ||||||
|         for i, pattern in enumerate(patterns): |         for i, pattern in enumerate(patterns): | ||||||
|             if len(pattern) == 0: |             if len(pattern) == 0: | ||||||
|                 raise ValueError(Errors.E012.format(key=key)) |                 raise ValueError(Errors.E012.format(key=key)) | ||||||
|  |             if not isinstance(pattern, list): | ||||||
|  |                 raise ValueError(Errors.E178.format(pat=pattern, key=key)) | ||||||
|             if self.validator: |             if self.validator: | ||||||
|                 errors[i] = validate_json(pattern, self.validator) |                 errors[i] = validate_json(pattern, self.validator) | ||||||
|         if any(err for err in errors.values()): |         if any(err for err in errors.values()): | ||||||
|  |  | ||||||
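Note on the `Matcher.add` change above: the new signature takes the patterns as a list in second position and detects old-style calls by checking whether the second argument is `None` or callable. A rough standalone sketch of that argument handling follows. `resolve_add_args` is a hypothetical helper name and the error messages are placeholders for `Errors.E012` and `Errors.E178`.

```python
def resolve_add_args(key, patterns, *_patterns, on_match=None):
    """Return (patterns, on_match) for both the old and the new call style."""
    if patterns is None or hasattr(patterns, "__call__"):  # old API
        on_match = patterns
        patterns = _patterns
    for pattern in patterns:
        if len(pattern) == 0:
            raise ValueError("No patterns for key {}".format(key))
        if not isinstance(pattern, list):
            raise ValueError("Pattern for key {} must be a list".format(key))
    return list(patterns), on_match


def callback(*args):  # hypothetical on_match callback
    return None


pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]

# New style: list of patterns, optional keyword callback
print(resolve_add_args("GREETING", [pattern], on_match=callback))
# Old style: callback (or None) second, patterns as variable arguments
print(resolve_add_args("GREETING", None, pattern))
```

Passing a single pattern where a list of patterns is expected, for example `resolve_add_args("GREETING", pattern)`, iterates over the token dicts and raises the "must be a list" error.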
|  | @ -152,16 +152,27 @@ cdef class PhraseMatcher: | ||||||
|         del self._callbacks[key] |         del self._callbacks[key] | ||||||
|         del self._docs[key] |         del self._docs[key] | ||||||
| 
 | 
 | ||||||
|     def add(self, key, on_match, *docs): |     def add(self, key, docs, *_docs, on_match=None): | ||||||
|         """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID |         """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID | ||||||
|         key, an on_match callback, and one or more patterns. |         key, an on_match callback, and one or more patterns. | ||||||
| 
 | 
 | ||||||
|  |         As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which | ||||||
|  |         makes the patterns the second argument and a list (instead of a variable | ||||||
|  |         number of arguments). The on_match callback becomes an optional keyword | ||||||
|  |         argument. | ||||||
|  | 
 | ||||||
|         key (unicode): The match ID. |         key (unicode): The match ID. | ||||||
|  |         docs (list): List of `Doc` objects representing match patterns. | ||||||
|         on_match (callable): Callback executed on match. |         on_match (callable): Callback executed on match. | ||||||
|         *docs (Doc): `Doc` objects representing match patterns. |         *_docs (Doc): For backwards compatibility: list of patterns to add | ||||||
|  |             as variable arguments. Will be ignored if a list of patterns is | ||||||
|  |             provided as the second argument. | ||||||
| 
 | 
 | ||||||
|         DOCS: https://spacy.io/api/phrasematcher#add |         DOCS: https://spacy.io/api/phrasematcher#add | ||||||
|         """ |         """ | ||||||
|  |         if docs is None or hasattr(docs, "__call__"):  # old API | ||||||
|  |             on_match = docs | ||||||
|  |             docs = _docs | ||||||
| 
 | 
 | ||||||
|         _ = self.vocab[key] |         _ = self.vocab[key] | ||||||
|         self._callbacks[key] = on_match |         self._callbacks[key] = on_match | ||||||
|  | @ -171,6 +182,8 @@ cdef class PhraseMatcher: | ||||||
|         cdef MapStruct* internal_node |         cdef MapStruct* internal_node | ||||||
|         cdef void* result |         cdef void* result | ||||||
| 
 | 
 | ||||||
|  |         if isinstance(docs, Doc): | ||||||
|  |             raise ValueError(Errors.E179.format(key=key)) | ||||||
|         for doc in docs: |         for doc in docs: | ||||||
|             if len(doc) == 0: |             if len(doc) == 0: | ||||||
|                 continue |                 continue | ||||||
|  |  | ||||||
							
								
								
									
spacy/ml/__init__.py (Normal file, 5 lines changed)
							|  | @ -0,0 +1,5 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from .tok2vec import Tok2Vec  # noqa: F401 | ||||||
|  | from .common import FeedForward, LayerNormalizedMaxout  # noqa: F401 | ||||||
							
								
								
									
spacy/ml/_legacy_tok2vec.py (Normal file, 131 lines changed)
							|  | @ -0,0 +1,131 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | from thinc.v2v import Model, Maxout | ||||||
|  | from thinc.i2v import HashEmbed, StaticVectors | ||||||
|  | from thinc.t2t import ExtractWindow | ||||||
|  | from thinc.misc import Residual | ||||||
|  | from thinc.misc import LayerNorm as LN | ||||||
|  | from thinc.misc import FeatureExtracter | ||||||
|  | from thinc.api import layerize, chain, clone, concatenate, with_flatten | ||||||
|  | from thinc.api import uniqued, wrap, noop | ||||||
|  | 
 | ||||||
|  | from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def Tok2Vec(width, embed_size, **kwargs): | ||||||
|  |     # Circular imports :( | ||||||
|  |     from .._ml import CharacterEmbed | ||||||
|  |     from .._ml import PyTorchBiLSTM | ||||||
|  | 
 | ||||||
|  |     pretrained_vectors = kwargs.get("pretrained_vectors", None) | ||||||
|  |     cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) | ||||||
|  |     subword_features = kwargs.get("subword_features", True) | ||||||
|  |     char_embed = kwargs.get("char_embed", False) | ||||||
|  |     if char_embed: | ||||||
|  |         subword_features = False | ||||||
|  |     conv_depth = kwargs.get("conv_depth", 4) | ||||||
|  |     bilstm_depth = kwargs.get("bilstm_depth", 0) | ||||||
|  |     cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] | ||||||
|  |     with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): | ||||||
|  |         norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm") | ||||||
|  |         if subword_features: | ||||||
|  |             prefix = HashEmbed( | ||||||
|  |                 width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix" | ||||||
|  |             ) | ||||||
|  |             suffix = HashEmbed( | ||||||
|  |                 width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix" | ||||||
|  |             ) | ||||||
|  |             shape = HashEmbed( | ||||||
|  |                 width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape" | ||||||
|  |             ) | ||||||
|  |         else: | ||||||
|  |             prefix, suffix, shape = (None, None, None) | ||||||
|  |         if pretrained_vectors is not None: | ||||||
|  |             glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) | ||||||
|  | 
 | ||||||
|  |             if subword_features: | ||||||
|  |                 embed = uniqued( | ||||||
|  |                     (glove | norm | prefix | suffix | shape) | ||||||
|  |                     >> LN(Maxout(width, width * 5, pieces=3)), | ||||||
|  |                     column=cols.index(ORTH), | ||||||
|  |                 ) | ||||||
|  |             else: | ||||||
|  |                 embed = uniqued( | ||||||
|  |                     (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), | ||||||
|  |                     column=cols.index(ORTH), | ||||||
|  |                 ) | ||||||
|  |         elif subword_features: | ||||||
|  |             embed = uniqued( | ||||||
|  |                 (norm | prefix | suffix | shape) | ||||||
|  |                 >> LN(Maxout(width, width * 4, pieces=3)), | ||||||
|  |                 column=cols.index(ORTH), | ||||||
|  |             ) | ||||||
|  |         elif char_embed: | ||||||
|  |             embed = concatenate_lists( | ||||||
|  |                 CharacterEmbed(nM=64, nC=8), | ||||||
|  |                 FeatureExtracter(cols) >> with_flatten(norm), | ||||||
|  |             ) | ||||||
|  |             reduce_dimensions = LN( | ||||||
|  |                 Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) | ||||||
|  |             ) | ||||||
|  |         else: | ||||||
|  |             embed = norm | ||||||
|  | 
 | ||||||
|  |         convolution = Residual( | ||||||
|  |             ExtractWindow(nW=1) | ||||||
|  |             >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) | ||||||
|  |         ) | ||||||
|  |         if char_embed: | ||||||
|  |             tok2vec = embed >> with_flatten( | ||||||
|  |                 reduce_dimensions >> convolution ** conv_depth, pad=conv_depth | ||||||
|  |             ) | ||||||
|  |         else: | ||||||
|  |             tok2vec = FeatureExtracter(cols) >> with_flatten( | ||||||
|  |                 embed >> convolution ** conv_depth, pad=conv_depth | ||||||
|  |             ) | ||||||
|  | 
 | ||||||
|  |         if bilstm_depth >= 1: | ||||||
|  |             tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) | ||||||
|  |         # Work around thinc API limitations :(. TODO: Revise in Thinc 7 | ||||||
|  |         tok2vec.nO = width | ||||||
|  |         tok2vec.embed = embed | ||||||
|  |     return tok2vec | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @layerize | ||||||
|  | def flatten(seqs, drop=0.0): | ||||||
|  |     ops = Model.ops | ||||||
|  |     lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") | ||||||
|  | 
 | ||||||
|  |     def finish_update(d_X, sgd=None): | ||||||
|  |         return ops.unflatten(d_X, lengths, pad=0) | ||||||
|  | 
 | ||||||
|  |     X = ops.flatten(seqs, pad=0) | ||||||
|  |     return X, finish_update | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def concatenate_lists(*layers, **kwargs):  # pragma: no cover | ||||||
|  |     """Compose two or more models `f`, `g`, etc, such that their outputs are | ||||||
|  |     concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` | ||||||
|  |     """ | ||||||
|  |     if not layers: | ||||||
|  |         return noop() | ||||||
|  |     drop_factor = kwargs.get("drop_factor", 1.0) | ||||||
|  |     ops = layers[0].ops | ||||||
|  |     layers = [chain(layer, flatten) for layer in layers] | ||||||
|  |     concat = concatenate(*layers) | ||||||
|  | 
 | ||||||
|  |     def concatenate_lists_fwd(Xs, drop=0.0): | ||||||
|  |         if drop is not None: | ||||||
|  |             drop *= drop_factor | ||||||
|  |         lengths = ops.asarray([len(X) for X in Xs], dtype="i") | ||||||
|  |         flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) | ||||||
|  |         ys = ops.unflatten(flat_y, lengths) | ||||||
|  | 
 | ||||||
|  |         def concatenate_lists_bwd(d_ys, sgd=None): | ||||||
|  |             return bp_flat_y(ops.flatten(d_ys), sgd=sgd) | ||||||
|  | 
 | ||||||
|  |         return ys, concatenate_lists_bwd | ||||||
|  | 
 | ||||||
|  |     model = wrap(concatenate_lists_fwd, concat) | ||||||
|  |     return model | ||||||
							
								
								
									
spacy/ml/_wire.py (Normal file, 42 lines changed)
							|  | @ -0,0 +1,42 @@ | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | from thinc.api import layerize, wrap, noop, chain, concatenate | ||||||
|  | from thinc.v2v import Model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def concatenate_lists(*layers, **kwargs):  # pragma: no cover | ||||||
|  |     """Compose two or more models `f`, `g`, etc, such that their outputs are | ||||||
|  |     concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` | ||||||
|  |     """ | ||||||
|  |     if not layers: | ||||||
|  |         return layerize(noop()) | ||||||
|  |     drop_factor = kwargs.get("drop_factor", 1.0) | ||||||
|  |     ops = layers[0].ops | ||||||
|  |     layers = [chain(layer, flatten) for layer in layers] | ||||||
|  |     concat = concatenate(*layers) | ||||||
|  | 
 | ||||||
|  |     def concatenate_lists_fwd(Xs, drop=0.0): | ||||||
|  |         if drop is not None: | ||||||
|  |             drop *= drop_factor | ||||||
|  |         lengths = ops.asarray([len(X) for X in Xs], dtype="i") | ||||||
|  |         flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) | ||||||
|  |         ys = ops.unflatten(flat_y, lengths) | ||||||
|  | 
 | ||||||
|  |         def concatenate_lists_bwd(d_ys, sgd=None): | ||||||
|  |             return bp_flat_y(ops.flatten(d_ys), sgd=sgd) | ||||||
|  | 
 | ||||||
|  |         return ys, concatenate_lists_bwd | ||||||
|  | 
 | ||||||
|  |     model = wrap(concatenate_lists_fwd, concat) | ||||||
|  |     return model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @layerize | ||||||
|  | def flatten(seqs, drop=0.0): | ||||||
|  |     ops = Model.ops | ||||||
|  |     lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") | ||||||
|  | 
 | ||||||
|  |     def finish_update(d_X, sgd=None): | ||||||
|  |         return ops.unflatten(d_X, lengths, pad=0) | ||||||
|  | 
 | ||||||
|  |     X = ops.flatten(seqs, pad=0) | ||||||
|  |     return X, finish_update | ||||||
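The `flatten` layer above pairs a forward pass (concatenate variable-length sequences) with a callback that splits gradients back into the original lengths. A minimal pure-Python sketch of that idea follows. It uses plain lists instead of `ops` arrays and ignores padding; `flatten_sketch` and `unflatten` are illustrative names, not part of thinc or spaCy.

```python
def flatten_sketch(seqs):
    """Concatenate variable-length sequences into one flat list.

    Returns the flat list plus a callback that splits a flat list of the
    same total length (for example, gradients) back into per-sequence lists.
    """
    # Remember each sequence length so the backward step can undo the flatten.
    lengths = [len(seq) for seq in seqs]
    flat = [item for seq in seqs for item in seq]

    def unflatten(flat_values):
        out = []
        start = 0
        for length in lengths:
            out.append(flat_values[start : start + length])
            start += length
        return out

    return flat, unflatten


seqs = [[1, 2, 3], [4], [5, 6]]
flat, unflatten = flatten_sketch(seqs)
restored = unflatten(flat)
grads = unflatten([10, 20, 30, 40, 50, 60])
```

The closure over `lengths` is the same design choice as `finish_update` in the diff: the backward callback needs nothing beyond what the forward pass already computed.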
							
								
								
									
										23
									
								
								spacy/ml/common.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,23 @@ | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from thinc.api import chain | ||||||
|  | from thinc.v2v import Maxout | ||||||
|  | from thinc.misc import LayerNorm | ||||||
|  | from ..util import register_architecture, make_layer | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("thinc.FeedForward.v1") | ||||||
|  | def FeedForward(config): | ||||||
|  |     layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]] | ||||||
|  |     model = chain(*layers) | ||||||
|  |     model.cfg = config | ||||||
|  |     return model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.LayerNormalizedMaxout.v1") | ||||||
|  | def LayerNormalizedMaxout(config): | ||||||
|  |     width = config["width"] | ||||||
|  |     pieces = config["pieces"] | ||||||
|  |     layer = LayerNorm(Maxout(width, pieces=pieces)) | ||||||
|  |     layer.nO = width | ||||||
|  |     return layer | ||||||
							
								
								
									
										176
									
								
								spacy/ml/tok2vec.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,176 @@ | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued | ||||||
|  | from thinc.api import noop, with_square_sequences | ||||||
|  | from thinc.v2v import Maxout, Model | ||||||
|  | from thinc.i2v import HashEmbed, StaticVectors | ||||||
|  | from thinc.t2t import ExtractWindow | ||||||
|  | from thinc.misc import Residual, LayerNorm, FeatureExtracter | ||||||
|  | from ..util import make_layer, register_architecture | ||||||
|  | from ._wire import concatenate_lists | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.Tok2Vec.v1") | ||||||
|  | def Tok2Vec(config): | ||||||
|  |     doc2feats = make_layer(config["@doc2feats"]) | ||||||
|  |     embed = make_layer(config["@embed"]) | ||||||
|  |     encode = make_layer(config["@encode"]) | ||||||
|  |     field_size = getattr(encode, "receptive_field", 0) | ||||||
|  |     tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size)) | ||||||
|  |     tok2vec.cfg = config | ||||||
|  |     tok2vec.nO = encode.nO | ||||||
|  |     tok2vec.embed = embed | ||||||
|  |     tok2vec.encode = encode | ||||||
|  |     return tok2vec | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.Doc2Feats.v1") | ||||||
|  | def Doc2Feats(config): | ||||||
|  |     columns = config["columns"] | ||||||
|  |     return FeatureExtracter(columns) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.MultiHashEmbed.v1") | ||||||
|  | def MultiHashEmbed(config): | ||||||
|  |     # For backwards compatibility with models before the architecture registry, | ||||||
|  |     # we have to be careful to get exactly the same model structure. One subtle | ||||||
|  |     # trick is that when we define concatenation with the operator, the operator | ||||||
|  |     # is binary and left-associative. So when we write (a | b | c), we're actually | ||||||
|  |     # getting concatenate(concatenate(a, b), c). That's why the implementation | ||||||
|  |     # is a bit ugly here. | ||||||
|  |     cols = config["columns"] | ||||||
|  |     width = config["width"] | ||||||
|  |     rows = config["rows"] | ||||||
|  | 
 | ||||||
|  |     norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm") | ||||||
|  |     if config["use_subwords"]: | ||||||
|  |         prefix = HashEmbed( | ||||||
|  |             width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix" | ||||||
|  |         ) | ||||||
|  |         suffix = HashEmbed( | ||||||
|  |             width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix" | ||||||
|  |         ) | ||||||
|  |         shape = HashEmbed( | ||||||
|  |             width, rows // 2, column=cols.index("SHAPE"), name="embed_shape" | ||||||
|  |         ) | ||||||
|  |     if config.get("@pretrained_vectors"): | ||||||
|  |         glove = make_layer(config["@pretrained_vectors"]) | ||||||
|  |     mix = make_layer(config["@mix"]) | ||||||
|  | 
 | ||||||
|  |     with Model.define_operators({">>": chain, "|": concatenate}): | ||||||
|  |         if config["use_subwords"] and config.get("@pretrained_vectors"): | ||||||
|  |             mix._layers[0].nI = width * 5 | ||||||
|  |             layer = uniqued( | ||||||
|  |                 (glove | norm | prefix | suffix | shape) >> mix, | ||||||
|  |                 column=cols.index("ORTH"), | ||||||
|  |             ) | ||||||
|  |         elif config["use_subwords"]: | ||||||
|  |             mix._layers[0].nI = width * 4 | ||||||
|  |             layer = uniqued( | ||||||
|  |                 (norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH") | ||||||
|  |             ) | ||||||
|  |         elif config.get("@pretrained_vectors"): | ||||||
|  |             mix._layers[0].nI = width * 2 | ||||||
|  |             layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),) | ||||||
|  |         else: | ||||||
|  |             layer = norm | ||||||
|  |     layer.cfg = config | ||||||
|  |     return layer | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.CharacterEmbed.v1") | ||||||
|  | def CharacterEmbed(config): | ||||||
|  |     from .. import _ml | ||||||
|  | 
 | ||||||
|  |     width = config["width"] | ||||||
|  |     chars = config["chars"] | ||||||
|  | 
 | ||||||
|  |     chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars) | ||||||
|  |     other_tables = make_layer(config["@embed_features"]) | ||||||
|  |     mix = make_layer(config["@mix"]) | ||||||
|  | 
 | ||||||
|  |     model = chain(concatenate_lists(chr_embed, other_tables), mix) | ||||||
|  |     model.cfg = config | ||||||
|  |     return model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.MaxoutWindowEncoder.v1") | ||||||
|  | def MaxoutWindowEncoder(config): | ||||||
|  |     nO = config["width"] | ||||||
|  |     nW = config["window_size"] | ||||||
|  |     nP = config["pieces"] | ||||||
|  |     depth = config["depth"] | ||||||
|  | 
 | ||||||
|  |     cnn = chain( | ||||||
|  |         ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP)) | ||||||
|  |     ) | ||||||
|  |     model = clone(Residual(cnn), depth) | ||||||
|  |     model.nO = nO | ||||||
|  |     model.receptive_field = nW * depth | ||||||
|  |     return model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.MishWindowEncoder.v1") | ||||||
|  | def MishWindowEncoder(config): | ||||||
|  |     from thinc.v2v import Mish | ||||||
|  | 
 | ||||||
|  |     nO = config["width"] | ||||||
|  |     nW = config["window_size"] | ||||||
|  |     depth = config["depth"] | ||||||
|  | 
 | ||||||
|  |     cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1)))) | ||||||
|  |     model = clone(Residual(cnn), depth) | ||||||
|  |     model.nO = nO | ||||||
|  |     return model | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.PretrainedVectors.v1") | ||||||
|  | def PretrainedVectors(config): | ||||||
|  |     return StaticVectors(config["vectors_name"], config["width"], config["column"]) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @register_architecture("spacy.TorchBiLSTMEncoder.v1") | ||||||
|  | def TorchBiLSTMEncoder(config): | ||||||
|  |     import torch.nn | ||||||
|  |     from thinc.extra.wrappers import PyTorchWrapperRNN | ||||||
|  | 
 | ||||||
|  |     width = config["width"] | ||||||
|  |     depth = config["depth"] | ||||||
|  |     if depth == 0: | ||||||
|  |         return layerize(noop()) | ||||||
|  |     return with_square_sequences( | ||||||
|  |         PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True)) | ||||||
|  |     ) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | _EXAMPLE_CONFIG = { | ||||||
|  |     "@doc2feats": { | ||||||
|  |         "arch": "spacy.Doc2Feats.v1", | ||||||
|  |         "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]}, | ||||||
|  |     }, | ||||||
|  |     "@embed": { | ||||||
|  |         "arch": "spacy.MultiHashEmbed.v1", | ||||||
|  |         "config": { | ||||||
|  |             "width": 96, | ||||||
|  |             "rows": 2000, | ||||||
|  |             "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"], | ||||||
|  |             "use_subwords": True, | ||||||
|  |             "@pretrained_vectors": { | ||||||
|  |                 "arch": "spacy.PretrainedVectors.v1", | ||||||
|  |                 "config": { | ||||||
|  |                     "vectors_name": "en_vectors_web_lg.vectors", | ||||||
|  |                     "width": 96, | ||||||
|  |                     "column": 0, | ||||||
|  |                 }, | ||||||
|  |             }, | ||||||
|  |             "@mix": { | ||||||
|  |                 "arch": "spacy.LayerNormalizedMaxout.v1", | ||||||
|  |                 "config": {"width": 96, "pieces": 3}, | ||||||
|  |             }, | ||||||
|  |         }, | ||||||
|  |     }, | ||||||
|  |     "@encode": { | ||||||
|  |         "arch": "spacy.MaxoutWindowEncoder.v1", | ||||||
|  |         "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3}, | ||||||
|  |     }, | ||||||
|  | } | ||||||
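The architectures in this file are looked up by name and built from nested config blocks, with sub-layers stored under `@`-prefixed keys. The sketch below shows one way such a registry could work. The `register_architecture` and `make_layer` defined here are simplified stand-ins written for illustration, so the real helpers in `..util` may behave differently, and the `demo.*` names and dict-based "models" are hypothetical.

```python
ARCHITECTURES = {}  # hypothetical registry: architecture name -> factory


def register_architecture(name):
    """Stand-in decorator that records a factory function under `name`."""

    def decorator(func):
        ARCHITECTURES[name] = func
        return func

    return decorator


def make_layer(arch_config):
    """Stand-in builder: look up the factory and call it with its config."""
    factory = ARCHITECTURES[arch_config["arch"]]
    return factory(arch_config["config"])


@register_architecture("demo.WindowEncoder.v1")
def WindowEncoder(config):
    # Each of `depth` layers sees `window_size` tokens on either side,
    # so the receptive field grows as window_size * depth.
    return {
        "width": config["width"],
        "receptive_field": config["window_size"] * config["depth"],
    }


@register_architecture("demo.Tok2Vec.v1")
def Tok2Vec(config):
    # Sub-layers live under "@"-prefixed keys and are built recursively.
    encode = make_layer(config["@encode"])
    return {"encode": encode, "pad": encode.get("receptive_field", 0)}


model = make_layer(
    {
        "arch": "demo.Tok2Vec.v1",
        "config": {
            "@encode": {
                "arch": "demo.WindowEncoder.v1",
                "config": {"width": 96, "window_size": 1, "depth": 4},
            }
        },
    }
)
```

In this sketch an unregistered `"arch"` string raises `KeyError`, so names in a config have to match the registered strings exactly.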
|  | @ -4,6 +4,7 @@ from __future__ import unicode_literals | ||||||
| from collections import defaultdict, OrderedDict | from collections import defaultdict, OrderedDict | ||||||
| import srsly | import srsly | ||||||
| 
 | 
 | ||||||
|  | from ..language import component | ||||||
| from ..errors import Errors | from ..errors import Errors | ||||||
| from ..compat import basestring_ | from ..compat import basestring_ | ||||||
| from ..util import ensure_path, to_disk, from_disk | from ..util import ensure_path, to_disk, from_disk | ||||||
|  | @ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher | ||||||
| DEFAULT_ENT_ID_SEP = "||" | DEFAULT_ENT_ID_SEP = "||" | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"]) | ||||||
| class EntityRuler(object): | class EntityRuler(object): | ||||||
|     """The EntityRuler lets you add spans to the `Doc.ents` using token-based |     """The EntityRuler lets you add spans to the `Doc.ents` using token-based | ||||||
|     rules or exact phrase matches. It can be combined with the statistical |     rules or exact phrase matches. It can be combined with the statistical | ||||||
|  | @ -24,8 +26,6 @@ class EntityRuler(object): | ||||||
|     USAGE: https://spacy.io/usage/rule-based-matching#entityruler |     USAGE: https://spacy.io/usage/rule-based-matching#entityruler | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "entity_ruler" |  | ||||||
| 
 |  | ||||||
|     def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg): |     def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg): | ||||||
|         """Initialize the entity ruler. If patterns are supplied here, they |         """Initialize the entity ruler. If patterns are supplied here, they | ||||||
|         need to be a list of dictionaries with a `"label"` and `"pattern"` |         need to be a list of dictionaries with a `"label"` and `"pattern"` | ||||||
|  | @ -64,10 +64,15 @@ class EntityRuler(object): | ||||||
|             self.phrase_matcher_attr = None |             self.phrase_matcher_attr = None | ||||||
|             self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate) |             self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate) | ||||||
|         self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) |         self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) | ||||||
|  |         self._ent_ids = defaultdict(dict) | ||||||
|         patterns = cfg.get("patterns") |         patterns = cfg.get("patterns") | ||||||
|         if patterns is not None: |         if patterns is not None: | ||||||
|             self.add_patterns(patterns) |             self.add_patterns(patterns) | ||||||
| 
 | 
 | ||||||
|  |     @classmethod | ||||||
|  |     def from_nlp(cls, nlp, **cfg): | ||||||
|  |         return cls(nlp, **cfg) | ||||||
|  | 
 | ||||||
|     def __len__(self): |     def __len__(self): | ||||||
|         """The number of all patterns added to the entity ruler.""" |         """The number of all patterns added to the entity ruler.""" | ||||||
|         n_token_patterns = sum(len(p) for p in self.token_patterns.values()) |         n_token_patterns = sum(len(p) for p in self.token_patterns.values()) | ||||||
|  | @ -100,10 +105,9 @@ class EntityRuler(object): | ||||||
|                 continue |                 continue | ||||||
|             # check for end - 1 here because boundaries are inclusive |             # check for end - 1 here because boundaries are inclusive | ||||||
|             if start not in seen_tokens and end - 1 not in seen_tokens: |             if start not in seen_tokens and end - 1 not in seen_tokens: | ||||||
|                 if self.ent_ids: |                 if match_id in self._ent_ids: | ||||||
|                     label_ = self.nlp.vocab.strings[match_id] |                     label, ent_id = self._ent_ids[match_id] | ||||||
|                     ent_label, ent_id = self._split_label(label_) |                     span = Span(doc, start, end, label=label) | ||||||
|                     span = Span(doc, start, end, label=ent_label) |  | ||||||
|                     if ent_id: |                     if ent_id: | ||||||
|                         for token in span: |                         for token in span: | ||||||
|                             token.ent_id_ = ent_id |                             token.ent_id_ = ent_id | ||||||
|  | @ -131,11 +135,11 @@ class EntityRuler(object): | ||||||
| 
 | 
 | ||||||
|     @property |     @property | ||||||
|     def ent_ids(self): |     def ent_ids(self): | ||||||
|         """All entity ids present in the match patterns meta dicts. |         """All entity ids present in the match patterns `id` properties. | ||||||
| 
 | 
 | ||||||
|         RETURNS (set): The string entity ids. |         RETURNS (set): The string entity ids. | ||||||
| 
 | 
 | ||||||
|         DOCS: https://spacy.io/api/entityruler#labels |         DOCS: https://spacy.io/api/entityruler#ent_ids | ||||||
|         """ |         """ | ||||||
|         all_ent_ids = set() |         all_ent_ids = set() | ||||||
|         for l in self.labels: |         for l in self.labels: | ||||||
|  | @ -147,7 +151,6 @@ class EntityRuler(object): | ||||||
|     @property |     @property | ||||||
|     def patterns(self): |     def patterns(self): | ||||||
|         """Get all patterns that were added to the entity ruler. |         """Get all patterns that were added to the entity ruler. | ||||||
| 
 |  | ||||||
|         RETURNS (list): The original patterns, one dictionary per pattern. |         RETURNS (list): The original patterns, one dictionary per pattern. | ||||||
| 
 | 
 | ||||||
|         DOCS: https://spacy.io/api/entityruler#patterns |         DOCS: https://spacy.io/api/entityruler#patterns | ||||||
|  | @ -188,11 +191,15 @@ class EntityRuler(object): | ||||||
|             ] |             ] | ||||||
|         except ValueError: |         except ValueError: | ||||||
|             subsequent_pipes = [] |             subsequent_pipes = [] | ||||||
|         with self.nlp.disable_pipes(*subsequent_pipes): |         with self.nlp.disable_pipes(subsequent_pipes): | ||||||
|             for entry in patterns: |             for entry in patterns: | ||||||
|                 label = entry["label"] |                 label = entry["label"] | ||||||
|                 if "id" in entry: |                 if "id" in entry: | ||||||
|  |                     ent_label = label | ||||||
|                     label = self._create_label(label, entry["id"]) |                     label = self._create_label(label, entry["id"]) | ||||||
|  |                     key = self.matcher._normalize_key(label) | ||||||
|  |                     self._ent_ids[key] = (ent_label, entry["id"]) | ||||||
|  | 
 | ||||||
|                 pattern = entry["pattern"] |                 pattern = entry["pattern"] | ||||||
|                 if isinstance(pattern, basestring_): |                 if isinstance(pattern, basestring_): | ||||||
|                     self.phrase_patterns[label].append(self.nlp(pattern)) |                     self.phrase_patterns[label].append(self.nlp(pattern)) | ||||||
|  | @ -201,9 +208,9 @@ class EntityRuler(object): | ||||||
|                 else: |                 else: | ||||||
|                     raise ValueError(Errors.E097.format(pattern=pattern)) |                     raise ValueError(Errors.E097.format(pattern=pattern)) | ||||||
|             for label, patterns in self.token_patterns.items(): |             for label, patterns in self.token_patterns.items(): | ||||||
|                 self.matcher.add(label, None, *patterns) |                 self.matcher.add(label, patterns) | ||||||
|             for label, patterns in self.phrase_patterns.items(): |             for label, patterns in self.phrase_patterns.items(): | ||||||
|                 self.phrase_matcher.add(label, None, *patterns) |                 self.phrase_matcher.add(label, patterns) | ||||||
| 
 | 
 | ||||||
|     def _split_label(self, label): |     def _split_label(self, label): | ||||||
|         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep |         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep | ||||||
|  |  | ||||||
|  | @ -1,9 +1,16 @@ | ||||||
| # coding: utf8 | # coding: utf8 | ||||||
| from __future__ import unicode_literals | from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
|  | from ..language import component | ||||||
| from ..matcher import Matcher | from ..matcher import Matcher | ||||||
|  | from ..util import filter_spans | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component( | ||||||
|  |     "merge_noun_chunks", | ||||||
|  |     requires=["token.dep", "token.tag", "token.pos"], | ||||||
|  |     retokenizes=True, | ||||||
|  | ) | ||||||
| def merge_noun_chunks(doc): | def merge_noun_chunks(doc): | ||||||
|     """Merge noun chunks into a single token. |     """Merge noun chunks into a single token. | ||||||
| 
 | 
 | ||||||
|  | @ -21,6 +28,11 @@ def merge_noun_chunks(doc): | ||||||
|     return doc |     return doc | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component( | ||||||
|  |     "merge_entities", | ||||||
|  |     requires=["doc.ents", "token.ent_iob", "token.ent_type"], | ||||||
|  |     retokenizes=True, | ||||||
|  | ) | ||||||
| def merge_entities(doc): | def merge_entities(doc): | ||||||
|     """Merge entities into a single token. |     """Merge entities into a single token. | ||||||
| 
 | 
 | ||||||
|  | @ -36,6 +48,7 @@ def merge_entities(doc): | ||||||
|     return doc |     return doc | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("merge_subtokens", requires=["token.dep"], retokenizes=True) | ||||||
| def merge_subtokens(doc, label="subtok"): | def merge_subtokens(doc, label="subtok"): | ||||||
|     """Merge subtokens into a single token. |     """Merge subtokens into a single token. | ||||||
| 
 | 
 | ||||||
|  | @ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"): | ||||||
|     merger = Matcher(doc.vocab) |     merger = Matcher(doc.vocab) | ||||||
|     merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) |     merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) | ||||||
|     matches = merger(doc) |     matches = merger(doc) | ||||||
|     spans = [doc[start : end + 1] for _, start, end in matches] |     spans = filter_spans([doc[start : end + 1] for _, start, end in matches]) | ||||||
|     with doc.retokenize() as retokenizer: |     with doc.retokenize() as retokenizer: | ||||||
|         for span in spans: |         for span in spans: | ||||||
|             retokenizer.merge(span) |             retokenizer.merge(span) | ||||||
|  |  | ||||||
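`merge_subtokens` now passes its matches through `filter_spans` before retokenizing, presumably because overlapping spans cannot all be merged in one pass. The sketch below illustrates one plausible filtering policy on plain `(start, end)` tuples: prefer longer spans, break ties by earlier start, and drop anything that overlaps a span already kept. It is an assumption-laden illustration, not spaCy's implementation, and `filter_overlapping` is a hypothetical name.

```python
def filter_overlapping(spans):
    """Keep a non-overlapping subset of (start, end) spans, longest first."""
    # Longer spans win; among equal lengths, the earlier start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # `end` is exclusive, so the last covered token is end - 1.
        # Because spans are visited longest first, checking both endpoints
        # is enough to detect any overlap with a kept span.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


result = filter_overlapping([(0, 2), (1, 4), (5, 6), (3, 5)])
```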
|  | @ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool | ||||||
| from thinc.neural._classes.difference import Siamese, CauchySimilarity | from thinc.neural._classes.difference import Siamese, CauchySimilarity | ||||||
| 
 | 
 | ||||||
| from .pipes import Pipe | from .pipes import Pipe | ||||||
|  | from ..language import component | ||||||
| from .._ml import link_vectors_to_models | from .._ml import link_vectors_to_models | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("sentencizer_hook", assigns=["doc.user_hooks"]) | ||||||
| class SentenceSegmenter(object): | class SentenceSegmenter(object): | ||||||
|     """A simple spaCy hook, to allow custom sentence boundary detection logic |     """A simple spaCy hook, to allow custom sentence boundary detection logic | ||||||
|     (that doesn't require the dependency parse). To change the sentence |     (that doesn't require the dependency parse). To change the sentence | ||||||
|  | @ -17,8 +19,6 @@ class SentenceSegmenter(object): | ||||||
|     and yield `Span` objects for each sentence. |     and yield `Span` objects for each sentence. | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "sentencizer" |  | ||||||
| 
 |  | ||||||
|     def __init__(self, vocab, strategy=None): |     def __init__(self, vocab, strategy=None): | ||||||
|         self.vocab = vocab |         self.vocab = vocab | ||||||
|         if strategy is None or strategy == "on_punct": |         if strategy is None or strategy == "on_punct": | ||||||
|  | @ -44,6 +44,7 @@ class SentenceSegmenter(object): | ||||||
|             yield doc[start : len(doc)] |             yield doc[start : len(doc)] | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("similarity", assigns=["doc.user_hooks"]) | ||||||
| class SimilarityHook(Pipe): | class SimilarityHook(Pipe): | ||||||
|     """ |     """ | ||||||
|     Experimental: A pipeline component to install a hook for supervised |     Experimental: A pipeline component to install a hook for supervised | ||||||
|  | @ -58,8 +59,6 @@ class SimilarityHook(Pipe): | ||||||
|     Where W is a vector of dimension weights, initialized to 1. |     Where W is a vector of dimension weights, initialized to 1. | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "similarity" |  | ||||||
| 
 |  | ||||||
|     def __init__(self, vocab, model=True, **cfg): |     def __init__(self, vocab, model=True, **cfg): | ||||||
|         self.vocab = vocab |         self.vocab = vocab | ||||||
|         self.model = model |         self.model = model | ||||||
|  |  | ||||||
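Across these files the class-level `name = "..."` attributes are replaced by a `@component(...)` decorator, while the `cdef` classes set `name`, `factory`, `assigns` and `requires` by hand. A rough sketch of a decorator that would produce the same attributes is shown below. It is a guess at the mechanism for illustration only: the real `component` in `..language` may do more, such as registering a factory, and the `demo_*` names are hypothetical.

```python
def component(name, assigns=(), requires=(), retokenizes=False):
    """Attach pipeline metadata to a function or class (illustrative only)."""

    def decorator(obj):
        obj.name = name
        obj.factory = name
        obj.assigns = list(assigns)
        obj.requires = list(requires)
        obj.retokenizes = retokenizes
        return obj

    return decorator


@component("demo_merge", requires=["token.dep"], retokenizes=True)
def demo_merge(doc):
    # A function component: takes a doc and returns it.
    return doc


@component("demo_ruler", assigns=["doc.ents"])
class DemoRuler(object):
    # A class component: metadata lives on the class itself.
    def __call__(self, doc):
        return doc
```

Keeping the metadata next to the definition lets the pipeline check declared `requires` against the `assigns` of earlier components without instantiating anything.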
|  | @ -8,6 +8,7 @@ from thinc.api import chain | ||||||
| from thinc.neural.util import to_categorical, copy_array, get_array_module | from thinc.neural.util import to_categorical, copy_array, get_array_module | ||||||
| from .. import util | from .. import util | ||||||
| from .pipes import Pipe | from .pipes import Pipe | ||||||
|  | from ..language import component | ||||||
| from .._ml import Tok2Vec, build_morphologizer_model | from .._ml import Tok2Vec, build_morphologizer_model | ||||||
| from .._ml import link_vectors_to_models, zero_init, flatten | from .._ml import link_vectors_to_models, zero_init, flatten | ||||||
| from .._ml import create_default_optimizer | from .._ml import create_default_optimizer | ||||||
|  | @ -18,8 +19,8 @@ from ..vocab cimport Vocab | ||||||
| from ..morphology cimport Morphology | from ..morphology cimport Morphology | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("morphologizer", assigns=["token.morph", "token.pos"]) | ||||||
| class Morphologizer(Pipe): | class Morphologizer(Pipe): | ||||||
|     name = 'morphologizer' |  | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def Model(cls, **cfg): |     def Model(cls, **cfg): | ||||||
|  |  | ||||||
|  | @ -13,7 +13,6 @@ from thinc.misc import LayerNorm | ||||||
| from thinc.neural.util import to_categorical | from thinc.neural.util import to_categorical | ||||||
| from thinc.neural.util import get_array_module | from thinc.neural.util import get_array_module | ||||||
| 
 | 
 | ||||||
| from .functions import merge_subtokens |  | ||||||
| from ..tokens.doc cimport Doc | from ..tokens.doc cimport Doc | ||||||
| from ..syntax.nn_parser cimport Parser | from ..syntax.nn_parser cimport Parser | ||||||
| from ..syntax.ner cimport BiluoPushDown | from ..syntax.ner cimport BiluoPushDown | ||||||
|  | @ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager | ||||||
| from ..morphology cimport Morphology | from ..morphology cimport Morphology | ||||||
| from ..vocab cimport Vocab | from ..vocab cimport Vocab | ||||||
| 
 | 
 | ||||||
|  | from .functions import merge_subtokens | ||||||
|  | from ..language import Language, component | ||||||
| from ..syntax import nonproj | from ..syntax import nonproj | ||||||
| from ..attrs import POS, ID | from ..attrs import POS, ID | ||||||
| from ..parts_of_speech import X | from ..parts_of_speech import X | ||||||
|  | @ -54,6 +55,10 @@ class Pipe(object): | ||||||
|         """Initialize a model for the pipe.""" |         """Initialize a model for the pipe.""" | ||||||
|         raise NotImplementedError |         raise NotImplementedError | ||||||
| 
 | 
 | ||||||
|  |     @classmethod | ||||||
|  |     def from_nlp(cls, nlp, **cfg): | ||||||
|  |         return cls(nlp.vocab, **cfg) | ||||||
|  | 
 | ||||||
|     def __init__(self, vocab, model=True, **cfg): |     def __init__(self, vocab, model=True, **cfg): | ||||||
|         """Create a new pipe instance.""" |         """Create a new pipe instance.""" | ||||||
|         raise NotImplementedError |         raise NotImplementedError | ||||||
|  | @ -223,11 +228,10 @@ class Pipe(object): | ||||||
|         return self |         return self | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("tensorizer", assigns=["doc.tensor"]) | ||||||
| class Tensorizer(Pipe): | class Tensorizer(Pipe): | ||||||
|     """Pre-train position-sensitive vectors for tokens.""" |     """Pre-train position-sensitive vectors for tokens.""" | ||||||
| 
 | 
 | ||||||
|     name = "tensorizer" |  | ||||||
| 
 |  | ||||||
|     @classmethod |     @classmethod | ||||||
|     def Model(cls, output_size=300, **cfg): |     def Model(cls, output_size=300, **cfg): | ||||||
|         """Create a new statistical model for the class. |         """Create a new statistical model for the class. | ||||||
|  | @ -362,14 +366,13 @@ class Tensorizer(Pipe): | ||||||
|         return sgd |         return sgd | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("tagger", assigns=["token.tag", "token.pos"]) | ||||||
| class Tagger(Pipe): | class Tagger(Pipe): | ||||||
|     """Pipeline component for part-of-speech tagging. |     """Pipeline component for part-of-speech tagging. | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/tagger |     DOCS: https://spacy.io/api/tagger | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "tagger" |  | ||||||
| 
 |  | ||||||
|     def __init__(self, vocab, model=True, **cfg): |     def __init__(self, vocab, model=True, **cfg): | ||||||
|         self.vocab = vocab |         self.vocab = vocab | ||||||
|         self.model = model |         self.model = model | ||||||
|  | @ -514,7 +517,6 @@ class Tagger(Pipe): | ||||||
|         orig_tag_map = dict(self.vocab.morphology.tag_map) |         orig_tag_map = dict(self.vocab.morphology.tag_map) | ||||||
|         new_tag_map = OrderedDict() |         new_tag_map = OrderedDict() | ||||||
|         for raw_text, annots_brackets in get_gold_tuples(): |         for raw_text, annots_brackets in get_gold_tuples(): | ||||||
|             _ = annots_brackets.pop() |  | ||||||
|             for annots, brackets in annots_brackets: |             for annots, brackets in annots_brackets: | ||||||
|                 ids, words, tags, heads, deps, ents = annots |                 ids, words, tags, heads, deps, ents = annots | ||||||
|                 for tag in tags: |                 for tag in tags: | ||||||
|  | @ -657,13 +659,12 @@ class Tagger(Pipe): | ||||||
|         return self |         return self | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("nn_labeller") | ||||||
| class MultitaskObjective(Tagger): | class MultitaskObjective(Tagger): | ||||||
|     """Experimental: Assist training of a parser or tagger, by training a |     """Experimental: Assist training of a parser or tagger, by training a | ||||||
|     side-objective. |     side-objective. | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "nn_labeller" |  | ||||||
| 
 |  | ||||||
|     def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): |     def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): | ||||||
|         self.vocab = vocab |         self.vocab = vocab | ||||||
|         self.model = model |         self.model = model | ||||||
|  | @ -898,12 +899,12 @@ class ClozeMultitask(Pipe): | ||||||
|             losses[self.name] += loss |             losses[self.name] += loss | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("textcat", assigns=["doc.cats"]) | ||||||
| class TextCategorizer(Pipe): | class TextCategorizer(Pipe): | ||||||
|     """Pipeline component for text classification. |     """Pipeline component for text classification. | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/textcategorizer |     DOCS: https://spacy.io/api/textcategorizer | ||||||
|     """ |     """ | ||||||
|     name = 'textcat' |  | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|     def Model(cls, nr_class=1, **cfg): |     def Model(cls, nr_class=1, **cfg): | ||||||
|  | @ -1032,10 +1033,10 @@ class TextCategorizer(Pipe): | ||||||
|         return 1 |         return 1 | ||||||
| 
 | 
 | ||||||
|     def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): |     def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): | ||||||
|         for raw_text, annots_brackets in get_gold_tuples(): |         for raw_text, annot_brackets in get_gold_tuples(): | ||||||
|             cats = annots_brackets.pop() |             for _, (cats, _2) in annot_brackets:  | ||||||
|             for cat in cats: |                 for cat in cats: | ||||||
|                 self.add_label(cat) |                     self.add_label(cat) | ||||||
|         if self.model is True: |         if self.model is True: | ||||||
|             self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") |             self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") | ||||||
|             self.require_labels() |             self.require_labels() | ||||||
|  | @ -1051,8 +1052,11 @@ cdef class DependencyParser(Parser): | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/dependencyparser |     DOCS: https://spacy.io/api/dependencyparser | ||||||
|     """ |     """ | ||||||
| 
 |     # cdef classes can't have decorators, so we're defining this here | ||||||
|     name = "parser" |     name = "parser" | ||||||
|  |     factory = "parser" | ||||||
|  |     assigns = ["token.dep", "token.is_sent_start", "doc.sents"] | ||||||
|  |     requires = [] | ||||||
|     TransitionSystem = ArcEager |     TransitionSystem = ArcEager | ||||||
| 
 | 
 | ||||||
|     @property |     @property | ||||||
|  | @ -1097,8 +1101,10 @@ cdef class EntityRecognizer(Parser): | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/entityrecognizer |     DOCS: https://spacy.io/api/entityrecognizer | ||||||
|     """ |     """ | ||||||
| 
 |  | ||||||
|     name = "ner" |     name = "ner" | ||||||
|  |     factory = "ner" | ||||||
|  |     assigns = ["doc.ents", "token.ent_iob", "token.ent_type"] | ||||||
|  |     requires = [] | ||||||
|     TransitionSystem = BiluoPushDown |     TransitionSystem = BiluoPushDown | ||||||
|     nr_feature = 6 |     nr_feature = 6 | ||||||
| 
 | 
 | ||||||
|  | @ -1129,12 +1135,16 @@ cdef class EntityRecognizer(Parser): | ||||||
|         return tuple(sorted(labels)) |         return tuple(sorted(labels)) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component( | ||||||
|  |     "entity_linker", | ||||||
|  |     requires=["doc.ents", "token.ent_iob", "token.ent_type"], | ||||||
|  |     assigns=["token.ent_kb_id"] | ||||||
|  | ) | ||||||
| class EntityLinker(Pipe): | class EntityLinker(Pipe): | ||||||
|     """Pipeline component for named entity linking. |     """Pipeline component for named entity linking. | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/entitylinker |     DOCS: https://spacy.io/api/entitylinker | ||||||
|     """ |     """ | ||||||
|     name = 'entity_linker' |  | ||||||
|     NIL = "NIL"  # string used to refer to a non-existing link |     NIL = "NIL"  # string used to refer to a non-existing link | ||||||
| 
 | 
 | ||||||
|     @classmethod |     @classmethod | ||||||
|  | @ -1298,7 +1308,8 @@ class EntityLinker(Pipe): | ||||||
|                     for ent in sent_doc.ents: |                     for ent in sent_doc.ents: | ||||||
|                         entity_count += 1 |                         entity_count += 1 | ||||||
| 
 | 
 | ||||||
|                         if ent.label_ in self.cfg.get("labels_discard", []): |                         to_discard = self.cfg.get("labels_discard", []) | ||||||
|  |                         if to_discard and ent.label_ in to_discard: | ||||||
|                             # ignoring this entity - setting to NIL |                             # ignoring this entity - setting to NIL | ||||||
|                             final_kb_ids.append(self.NIL) |                             final_kb_ids.append(self.NIL) | ||||||
|                             final_tensors.append(sentence_encoding) |                             final_tensors.append(sentence_encoding) | ||||||
|  | @ -1404,13 +1415,13 @@ class EntityLinker(Pipe): | ||||||
|         raise NotImplementedError |         raise NotImplementedError | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @component("sentencizer", assigns=["token.is_sent_start", "doc.sents"]) | ||||||
| class Sentencizer(object): | class Sentencizer(object): | ||||||
|     """Segment the Doc into sentences using a rule-based strategy. |     """Segment the Doc into sentences using a rule-based strategy. | ||||||
| 
 | 
 | ||||||
|     DOCS: https://spacy.io/api/sentencizer |     DOCS: https://spacy.io/api/sentencizer | ||||||
|     """ |     """ | ||||||
| 
 | 
 | ||||||
|     name = "sentencizer" |  | ||||||
|     default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', |     default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', | ||||||
|             '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', |             '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', | ||||||
|             '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', |             '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', | ||||||
|  | @ -1436,6 +1447,10 @@ class Sentencizer(object): | ||||||
|         else: |         else: | ||||||
|             self.punct_chars = set(self.default_punct_chars) |             self.punct_chars = set(self.default_punct_chars) | ||||||
| 
 | 
 | ||||||
|  |     @classmethod | ||||||
|  |     def from_nlp(cls, nlp, **cfg): | ||||||
|  |         return cls(**cfg) | ||||||
|  | 
 | ||||||
|     def __call__(self, doc): |     def __call__(self, doc): | ||||||
|         """Apply the sentencizer to a Doc and set Token.is_sent_start. |         """Apply the sentencizer to a Doc and set Token.is_sent_start. | ||||||
| 
 | 
 | ||||||
|  | @ -1502,4 +1517,9 @@ class Sentencizer(object): | ||||||
|         return self |         return self | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | # Cython classes can't be decorated, so we need to add the factories here | ||||||
|  | Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg) | ||||||
|  | Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] | __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] | ||||||
|  |  | ||||||
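The hunks above replace hard-coded `name = "..."` attributes with a `@component(...)` decorator, and register the two Cython classes in `Language.factories` by hand because they cannot be decorated. The following is a minimal sketch of that registration pattern, not spaCy's actual implementation: the `Language` stub, the body of `component`, and the default punctuation set are all assumptions made for illustration.

```python
# Hypothetical stand-in for the real Language class; only the registry is modelled.
class Language:
    factories = {}


def component(name, assigns=(), requires=()):
    """Sketch of a class decorator that records metadata and registers a factory."""

    def decorate(cls):
        cls.name = name
        cls.factory = name
        cls.assigns = list(assigns)
        cls.requires = list(requires)
        # The factory defers to the class's own from_nlp constructor.
        Language.factories[name] = lambda nlp, **cfg: cls.from_nlp(nlp, **cfg)
        return cls

    return decorate


@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
class Sentencizer:
    def __init__(self, punct_chars=None):
        # Assumed default set, far smaller than the real default_punct_chars.
        self.punct_chars = set(punct_chars) if punct_chars else {".", "!", "?"}

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        # This component needs nothing from the nlp object, so it is ignored.
        return cls(**cfg)


# Build the component through the registry, as a pipeline would.
sentencizer = Language.factories["sentencizer"](None, punct_chars=["."])
```

Classes that cannot take a decorator, such as the `cdef` classes in the diff, would set the same attributes in the class body and add their factory entry after the class definition.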
|  | @ -19,7 +19,7 @@ from thinc.extra.search cimport Beam | ||||||
| from thinc.api import chain, clone | from thinc.api import chain, clone | ||||||
| from thinc.v2v import Model, Maxout, Affine | from thinc.v2v import Model, Maxout, Affine | ||||||
| from thinc.misc import LayerNorm | from thinc.misc import LayerNorm | ||||||
| from thinc.neural.ops import CupyOps | from thinc.neural.ops import CupyOps, NumpyOps | ||||||
| from thinc.neural.util import get_array_module | from thinc.neural.util import get_array_module | ||||||
| from thinc.linalg cimport Vec, VecVec | from thinc.linalg cimport Vec, VecVec | ||||||
| cimport blis.cy | cimport blis.cy | ||||||
|  | @ -440,28 +440,38 @@ cdef class precompute_hiddens: | ||||||
|         def backward(d_state_vector_ids, sgd=None): |         def backward(d_state_vector_ids, sgd=None): | ||||||
|             d_state_vector, token_ids = d_state_vector_ids |             d_state_vector, token_ids = d_state_vector_ids | ||||||
|             d_state_vector = bp_nonlinearity(d_state_vector, sgd) |             d_state_vector = bp_nonlinearity(d_state_vector, sgd) | ||||||
|             # This will usually be on GPU |  | ||||||
|             if not isinstance(d_state_vector, self.ops.xp.ndarray): |  | ||||||
|                 d_state_vector = self.ops.xp.array(d_state_vector) |  | ||||||
|             d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) |             d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) | ||||||
|             return d_tokens |             return d_tokens | ||||||
|         return state_vector, backward |         return state_vector, backward | ||||||
| 
 | 
 | ||||||
|     def _nonlinearity(self, state_vector): |     def _nonlinearity(self, state_vector): | ||||||
|  |         if isinstance(state_vector, numpy.ndarray): | ||||||
|  |             ops = NumpyOps() | ||||||
|  |         else: | ||||||
|  |             ops = CupyOps() | ||||||
|  |   | ||||||
|         if self.nP == 1: |         if self.nP == 1: | ||||||
|             state_vector = state_vector.reshape(state_vector.shape[:-1]) |             state_vector = state_vector.reshape(state_vector.shape[:-1]) | ||||||
|             mask = state_vector >= 0. |             mask = state_vector >= 0. | ||||||
|             state_vector *= mask |             state_vector *= mask | ||||||
|         else: |         else: | ||||||
|             state_vector, mask = self.ops.maxout(state_vector) |             state_vector, mask = ops.maxout(state_vector) | ||||||
| 
 | 
 | ||||||
|         def backprop_nonlinearity(d_best, sgd=None): |         def backprop_nonlinearity(d_best, sgd=None): | ||||||
|  |             if isinstance(d_best, numpy.ndarray): | ||||||
|  |                 ops = NumpyOps() | ||||||
|  |             else: | ||||||
|  |                 ops = CupyOps() | ||||||
|  |             mask_ = ops.asarray(mask) | ||||||
|  | 
 | ||||||
|  |             # This will usually be on GPU | ||||||
|  |             d_best = ops.asarray(d_best) | ||||||
|             # Fix nans (which can occur from unseen classes.) |             # Fix nans (which can occur from unseen classes.) | ||||||
|             d_best[self.ops.xp.isnan(d_best)] = 0. |             d_best[ops.xp.isnan(d_best)] = 0. | ||||||
|             if self.nP == 1: |             if self.nP == 1: | ||||||
|                 d_best *= mask |                 d_best *= mask_ | ||||||
|                 d_best = d_best.reshape((d_best.shape + (1,))) |                 d_best = d_best.reshape((d_best.shape + (1,))) | ||||||
|                 return d_best |                 return d_best | ||||||
|             else: |             else: | ||||||
|                 return self.ops.backprop_maxout(d_best, mask, self.nP) |                 return ops.backprop_maxout(d_best, mask_, self.nP) | ||||||
|         return state_vector, backprop_nonlinearity |         return state_vector, backprop_nonlinearity | ||||||
|  |  | ||||||
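For the single-piece case (`self.nP == 1`), the `_nonlinearity` hunk above keeps a mask from the forward pass and reuses it in the backward pass after zeroing NaN gradients. Below is a rough pure-Python sketch of that forward/backward pairing. It uses plain lists instead of NumPy or CuPy arrays and leaves out the reshape and the `NumpyOps`/`CupyOps` selection, so it only illustrates the masking logic; `relu_forward` is a hypothetical name.

```python
import math


def relu_forward(values):
    """Zero out negative activations and return a callback for the backward pass."""
    # True where the activation is kept, False where it is zeroed.
    mask = [v >= 0.0 for v in values]
    output = [v if keep else 0.0 for v, keep in zip(values, mask)]

    def backprop(d_output):
        # Replace NaN gradients with zero before applying the mask.
        cleaned = [0.0 if math.isnan(g) else g for g in d_output]
        # Gradients only flow through positions that were kept in the forward pass.
        return [g if keep else 0.0 for g, keep in zip(cleaned, mask)]

    return output, backprop


out, backprop = relu_forward([1.5, -2.0, 0.0, 3.0])
grads = backprop([0.1, 0.2, float("nan"), 0.4])
```

The diff additionally converts both the gradient and the stored mask with `ops.asarray`, so they are the same array type before they are multiplied.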
|  | @ -342,7 +342,6 @@ cdef class ArcEager(TransitionSystem): | ||||||
|             actions[RIGHT][label] = 1 |             actions[RIGHT][label] = 1 | ||||||
|             actions[REDUCE][label] = 1 |             actions[REDUCE][label] = 1 | ||||||
|         for raw_text, sents in kwargs.get('gold_parses', []): |         for raw_text, sents in kwargs.get('gold_parses', []): | ||||||
|             _ = sents.pop() |  | ||||||
|             for (ids, words, tags, heads, labels, iob), ctnts in sents: |             for (ids, words, tags, heads, labels, iob), ctnts in sents: | ||||||
|                 heads, labels = nonproj.projectivize(heads, labels) |                 heads, labels = nonproj.projectivize(heads, labels) | ||||||
|                 for child, head, label in zip(ids, heads, labels): |                 for child, head, label in zip(ids, heads, labels): | ||||||
|  |  | ||||||
|  | @ -73,7 +73,6 @@ cdef class BiluoPushDown(TransitionSystem): | ||||||
|                 actions[action][entity_type] = 1 |                 actions[action][entity_type] = 1 | ||||||
|         moves = ('M', 'B', 'I', 'L', 'U') |         moves = ('M', 'B', 'I', 'L', 'U') | ||||||
|         for raw_text, sents in kwargs.get('gold_parses', []): |         for raw_text, sents in kwargs.get('gold_parses', []): | ||||||
|             _ = sents.pop() |  | ||||||
|             for (ids, words, tags, heads, labels, biluo), _ in sents: |             for (ids, words, tags, heads, labels, biluo), _ in sents: | ||||||
|                 for i, ner_tag in enumerate(biluo): |                 for i, ner_tag in enumerate(biluo): | ||||||
|                     if ner_tag != 'O' and ner_tag != '-': |                     if ner_tag != 'O' and ner_tag != '-': | ||||||
|  |  | ||||||
|  | @ -57,7 +57,10 @@ cdef class Parser: | ||||||
|         subword_features = util.env_opt('subword_features', |         subword_features = util.env_opt('subword_features', | ||||||
|                             cfg.get('subword_features', True)) |                             cfg.get('subword_features', True)) | ||||||
|         conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) |         conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) | ||||||
|  |         conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1)) | ||||||
|  |         t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3)) | ||||||
|         bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) |         bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) | ||||||
|  |         self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0)) | ||||||
|         if depth != 1: |         if depth != 1: | ||||||
|             raise ValueError(TempErrors.T004.format(value=depth)) |             raise ValueError(TempErrors.T004.format(value=depth)) | ||||||
|         parser_maxout_pieces = util.env_opt('parser_maxout_pieces', |         parser_maxout_pieces = util.env_opt('parser_maxout_pieces', | ||||||
|  | @ -69,6 +72,8 @@ cdef class Parser: | ||||||
|         pretrained_vectors = cfg.get('pretrained_vectors', None) |         pretrained_vectors = cfg.get('pretrained_vectors', None) | ||||||
|         tok2vec = Tok2Vec(token_vector_width, embed_size, |         tok2vec = Tok2Vec(token_vector_width, embed_size, | ||||||
|                           conv_depth=conv_depth, |                           conv_depth=conv_depth, | ||||||
|  |                           conv_window=conv_window, | ||||||
|  |                           cnn_maxout_pieces=t2v_pieces, | ||||||
|                           subword_features=subword_features, |                           subword_features=subword_features, | ||||||
|                           pretrained_vectors=pretrained_vectors, |                           pretrained_vectors=pretrained_vectors, | ||||||
|                           bilstm_depth=bilstm_depth) |                           bilstm_depth=bilstm_depth) | ||||||
|  | @ -90,7 +95,12 @@ cdef class Parser: | ||||||
|             'hidden_width': hidden_width, |             'hidden_width': hidden_width, | ||||||
|             'maxout_pieces': parser_maxout_pieces, |             'maxout_pieces': parser_maxout_pieces, | ||||||
|             'pretrained_vectors': pretrained_vectors, |             'pretrained_vectors': pretrained_vectors, | ||||||
|             'bilstm_depth': bilstm_depth |             'bilstm_depth': bilstm_depth, | ||||||
|  |             'self_attn_depth': self_attn_depth, | ||||||
|  |             'conv_depth': conv_depth, | ||||||
|  |             'conv_window': conv_window, | ||||||
|  |             'embed_size': embed_size, | ||||||
|  |             'cnn_maxout_pieces': t2v_pieces | ||||||
|         } |         } | ||||||
|         return ParserModel(tok2vec, lower, upper), cfg |         return ParserModel(tok2vec, lower, upper), cfg | ||||||
| 
 | 
 | ||||||
|  | @ -128,6 +138,10 @@ cdef class Parser: | ||||||
|         self._multitasks = [] |         self._multitasks = [] | ||||||
|         self._rehearsal_model = None |         self._rehearsal_model = None | ||||||
| 
 | 
 | ||||||
|  |     @classmethod | ||||||
|  |     def from_nlp(cls, nlp, **cfg): | ||||||
|  |         return cls(nlp.vocab, **cfg) | ||||||
|  | 
 | ||||||
|     def __reduce__(self): |     def __reduce__(self): | ||||||
|         return (Parser, (self.vocab, self.moves, self.model), None, None) |         return (Parser, (self.vocab, self.moves, self.model), None, None) | ||||||
| 
 | 
 | ||||||
|  | @ -602,12 +616,11 @@ cdef class Parser: | ||||||
|             doc_sample = [] |             doc_sample = [] | ||||||
|             gold_sample = [] |             gold_sample = [] | ||||||
|             for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): |             for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): | ||||||
|                 _ = annots_brackets.pop() |  | ||||||
|                 for annots, brackets in annots_brackets: |                 for annots, brackets in annots_brackets: | ||||||
|                     ids, words, tags, heads, deps, ents = annots |                     ids, words, tags, heads, deps, ents = annots | ||||||
|                     doc_sample.append(Doc(self.vocab, words=words)) |                     doc_sample.append(Doc(self.vocab, words=words)) | ||||||
|                     gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, |                     gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, | ||||||
|                                                  heads=heads, deps=deps, ents=ents)) |                                                  heads=heads, deps=deps, entities=ents)) | ||||||
|             self.model.begin_training(doc_sample, gold_sample) |             self.model.begin_training(doc_sample, gold_sample) | ||||||
|             if pipeline is not None: |             if pipeline is not None: | ||||||
|                 self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) |                 self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) | ||||||
|  |  | ||||||
|  | @ -2,9 +2,9 @@ | ||||||
| from __future__ import unicode_literals | from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| import pytest | import pytest | ||||||
| from spacy.lang.sv.syntax_iterators import SYNTAX_ITERATORS |  | ||||||
| from ...util import get_doc | from ...util import get_doc | ||||||
| 
 | 
 | ||||||
|  | 
 | ||||||
| SV_NP_TEST_EXAMPLES = [ | SV_NP_TEST_EXAMPLES = [ | ||||||
|     ( |     ( | ||||||
|         "En student läste en bok",  # A student read a book |         "En student läste en bok",  # A student read a book | ||||||
|  | @ -45,4 +45,3 @@ def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chun | ||||||
|     assert len(noun_chunks) == len(expected_noun_chunks) |     assert len(noun_chunks) == len(expected_noun_chunks) | ||||||
|     for i, np in enumerate(noun_chunks): |     for i, np in enumerate(noun_chunks): | ||||||
|         assert np.text == expected_noun_chunks[i] |         assert np.text == expected_noun_chunks[i] | ||||||
| 
 |  | ||||||
|  |  | ||||||
|  | @ -17,7 +17,7 @@ def matcher(en_vocab): | ||||||
|     } |     } | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     for key, patterns in rules.items(): |     for key, patterns in rules.items(): | ||||||
|         matcher.add(key, None, *patterns) |         matcher.add(key, patterns) | ||||||
|     return matcher |     return matcher | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -25,11 +25,11 @@ def test_matcher_from_api_docs(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": "test"}] |     pattern = [{"ORTH": "test"}] | ||||||
|     assert len(matcher) == 0 |     assert len(matcher) == 0 | ||||||
|     matcher.add("Rule", None, pattern) |     matcher.add("Rule", [pattern]) | ||||||
|     assert len(matcher) == 1 |     assert len(matcher) == 1 | ||||||
|     matcher.remove("Rule") |     matcher.remove("Rule") | ||||||
|     assert "Rule" not in matcher |     assert "Rule" not in matcher | ||||||
|     matcher.add("Rule", None, pattern) |     matcher.add("Rule", [pattern]) | ||||||
|     assert "Rule" in matcher |     assert "Rule" in matcher | ||||||
|     on_match, patterns = matcher.get("Rule") |     on_match, patterns = matcher.get("Rule") | ||||||
|     assert len(patterns[0]) |     assert len(patterns[0]) | ||||||
|  | @ -52,7 +52,7 @@ def test_matcher_from_usage_docs(en_vocab): | ||||||
|         token.vocab[token.text].norm_ = "happy emoji" |         token.vocab[token.text].norm_ = "happy emoji" | ||||||
| 
 | 
 | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("HAPPY", label_sentiment, *pos_patterns) |     matcher.add("HAPPY", pos_patterns, on_match=label_sentiment) | ||||||
|     matcher(doc) |     matcher(doc) | ||||||
|     assert doc.sentiment != 0 |     assert doc.sentiment != 0 | ||||||
|     assert doc[1].norm_ == "happy emoji" |     assert doc[1].norm_ == "happy emoji" | ||||||
|  | @ -60,11 +60,33 @@ def test_matcher_from_usage_docs(en_vocab): | ||||||
| 
 | 
 | ||||||
| def test_matcher_len_contains(matcher): | def test_matcher_len_contains(matcher): | ||||||
|     assert len(matcher) == 3 |     assert len(matcher) == 3 | ||||||
|     matcher.add("TEST", None, [{"ORTH": "test"}]) |     matcher.add("TEST", [[{"ORTH": "test"}]]) | ||||||
|     assert "TEST" in matcher |     assert "TEST" in matcher | ||||||
|     assert "TEST2" not in matcher |     assert "TEST2" not in matcher | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def test_matcher_add_new_old_api(en_vocab): | ||||||
|  |     doc = Doc(en_vocab, words=["a", "b"]) | ||||||
|  |     patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]] | ||||||
|  |     matcher = Matcher(en_vocab) | ||||||
|  |     matcher.add("OLD_API", None, *patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     matcher = Matcher(en_vocab) | ||||||
|  |     on_match = Mock() | ||||||
|  |     matcher.add("OLD_API_CALLBACK", on_match, *patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     assert on_match.call_count == 2 | ||||||
|  |     # New API: add(key: str, patterns: List[List[dict]], on_match: Callable) | ||||||
|  |     matcher = Matcher(en_vocab) | ||||||
|  |     matcher.add("NEW_API", patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     matcher = Matcher(en_vocab) | ||||||
|  |     on_match = Mock() | ||||||
|  |     matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     assert on_match.call_count == 2 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
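`test_matcher_add_new_old_api` exercises both call styles: the old `add(key, on_match, *patterns)` and the new `add(key, patterns, on_match=...)`. One way a single signature could accept both is sketched below. `normalize_add_args` is a hypothetical helper written for illustration, not the Matcher's actual code, and its error messages are invented.

```python
def normalize_add_args(key, patterns, *_patterns, on_match=None):
    """Return (key, patterns, on_match) regardless of which call style was used."""
    if on_match is not None and not callable(on_match):
        raise ValueError("on_match must be callable or None")
    if patterns is None or callable(patterns):
        # Old style: the second positional argument is the callback,
        # and the patterns follow as extra positional arguments.
        on_match = patterns
        patterns = list(_patterns)
    for pattern in patterns:
        # Each pattern must itself be a list of token specs (dicts).
        if not isinstance(pattern, list):
            raise ValueError("expected a list of patterns for key %r" % key)
        if len(pattern) == 0:
            raise ValueError("empty pattern for key %r" % key)
    return key, list(patterns), on_match


p1 = [{"TEXT": "a"}]
p2 = [{"TEXT": "a"}, {"TEXT": "b"}]

old_style = normalize_add_args("OLD_API", None, p1, p2)
new_style = normalize_add_args("NEW_API", [p1, p2])
```

With this check, passing a single pattern where a list of patterns is expected fails: iterating over it yields dicts rather than lists. That is the mistake `test_matcher_basic_check` guards against.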
| def test_matcher_no_match(matcher): | def test_matcher_no_match(matcher): | ||||||
|     doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."]) |     doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."]) | ||||||
|     assert matcher(doc) == [] |     assert matcher(doc) == [] | ||||||
|  | @ -100,12 +122,12 @@ def test_matcher_empty_dict(en_vocab): | ||||||
|     """Test matcher allows empty token specs, meaning match on any token.""" |     """Test matcher allows empty token specs, meaning match on any token.""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     doc = Doc(matcher.vocab, words=["a", "b", "c"]) |     doc = Doc(matcher.vocab, words=["a", "b", "c"]) | ||||||
|     matcher.add("A.C", None, [{"ORTH": "a"}, {}, {"ORTH": "c"}]) |     matcher.add("A.C", [[{"ORTH": "a"}, {}, {"ORTH": "c"}]]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|     assert matches[0][1:] == (0, 3) |     assert matches[0][1:] == (0, 3) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("A.", None, [{"ORTH": "a"}, {}]) |     matcher.add("A.", [[{"ORTH": "a"}, {}]]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert matches[0][1:] == (0, 2) |     assert matches[0][1:] == (0, 2) | ||||||
| 
 | 
 | ||||||
|  | @ -114,7 +136,7 @@ def test_matcher_operator_shadow(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     doc = Doc(matcher.vocab, words=["a", "b", "c"]) |     doc = Doc(matcher.vocab, words=["a", "b", "c"]) | ||||||
|     pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}] |     pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}] | ||||||
|     matcher.add("A.C", None, pattern) |     matcher.add("A.C", [pattern]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|     assert matches[0][1:] == (0, 3) |     assert matches[0][1:] == (0, 3) | ||||||
|  | @ -136,12 +158,12 @@ def test_matcher_match_zero(matcher): | ||||||
|         {"IS_PUNCT": True}, |         {"IS_PUNCT": True}, | ||||||
|         {"ORTH": '"'}, |         {"ORTH": '"'}, | ||||||
|     ] |     ] | ||||||
|     matcher.add("Quote", None, pattern1) |     matcher.add("Quote", [pattern1]) | ||||||
|     doc = Doc(matcher.vocab, words=words1) |     doc = Doc(matcher.vocab, words=words1) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     doc = Doc(matcher.vocab, words=words2) |     doc = Doc(matcher.vocab, words=words2) | ||||||
|     assert len(matcher(doc)) == 0 |     assert len(matcher(doc)) == 0 | ||||||
|     matcher.add("Quote", None, pattern2) |     matcher.add("Quote", [pattern2]) | ||||||
|     assert len(matcher(doc)) == 0 |     assert len(matcher(doc)) == 0 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -149,7 +171,7 @@ def test_matcher_match_zero_plus(matcher): | ||||||
|     words = 'He said , " some words " ...'.split() |     words = 'He said , " some words " ...'.split() | ||||||
|     pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}] |     pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}] | ||||||
|     matcher = Matcher(matcher.vocab) |     matcher = Matcher(matcher.vocab) | ||||||
|     matcher.add("Quote", None, pattern) |     matcher.add("Quote", [pattern]) | ||||||
|     doc = Doc(matcher.vocab, words=words) |     doc = Doc(matcher.vocab, words=words) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
| 
 | 
 | ||||||
|  | @ -160,11 +182,8 @@ def test_matcher_match_one_plus(matcher): | ||||||
|     doc = Doc(control.vocab, words=["Philippe", "Philippe"]) |     doc = Doc(control.vocab, words=["Philippe", "Philippe"]) | ||||||
|     m = control(doc) |     m = control(doc) | ||||||
|     assert len(m) == 2 |     assert len(m) == 2 | ||||||
|     matcher.add( |     pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}] | ||||||
|         "KleenePhilippe", |     matcher.add("KleenePhilippe", [pattern]) | ||||||
|         None, |  | ||||||
|         [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}], |  | ||||||
|     ) |  | ||||||
|     m = matcher(doc) |     m = matcher(doc) | ||||||
|     assert len(m) == 1 |     assert len(m) == 1 | ||||||
| 
 | 
 | ||||||
|  | @ -172,7 +191,7 @@ def test_matcher_match_one_plus(matcher): | ||||||
| def test_matcher_any_token_operator(en_vocab): | def test_matcher_any_token_operator(en_vocab): | ||||||
|     """Test that patterns with "any token" {} work with operators.""" |     """Test that patterns with "any token" {} work with operators.""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"ORTH": "test"}, {"OP": "*"}]) |     matcher.add("TEST", [[{"ORTH": "test"}, {"OP": "*"}]]) | ||||||
|     doc = Doc(en_vocab, words=["test", "hello", "world"]) |     doc = Doc(en_vocab, words=["test", "hello", "world"]) | ||||||
|     matches = [doc[start:end].text for _, start, end in matcher(doc)] |     matches = [doc[start:end].text for _, start, end in matcher(doc)] | ||||||
|     assert len(matches) == 3 |     assert len(matches) == 3 | ||||||
|  | @ -186,7 +205,7 @@ def test_matcher_extension_attribute(en_vocab): | ||||||
|     get_is_fruit = lambda token: token.text in ("apple", "banana") |     get_is_fruit = lambda token: token.text in ("apple", "banana") | ||||||
|     Token.set_extension("is_fruit", getter=get_is_fruit, force=True) |     Token.set_extension("is_fruit", getter=get_is_fruit, force=True) | ||||||
|     pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}] |     pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}] | ||||||
|     matcher.add("HAVING_FRUIT", None, pattern) |     matcher.add("HAVING_FRUIT", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["an", "apple"]) |     doc = Doc(en_vocab, words=["an", "apple"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|  | @ -198,7 +217,7 @@ def test_matcher_extension_attribute(en_vocab): | ||||||
| def test_matcher_set_value(en_vocab): | def test_matcher_set_value(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": {"IN": ["an", "a"]}}] |     pattern = [{"ORTH": {"IN": ["an", "a"]}}] | ||||||
|     matcher.add("A_OR_AN", None, pattern) |     matcher.add("A_OR_AN", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["an", "a", "apple"]) |     doc = Doc(en_vocab, words=["an", "a", "apple"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -210,7 +229,7 @@ def test_matcher_set_value(en_vocab): | ||||||
| def test_matcher_set_value_operator(en_vocab): | def test_matcher_set_value_operator(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}] |     pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}] | ||||||
|     matcher.add("DET_HOUSE", None, pattern) |     matcher.add("DET_HOUSE", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["In", "a", "house"]) |     doc = Doc(en_vocab, words=["In", "a", "house"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -222,7 +241,7 @@ def test_matcher_set_value_operator(en_vocab): | ||||||
| def test_matcher_regex(en_vocab): | def test_matcher_regex(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}] |     pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}] | ||||||
|     matcher.add("A_OR_AN", None, pattern) |     matcher.add("A_OR_AN", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["an", "a", "hi"]) |     doc = Doc(en_vocab, words=["an", "a", "hi"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -234,7 +253,7 @@ def test_matcher_regex(en_vocab): | ||||||
| def test_matcher_regex_shape(en_vocab): | def test_matcher_regex_shape(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}] |     pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}] | ||||||
|     matcher.add("NON_ALPHA", None, pattern) |     matcher.add("NON_ALPHA", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["99", "problems", "!"]) |     doc = Doc(en_vocab, words=["99", "problems", "!"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -246,7 +265,7 @@ def test_matcher_regex_shape(en_vocab): | ||||||
| def test_matcher_compare_length(en_vocab): | def test_matcher_compare_length(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"LENGTH": {">=": 2}}] |     pattern = [{"LENGTH": {">=": 2}}] | ||||||
|     matcher.add("LENGTH_COMPARE", None, pattern) |     matcher.add("LENGTH_COMPARE", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["a", "aa", "aaa"]) |     doc = Doc(en_vocab, words=["a", "aa", "aaa"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -260,7 +279,7 @@ def test_matcher_extension_set_membership(en_vocab): | ||||||
|     get_reversed = lambda token: "".join(reversed(token.text)) |     get_reversed = lambda token: "".join(reversed(token.text)) | ||||||
|     Token.set_extension("reversed", getter=get_reversed, force=True) |     Token.set_extension("reversed", getter=get_reversed, force=True) | ||||||
|     pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}] |     pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}] | ||||||
|     matcher.add("REVERSED", None, pattern) |     matcher.add("REVERSED", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["hi", "bye", "hello"]) |     doc = Doc(en_vocab, words=["hi", "bye", "hello"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -328,9 +347,9 @@ def dependency_matcher(en_vocab): | ||||||
|     ] |     ] | ||||||
| 
 | 
 | ||||||
|     matcher = DependencyMatcher(en_vocab) |     matcher = DependencyMatcher(en_vocab) | ||||||
|     matcher.add("pattern1", None, pattern1) |     matcher.add("pattern1", [pattern1]) | ||||||
|     matcher.add("pattern2", None, pattern2) |     matcher.add("pattern2", [pattern2]) | ||||||
|     matcher.add("pattern3", None, pattern3) |     matcher.add("pattern3", [pattern3]) | ||||||
| 
 | 
 | ||||||
|     return matcher |     return matcher | ||||||
| 
 | 
 | ||||||
|  | @ -347,6 +366,14 @@ def test_dependency_matcher_compile(dependency_matcher): | ||||||
| #     assert matches[2][1] == [[4, 3, 2]] | #     assert matches[2][1] == [[4, 3, 2]] | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def test_matcher_basic_check(en_vocab): | ||||||
|  |     matcher = Matcher(en_vocab) | ||||||
|  |     # Potential mistake: pass in pattern instead of list of patterns | ||||||
|  |     pattern = [{"TEXT": "hello"}, {"TEXT": "world"}] | ||||||
|  |     with pytest.raises(ValueError): | ||||||
|  |         matcher.add("TEST", pattern) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| def test_attr_pipeline_checks(en_vocab): | def test_attr_pipeline_checks(en_vocab): | ||||||
|     doc1 = Doc(en_vocab, words=["Test"]) |     doc1 = Doc(en_vocab, words=["Test"]) | ||||||
|     doc1.is_parsed = True |     doc1.is_parsed = True | ||||||
|  | @ -355,7 +382,7 @@ def test_attr_pipeline_checks(en_vocab): | ||||||
|     doc3 = Doc(en_vocab, words=["Test"]) |     doc3 = Doc(en_vocab, words=["Test"]) | ||||||
|     # DEP requires is_parsed |     # DEP requires is_parsed | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"DEP": "a"}]) |     matcher.add("TEST", [[{"DEP": "a"}]]) | ||||||
|     matcher(doc1) |     matcher(doc1) | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|         matcher(doc2) |         matcher(doc2) | ||||||
|  | @ -364,7 +391,7 @@ def test_attr_pipeline_checks(en_vocab): | ||||||
|     # TAG, POS, LEMMA require is_tagged |     # TAG, POS, LEMMA require is_tagged | ||||||
|     for attr in ("TAG", "POS", "LEMMA"): |     for attr in ("TAG", "POS", "LEMMA"): | ||||||
|         matcher = Matcher(en_vocab) |         matcher = Matcher(en_vocab) | ||||||
|         matcher.add("TEST", None, [{attr: "a"}]) |         matcher.add("TEST", [[{attr: "a"}]]) | ||||||
|         matcher(doc2) |         matcher(doc2) | ||||||
|         with pytest.raises(ValueError): |         with pytest.raises(ValueError): | ||||||
|             matcher(doc1) |             matcher(doc1) | ||||||
|  | @ -372,12 +399,12 @@ def test_attr_pipeline_checks(en_vocab): | ||||||
|             matcher(doc3) |             matcher(doc3) | ||||||
|     # TEXT/ORTH only require tokens |     # TEXT/ORTH only require tokens | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"ORTH": "a"}]) |     matcher.add("TEST", [[{"ORTH": "a"}]]) | ||||||
|     matcher(doc1) |     matcher(doc1) | ||||||
|     matcher(doc2) |     matcher(doc2) | ||||||
|     matcher(doc3) |     matcher(doc3) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"TEXT": "a"}]) |     matcher.add("TEST", [[{"TEXT": "a"}]]) | ||||||
|     matcher(doc1) |     matcher(doc1) | ||||||
|     matcher(doc2) |     matcher(doc2) | ||||||
|     matcher(doc3) |     matcher(doc3) | ||||||
|  | @ -407,7 +434,7 @@ def test_attr_pipeline_checks(en_vocab): | ||||||
| def test_matcher_schema_token_attributes(en_vocab, pattern, text): | def test_matcher_schema_token_attributes(en_vocab, pattern, text): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     doc = Doc(en_vocab, words=text.split(" ")) |     doc = Doc(en_vocab, words=text.split(" ")) | ||||||
|     matcher.add("Rule", None, pattern) |     matcher.add("Rule", [pattern]) | ||||||
|     assert len(matcher) == 1 |     assert len(matcher) == 1 | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|  | @ -417,7 +444,7 @@ def test_matcher_valid_callback(en_vocab): | ||||||
|     """Test that on_match can only be None or callable.""" |     """Test that on_match can only be None or callable.""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|         matcher.add("TEST", [], [{"TEXT": "test"}]) |         matcher.add("TEST", [[{"TEXT": "test"}]], on_match=[]) | ||||||
|     matcher(Doc(en_vocab, words=["test"])) |     matcher(Doc(en_vocab, words=["test"])) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -425,7 +452,7 @@ def test_matcher_callback(en_vocab): | ||||||
|     mock = Mock() |     mock = Mock() | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": "test"}] |     pattern = [{"ORTH": "test"}] | ||||||
|     matcher.add("Rule", mock, pattern) |     matcher.add("Rule", [pattern], on_match=mock) | ||||||
|     doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) |     doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     mock.assert_called_once_with(matcher, doc, 0, matches) |     mock.assert_called_once_with(matcher, doc, 0, matches) | ||||||
|  |  | ||||||
|  | @ -55,7 +55,7 @@ def test_greedy_matching(doc, text, pattern, re_pattern): | ||||||
|     """Test that the greedy matching behavior of the * op is consistent with |     """Test that the greedy matching behavior of the * op is consistent with | ||||||
|     other re implementations.""" |     other re implementations.""" | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add(re_pattern, None, pattern) |     matcher.add(re_pattern, [pattern]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     re_matches = [m.span() for m in re.finditer(re_pattern, text)] |     re_matches = [m.span() for m in re.finditer(re_pattern, text)] | ||||||
|     for match, re_match in zip(matches, re_matches): |     for match, re_match in zip(matches, re_matches): | ||||||
|  | @ -77,7 +77,7 @@ def test_match_consuming(doc, text, pattern, re_pattern): | ||||||
|     """Test that matcher.__call__ consumes tokens on a match similar to |     """Test that matcher.__call__ consumes tokens on a match similar to | ||||||
|     re.findall.""" |     re.findall.""" | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add(re_pattern, None, pattern) |     matcher.add(re_pattern, [pattern]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     re_matches = [m.span() for m in re.finditer(re_pattern, text)] |     re_matches = [m.span() for m in re.finditer(re_pattern, text)] | ||||||
|     assert len(matches) == len(re_matches) |     assert len(matches) == len(re_matches) | ||||||
|  | @ -111,7 +111,7 @@ def test_operator_combos(en_vocab): | ||||||
|                 pattern.append({"ORTH": part[0], "OP": "+"}) |                 pattern.append({"ORTH": part[0], "OP": "+"}) | ||||||
|             else: |             else: | ||||||
|                 pattern.append({"ORTH": part}) |                 pattern.append({"ORTH": part}) | ||||||
|         matcher.add("PATTERN", None, pattern) |         matcher.add("PATTERN", [pattern]) | ||||||
|         matches = matcher(doc) |         matches = matcher(doc) | ||||||
|         if result: |         if result: | ||||||
|             assert matches, (string, pattern_str) |             assert matches, (string, pattern_str) | ||||||
|  | @ -123,7 +123,7 @@ def test_matcher_end_zero_plus(en_vocab): | ||||||
|     """Test matcher works when patterns end with * operator. (issue 1450)""" |     """Test matcher works when patterns end with * operator. (issue 1450)""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] |     pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] | ||||||
|     matcher.add("TSTEND", None, pattern) |     matcher.add("TSTEND", [pattern]) | ||||||
|     nlp = lambda string: Doc(matcher.vocab, words=string.split()) |     nlp = lambda string: Doc(matcher.vocab, words=string.split()) | ||||||
|     assert len(matcher(nlp("a"))) == 1 |     assert len(matcher(nlp("a"))) == 1 | ||||||
|     assert len(matcher(nlp("a b"))) == 2 |     assert len(matcher(nlp("a b"))) == 2 | ||||||
|  | @ -140,7 +140,7 @@ def test_matcher_sets_return_correct_tokens(en_vocab): | ||||||
|         [{"LOWER": {"IN": ["one"]}}], |         [{"LOWER": {"IN": ["one"]}}], | ||||||
|         [{"LOWER": {"IN": ["two"]}}], |         [{"LOWER": {"IN": ["two"]}}], | ||||||
|     ] |     ] | ||||||
|     matcher.add("TEST", None, *patterns) |     matcher.add("TEST", patterns) | ||||||
|     doc = Doc(en_vocab, words="zero one two three".split()) |     doc = Doc(en_vocab, words="zero one two three".split()) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     texts = [Span(doc, s, e, label=L).text for L, s, e in matches] |     texts = [Span(doc, s, e, label=L).text for L, s, e in matches] | ||||||
|  | @ -154,7 +154,7 @@ def test_matcher_remove(): | ||||||
| 
 | 
 | ||||||
|     pattern = [{"ORTH": "test"}, {"OP": "?"}] |     pattern = [{"ORTH": "test"}, {"OP": "?"}] | ||||||
|     assert len(matcher) == 0 |     assert len(matcher) == 0 | ||||||
|     matcher.add("Rule", None, pattern) |     matcher.add("Rule", [pattern]) | ||||||
|     assert "Rule" in matcher |     assert "Rule" in matcher | ||||||
| 
 | 
 | ||||||
|     # should give two matches |     # should give two matches | ||||||
|  |  | ||||||
|  | @ -50,7 +50,7 @@ def validator(): | ||||||
| def test_matcher_pattern_validation(en_vocab, pattern): | def test_matcher_pattern_validation(en_vocab, pattern): | ||||||
|     matcher = Matcher(en_vocab, validate=True) |     matcher = Matcher(en_vocab, validate=True) | ||||||
|     with pytest.raises(MatchPatternError): |     with pytest.raises(MatchPatternError): | ||||||
|         matcher.add("TEST", None, pattern) |         matcher.add("TEST", [pattern]) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) | @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) | ||||||
|  | @ -71,6 +71,6 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     if n_min_errors > 0: |     if n_min_errors > 0: | ||||||
|         with pytest.raises(ValueError): |         with pytest.raises(ValueError): | ||||||
|             matcher.add("TEST", None, pattern) |             matcher.add("TEST", [pattern]) | ||||||
|     elif n_errors == 0: |     elif n_errors == 0: | ||||||
|         matcher.add("TEST", None, pattern) |         matcher.add("TEST", [pattern]) | ||||||
|  |  | ||||||
|  | @ -13,53 +13,75 @@ def test_matcher_phrase_matcher(en_vocab): | ||||||
|     # intermediate phrase |     # intermediate phrase | ||||||
|     pattern = Doc(en_vocab, words=["Google", "Now"]) |     pattern = Doc(en_vocab, words=["Google", "Now"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("COMPANY", None, pattern) |     matcher.add("COMPANY", [pattern]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     # initial token |     # initial token | ||||||
|     pattern = Doc(en_vocab, words=["I"]) |     pattern = Doc(en_vocab, words=["I"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("I", None, pattern) |     matcher.add("I", [pattern]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     # initial phrase |     # initial phrase | ||||||
|     pattern = Doc(en_vocab, words=["I", "like"]) |     pattern = Doc(en_vocab, words=["I", "like"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("ILIKE", None, pattern) |     matcher.add("ILIKE", [pattern]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     # final token |     # final token | ||||||
|     pattern = Doc(en_vocab, words=["best"]) |     pattern = Doc(en_vocab, words=["best"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("BEST", None, pattern) |     matcher.add("BEST", [pattern]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     # final phrase |     # final phrase | ||||||
|     pattern = Doc(en_vocab, words=["Now", "best"]) |     pattern = Doc(en_vocab, words=["Now", "best"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("NOWBEST", None, pattern) |     matcher.add("NOWBEST", [pattern]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_phrase_matcher_length(en_vocab): | def test_phrase_matcher_length(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     assert len(matcher) == 0 |     assert len(matcher) == 0 | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["test"])) |     matcher.add("TEST", [Doc(en_vocab, words=["test"])]) | ||||||
|     assert len(matcher) == 1 |     assert len(matcher) == 1 | ||||||
|     matcher.add("TEST2", None, Doc(en_vocab, words=["test2"])) |     matcher.add("TEST2", [Doc(en_vocab, words=["test2"])]) | ||||||
|     assert len(matcher) == 2 |     assert len(matcher) == 2 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_phrase_matcher_contains(en_vocab): | def test_phrase_matcher_contains(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["test"])) |     matcher.add("TEST", [Doc(en_vocab, words=["test"])]) | ||||||
|     assert "TEST" in matcher |     assert "TEST" in matcher | ||||||
|     assert "TEST2" not in matcher |     assert "TEST2" not in matcher | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def test_phrase_matcher_add_new_api(en_vocab): | ||||||
|  |     doc = Doc(en_vocab, words=["a", "b"]) | ||||||
|  |     patterns = [Doc(en_vocab, words=["a"]), Doc(en_vocab, words=["a", "b"])] | ||||||
|  |     matcher = PhraseMatcher(en_vocab) | ||||||
|  |     matcher.add("OLD_API", None, *patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     matcher = PhraseMatcher(en_vocab) | ||||||
|  |     on_match = Mock() | ||||||
|  |     matcher.add("OLD_API_CALLBACK", on_match, *patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     assert on_match.call_count == 2 | ||||||
|  |     # New API: add(key: str, patterns: List[List[dict]], on_match: Callable) | ||||||
|  |     matcher = PhraseMatcher(en_vocab) | ||||||
|  |     matcher.add("NEW_API", patterns) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     matcher = PhraseMatcher(en_vocab) | ||||||
|  |     on_match = Mock() | ||||||
|  |     matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match) | ||||||
|  |     assert len(matcher(doc)) == 2 | ||||||
|  |     assert on_match.call_count == 2 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| def test_phrase_matcher_repeated_add(en_vocab): | def test_phrase_matcher_repeated_add(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     # match ID only gets added once |     # match ID only gets added once | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST", [Doc(en_vocab, words=["like"])]) | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST", [Doc(en_vocab, words=["like"])]) | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST", [Doc(en_vocab, words=["like"])]) | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST", [Doc(en_vocab, words=["like"])]) | ||||||
|     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) |     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) | ||||||
|     assert "TEST" in matcher |     assert "TEST" in matcher | ||||||
|     assert "TEST2" not in matcher |     assert "TEST2" not in matcher | ||||||
|  | @ -68,8 +90,8 @@ def test_phrase_matcher_repeated_add(en_vocab): | ||||||
| 
 | 
 | ||||||
| def test_phrase_matcher_remove(en_vocab): | def test_phrase_matcher_remove(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("TEST1", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST1", [Doc(en_vocab, words=["like"])]) | ||||||
|     matcher.add("TEST2", None, Doc(en_vocab, words=["best"])) |     matcher.add("TEST2", [Doc(en_vocab, words=["best"])]) | ||||||
|     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) |     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) | ||||||
|     assert "TEST1" in matcher |     assert "TEST1" in matcher | ||||||
|     assert "TEST2" in matcher |     assert "TEST2" in matcher | ||||||
|  | @ -95,9 +117,9 @@ def test_phrase_matcher_remove(en_vocab): | ||||||
| 
 | 
 | ||||||
| def test_phrase_matcher_overlapping_with_remove(en_vocab): | def test_phrase_matcher_overlapping_with_remove(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("TEST", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST", [Doc(en_vocab, words=["like"])]) | ||||||
|     # TEST2 is added alongside TEST |     # TEST2 is added alongside TEST | ||||||
|     matcher.add("TEST2", None, Doc(en_vocab, words=["like"])) |     matcher.add("TEST2", [Doc(en_vocab, words=["like"])]) | ||||||
|     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) |     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) | ||||||
|     assert "TEST" in matcher |     assert "TEST" in matcher | ||||||
|     assert len(matcher) == 2 |     assert len(matcher) == 2 | ||||||
|  | @ -122,7 +144,7 @@ def test_phrase_matcher_string_attrs(en_vocab): | ||||||
|     pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"] |     pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"] | ||||||
|     pattern = get_doc(en_vocab, words=words1, pos=pos1) |     pattern = get_doc(en_vocab, words=words1, pos=pos1) | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="POS") |     matcher = PhraseMatcher(en_vocab, attr="POS") | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = get_doc(en_vocab, words=words2, pos=pos2) |     doc = get_doc(en_vocab, words=words2, pos=pos2) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|  | @ -140,7 +162,7 @@ def test_phrase_matcher_string_attrs_negative(en_vocab): | ||||||
|     pos2 = ["X", "X", "X"] |     pos2 = ["X", "X", "X"] | ||||||
|     pattern = get_doc(en_vocab, words=words1, pos=pos1) |     pattern = get_doc(en_vocab, words=words1, pos=pos1) | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="POS") |     matcher = PhraseMatcher(en_vocab, attr="POS") | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = get_doc(en_vocab, words=words2, pos=pos2) |     doc = get_doc(en_vocab, words=words2, pos=pos2) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 0 |     assert len(matches) == 0 | ||||||
|  | @ -151,7 +173,7 @@ def test_phrase_matcher_bool_attrs(en_vocab): | ||||||
|     words2 = ["No", "problem", ",", "he", "said", "."] |     words2 = ["No", "problem", ",", "he", "said", "."] | ||||||
|     pattern = Doc(en_vocab, words=words1) |     pattern = Doc(en_vocab, words=words1) | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT") |     matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT") | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=words2) |     doc = Doc(en_vocab, words=words2) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -173,15 +195,15 @@ def test_phrase_matcher_validation(en_vocab): | ||||||
|     doc3 = Doc(en_vocab, words=["Test"]) |     doc3 = Doc(en_vocab, words=["Test"]) | ||||||
|     matcher = PhraseMatcher(en_vocab, validate=True) |     matcher = PhraseMatcher(en_vocab, validate=True) | ||||||
|     with pytest.warns(UserWarning): |     with pytest.warns(UserWarning): | ||||||
|         matcher.add("TEST1", None, doc1) |         matcher.add("TEST1", [doc1]) | ||||||
|     with pytest.warns(UserWarning): |     with pytest.warns(UserWarning): | ||||||
|         matcher.add("TEST2", None, doc2) |         matcher.add("TEST2", [doc2]) | ||||||
|     with pytest.warns(None) as record: |     with pytest.warns(None) as record: | ||||||
|         matcher.add("TEST3", None, doc3) |         matcher.add("TEST3", [doc3]) | ||||||
|         assert not record.list |         assert not record.list | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="POS", validate=True) |     matcher = PhraseMatcher(en_vocab, attr="POS", validate=True) | ||||||
|     with pytest.warns(None) as record: |     with pytest.warns(None) as record: | ||||||
|         matcher.add("TEST4", None, doc2) |         matcher.add("TEST4", [doc2]) | ||||||
|         assert not record.list |         assert not record.list | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -198,24 +220,24 @@ def test_attr_pipeline_checks(en_vocab): | ||||||
|     doc3 = Doc(en_vocab, words=["Test"]) |     doc3 = Doc(en_vocab, words=["Test"]) | ||||||
|     # DEP requires is_parsed |     # DEP requires is_parsed | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="DEP") |     matcher = PhraseMatcher(en_vocab, attr="DEP") | ||||||
|     matcher.add("TEST1", None, doc1) |     matcher.add("TEST1", [doc1]) | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|         matcher.add("TEST2", None, doc2) |         matcher.add("TEST2", [doc2]) | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|         matcher.add("TEST3", None, doc3) |         matcher.add("TEST3", [doc3]) | ||||||
|     # TAG, POS, LEMMA require is_tagged |     # TAG, POS, LEMMA require is_tagged | ||||||
|     for attr in ("TAG", "POS", "LEMMA"): |     for attr in ("TAG", "POS", "LEMMA"): | ||||||
|         matcher = PhraseMatcher(en_vocab, attr=attr) |         matcher = PhraseMatcher(en_vocab, attr=attr) | ||||||
|         matcher.add("TEST2", None, doc2) |         matcher.add("TEST2", [doc2]) | ||||||
|         with pytest.raises(ValueError): |         with pytest.raises(ValueError): | ||||||
|             matcher.add("TEST1", None, doc1) |             matcher.add("TEST1", [doc1]) | ||||||
|         with pytest.raises(ValueError): |         with pytest.raises(ValueError): | ||||||
|             matcher.add("TEST3", None, doc3) |             matcher.add("TEST3", [doc3]) | ||||||
|     # TEXT/ORTH only require tokens |     # TEXT/ORTH only require tokens | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="ORTH") |     matcher = PhraseMatcher(en_vocab, attr="ORTH") | ||||||
|     matcher.add("TEST3", None, doc3) |     matcher.add("TEST3", [doc3]) | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="TEXT") |     matcher = PhraseMatcher(en_vocab, attr="TEXT") | ||||||
|     matcher.add("TEST3", None, doc3) |     matcher.add("TEST3", [doc3]) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_phrase_matcher_callback(en_vocab): | def test_phrase_matcher_callback(en_vocab): | ||||||
|  | @ -223,7 +245,7 @@ def test_phrase_matcher_callback(en_vocab): | ||||||
|     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) |     doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) | ||||||
|     pattern = Doc(en_vocab, words=["Google", "Now"]) |     pattern = Doc(en_vocab, words=["Google", "Now"]) | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("COMPANY", mock, pattern) |     matcher.add("COMPANY", [pattern], on_match=mock) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     mock.assert_called_once_with(matcher, doc, 0, matches) |     mock.assert_called_once_with(matcher, doc, 0, matches) | ||||||
| 
 | 
 | ||||||
|  | @ -234,5 +256,13 @@ def test_phrase_matcher_remove_overlapping_patterns(en_vocab): | ||||||
|     pattern2 = Doc(en_vocab, words=["this", "is"]) |     pattern2 = Doc(en_vocab, words=["this", "is"]) | ||||||
|     pattern3 = Doc(en_vocab, words=["this", "is", "a"]) |     pattern3 = Doc(en_vocab, words=["this", "is", "a"]) | ||||||
|     pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"]) |     pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"]) | ||||||
|     matcher.add("THIS", None, pattern1, pattern2, pattern3, pattern4) |     matcher.add("THIS", [pattern1, pattern2, pattern3, pattern4]) | ||||||
|     matcher.remove("THIS") |     matcher.remove("THIS") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_phrase_matcher_basic_check(en_vocab): | ||||||
|  |     matcher = PhraseMatcher(en_vocab) | ||||||
|  |     # Potential mistake: pass in pattern instead of list of patterns | ||||||
|  |     pattern = Doc(en_vocab, words=["hello", "world"]) | ||||||
|  |     with pytest.raises(ValueError): | ||||||
|  |         matcher.add("TEST", pattern) | ||||||
|  |  | ||||||
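The hunks above move every call from the old `add(key, on_match, *patterns)` form to `add(key, patterns, on_match=...)`, and `test_phrase_matcher_add_new_api` checks that both forms are still accepted. The sketch below shows one way such argument handling could work for token patterns. It is a standard-library illustration only, not spaCy's implementation: `normalize_add_args` and its error messages are hypothetical names introduced here, and the list-of-lists check applies to `Matcher`-style patterns rather than the `Doc` patterns used by `PhraseMatcher`.

```python
from typing import Any, Callable, List, Optional, Tuple


def normalize_add_args(
    key: str, patterns: Any, *extra: Any, on_match: Optional[Callable] = None
) -> Tuple[str, List[Any], Optional[Callable]]:
    """Hypothetical helper: map old- and new-style add() calls to one form."""
    # Old style: add(key, on_match, *patterns), where the second positional
    # argument is None or a callable and the patterns follow it.
    if patterns is None or callable(patterns):
        on_match, patterns = patterns, list(extra)
    elif extra:
        raise ValueError("extra positional patterns are only valid in the old style")
    # on_match can only be None or callable.
    if on_match is not None and not callable(on_match):
        raise ValueError("on_match must be None or callable")
    # New style: a list of patterns, each of which is itself a list of dicts.
    # Passing a single pattern fails here because its items are dicts.
    if not isinstance(patterns, list) or not all(isinstance(p, list) for p in patterns):
        raise ValueError("patterns must be a list of patterns, e.g. [pattern]")
    return key, patterns, on_match


pattern = [{"TEXT": "hello"}, {"TEXT": "world"}]
# Both call styles normalize to the same result.
assert normalize_add_args("TEST", None, pattern) == normalize_add_args("TEST", [pattern])
```

Under these assumptions, the mistake covered by `test_matcher_basic_check` (passing `pattern` where `[pattern]` is expected) raises `ValueError` instead of being read as several one-token patterns.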
							
								
								
									
										168
									
								
								spacy/tests/pipeline/test_analysis.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,168 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import spacy.language | ||||||
|  | from spacy.language import Language, component | ||||||
|  | from spacy.analysis import print_summary, validate_attrs | ||||||
|  | from spacy.analysis import get_assigns_for_attr, get_requires_for_attr | ||||||
|  | from spacy.compat import is_python2 | ||||||
|  | from mock import Mock, ANY | ||||||
|  | import pytest | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_component_decorator_function(): | ||||||
|  |     @component(name="test") | ||||||
|  |     def test_component(doc): | ||||||
|  |         """docstring""" | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     assert test_component.name == "test" | ||||||
|  |     if not is_python2: | ||||||
|  |         assert test_component.__doc__ == "docstring" | ||||||
|  |     assert test_component("foo") == "foo" | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_component_decorator_class(): | ||||||
|  |     @component(name="test") | ||||||
|  |     class TestComponent(object): | ||||||
|  |         """docstring1""" | ||||||
|  | 
 | ||||||
|  |         foo = "bar" | ||||||
|  | 
 | ||||||
|  |         def __call__(self, doc): | ||||||
|  |             """docstring2""" | ||||||
|  |             return doc | ||||||
|  | 
 | ||||||
|  |         def custom(self, x): | ||||||
|  |             """docstring3""" | ||||||
|  |             return x | ||||||
|  | 
 | ||||||
|  |     assert TestComponent.name == "test" | ||||||
|  |     assert TestComponent.foo == "bar" | ||||||
|  |     assert hasattr(TestComponent, "custom") | ||||||
|  |     test_component = TestComponent() | ||||||
|  |     assert test_component.foo == "bar" | ||||||
|  |     assert test_component("foo") == "foo" | ||||||
|  |     assert hasattr(test_component, "custom") | ||||||
|  |     assert test_component.custom("bar") == "bar" | ||||||
|  |     if not is_python2: | ||||||
|  |         assert TestComponent.__doc__ == "docstring1" | ||||||
|  |         assert TestComponent.__call__.__doc__ == "docstring2" | ||||||
|  |         assert TestComponent.custom.__doc__ == "docstring3" | ||||||
|  |         assert test_component.__doc__ == "docstring1" | ||||||
|  |         assert test_component.__call__.__doc__ == "docstring2" | ||||||
|  |         assert test_component.custom.__doc__ == "docstring3" | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_component_decorator_assigns(): | ||||||
|  |     spacy.language.ENABLE_PIPELINE_ANALYSIS = True | ||||||
|  | 
 | ||||||
|  |     @component("c1", assigns=["token.tag", "doc.tensor"]) | ||||||
|  |     def test_component1(doc): | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     @component( | ||||||
|  |         "c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"] | ||||||
|  |     ) | ||||||
|  |     def test_component2(doc): | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     @component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"]) | ||||||
|  |     def test_component3(doc): | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     assert "c1" in Language.factories | ||||||
|  |     assert "c2" in Language.factories | ||||||
|  |     assert "c3" in Language.factories | ||||||
|  | 
 | ||||||
|  |     nlp = Language() | ||||||
|  |     nlp.add_pipe(test_component1) | ||||||
|  |     with pytest.warns(UserWarning): | ||||||
|  |         nlp.add_pipe(test_component2) | ||||||
|  |     nlp.add_pipe(test_component3) | ||||||
|  |     assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") | ||||||
|  |     assert [name for name, _ in assigns_tensor] == ["c1", "c2"] | ||||||
|  |     test_component4 = nlp.create_pipe("c1") | ||||||
|  |     assert test_component4.name == "c1" | ||||||
|  |     assert test_component4.factory == "c1" | ||||||
|  |     nlp.add_pipe(test_component4, name="c4") | ||||||
|  |     assert nlp.pipe_names == ["c1", "c2", "c3", "c4"] | ||||||
|  |     assert "c4" not in Language.factories | ||||||
|  |     assert nlp.pipe_factories["c1"] == "c1" | ||||||
|  |     assert nlp.pipe_factories["c4"] == "c1" | ||||||
|  |     assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") | ||||||
|  |     assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"] | ||||||
|  |     requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos") | ||||||
|  |     assert [name for name, _ in requires_pos] == ["c2"] | ||||||
|  |     assert print_summary(nlp, no_print=True) | ||||||
|  |     assert nlp("hello world") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_component_factories_from_nlp(): | ||||||
|  |     """Test that class components can implement a from_nlp classmethod that | ||||||
|  |     gives them access to the nlp object and config via the factory.""" | ||||||
|  | 
 | ||||||
|  |     class TestComponent5(object): | ||||||
|  |         def __call__(self, doc): | ||||||
|  |             return doc | ||||||
|  | 
 | ||||||
|  |     mock = Mock() | ||||||
|  |     mock.return_value = TestComponent5() | ||||||
|  |     TestComponent5.from_nlp = classmethod(mock) | ||||||
|  |     TestComponent5 = component("c5")(TestComponent5) | ||||||
|  | 
 | ||||||
|  |     assert "c5" in Language.factories | ||||||
|  |     nlp = Language() | ||||||
|  |     pipe = nlp.create_pipe("c5", config={"foo": "bar"}) | ||||||
|  |     nlp.add_pipe(pipe) | ||||||
|  |     assert nlp("hello world") | ||||||
|  |     # The first argument here is the class itself, so we're accepting any here | ||||||
|  |     mock.assert_called_once_with(ANY, nlp, foo="bar") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_analysis_validate_attrs_valid(): | ||||||
|  |     attrs = ["doc.sents", "doc.ents", "token.tag", "token._.xyz", "span._.xyz"] | ||||||
|  |     assert validate_attrs(attrs) | ||||||
|  |     for attr in attrs: | ||||||
|  |         assert validate_attrs([attr]) | ||||||
|  |     with pytest.raises(ValueError): | ||||||
|  |         validate_attrs(["doc.sents", "doc.xyz"]) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.mark.parametrize( | ||||||
|  |     "attr", | ||||||
|  |     [ | ||||||
|  |         "doc", | ||||||
|  |         "doc_ents", | ||||||
|  |         "doc.xyz", | ||||||
|  |         "token.xyz", | ||||||
|  |         "token.tag_", | ||||||
|  |         "token.tag.xyz", | ||||||
|  |         "token._.xyz.abc", | ||||||
|  |         "span.label", | ||||||
|  |     ], | ||||||
|  | ) | ||||||
|  | def test_analysis_validate_attrs_invalid(attr): | ||||||
|  |     with pytest.raises(ValueError): | ||||||
|  |         validate_attrs([attr]) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_analysis_validate_attrs_remove_pipe(): | ||||||
|  |     """Test that attributes are validated correctly on remove.""" | ||||||
|  |     spacy.language.ENABLE_PIPELINE_ANALYSIS = True | ||||||
|  | 
 | ||||||
|  |     @component("c1", assigns=["token.tag"]) | ||||||
|  |     def c1(doc): | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     @component("c2", requires=["token.pos"]) | ||||||
|  |     def c2(doc): | ||||||
|  |         return doc | ||||||
|  | 
 | ||||||
|  |     nlp = Language() | ||||||
|  |     nlp.add_pipe(c1) | ||||||
|  |     with pytest.warns(UserWarning): | ||||||
|  |         nlp.add_pipe(c2) | ||||||
|  |     with pytest.warns(None) as record: | ||||||
|  |         nlp.remove_pipe("c2") | ||||||
|  |     assert not record.list | ||||||
|  | @ -154,7 +154,8 @@ def test_append_alias(nlp): | ||||||
|     assert len(mykb.get_candidates("douglas")) == 3 |     assert len(mykb.get_candidates("douglas")) == 3 | ||||||
| 
 | 
 | ||||||
|     # append the same alias-entity pair again should not work (will throw a warning) |     # append the same alias-entity pair again should not work (will throw a warning) | ||||||
|     mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) |     with pytest.warns(UserWarning): | ||||||
|  |         mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) | ||||||
| 
 | 
 | ||||||
|     # test the size of the relevant candidates remained unchanged |     # test the size of the relevant candidates remained unchanged | ||||||
|     assert len(mykb.get_candidates("douglas")) == 3 |     assert len(mykb.get_candidates("douglas")) == 3 | ||||||
|  |  | ||||||
							
								
								
									
										34
									
								
								spacy/tests/pipeline/test_functions.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,34 @@ | ||||||
|  | # coding: utf-8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import pytest | ||||||
|  | from spacy.pipeline.functions import merge_subtokens | ||||||
|  | from ..util import get_doc | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.fixture | ||||||
|  | def doc(en_tokenizer): | ||||||
|  |     # fmt: off | ||||||
|  |     text = "This is a sentence. This is another sentence. And a third." | ||||||
|  |     heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0] | ||||||
|  |     deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT", | ||||||
|  |             "subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"] | ||||||
|  |     # fmt: on | ||||||
|  |     tokens = en_tokenizer(text) | ||||||
|  |     return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_merge_subtokens(doc): | ||||||
|  |     doc = merge_subtokens(doc) | ||||||
|  |     # get_doc() doesn't set spaces, so the result is "And a third ." | ||||||
|  |     assert [t.text for t in doc] == [ | ||||||
|  |         "This", | ||||||
|  |         "is", | ||||||
|  |         "a sentence", | ||||||
|  |         ".", | ||||||
|  |         "This", | ||||||
|  |         "is", | ||||||
|  |         "another sentence", | ||||||
|  |         ".", | ||||||
|  |         "And a third .", | ||||||
|  |     ] | ||||||
|  | @ -105,6 +105,16 @@ def test_disable_pipes_context(nlp, name): | ||||||
|     assert nlp.has_pipe(name) |     assert nlp.has_pipe(name) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def test_disable_pipes_list_arg(nlp): | ||||||
|  |     for name in ["c1", "c2", "c3"]: | ||||||
|  |         nlp.add_pipe(new_pipe, name=name) | ||||||
|  |         assert nlp.has_pipe(name) | ||||||
|  |     with nlp.disable_pipes(["c1", "c2"]): | ||||||
|  |         assert not nlp.has_pipe("c1") | ||||||
|  |         assert not nlp.has_pipe("c2") | ||||||
|  |         assert nlp.has_pipe("c3") | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| @pytest.mark.parametrize("n_pipes", [100]) | @pytest.mark.parametrize("n_pipes", [100]) | ||||||
| def test_add_lots_of_pipes(nlp, n_pipes): | def test_add_lots_of_pipes(nlp, n_pipes): | ||||||
|     for i in range(n_pipes): |     for i in range(n_pipes): | ||||||
|  |  | ||||||
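The `test_disable_pipes_list_arg` hunk above calls `nlp.disable_pipes` with a single list of names, while older call sites pass the names as separate positional arguments. A minimal sketch of how one function can accept both calling conventions, in plain Python. `normalize_pipe_names` is a hypothetical helper written for illustration; it is not taken from this commit and may not match how spaCy handles this internally.

```python
def normalize_pipe_names(*names):
    """Return a flat list of names for either calling convention.

    Hypothetical helper (an assumption, not spaCy's code). Accepts
    normalize_pipe_names("c1", "c2") or normalize_pipe_names(["c1", "c2"]).
    """
    # If the only argument is itself a list or tuple, unwrap it so both
    # conventions produce the same flat sequence of names.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        names = names[0]
    return list(names)


# Both call styles yield the same list of component names.
varargs_style = normalize_pipe_names("c1", "c2")
list_style = normalize_pipe_names(["c1", "c2"])
```

Unwrapping only when there is exactly one list or tuple argument keeps the older varargs call sites working unchanged.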
|  | @ -30,7 +30,7 @@ def test_issue118(en_tokenizer, patterns): | ||||||
|     doc = en_tokenizer(text) |     doc = en_tokenizer(text) | ||||||
|     ORG = doc.vocab.strings["ORG"] |     ORG = doc.vocab.strings["ORG"] | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add("BostonCeltics", None, *patterns) |     matcher.add("BostonCeltics", patterns) | ||||||
|     assert len(list(doc.ents)) == 0 |     assert len(list(doc.ents)) == 0 | ||||||
|     matches = [(ORG, start, end) for _, start, end in matcher(doc)] |     matches = [(ORG, start, end) for _, start, end in matcher(doc)] | ||||||
|     assert matches == [(ORG, 9, 11), (ORG, 10, 11)] |     assert matches == [(ORG, 9, 11), (ORG, 10, 11)] | ||||||
|  | @ -57,7 +57,7 @@ def test_issue118_prefix_reorder(en_tokenizer, patterns): | ||||||
|     doc = en_tokenizer(text) |     doc = en_tokenizer(text) | ||||||
|     ORG = doc.vocab.strings["ORG"] |     ORG = doc.vocab.strings["ORG"] | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add("BostonCeltics", None, *patterns) |     matcher.add("BostonCeltics", patterns) | ||||||
|     assert len(list(doc.ents)) == 0 |     assert len(list(doc.ents)) == 0 | ||||||
|     matches = [(ORG, start, end) for _, start, end in matcher(doc)] |     matches = [(ORG, start, end) for _, start, end in matcher(doc)] | ||||||
|     doc.ents += tuple(matches)[1:] |     doc.ents += tuple(matches)[1:] | ||||||
|  | @ -78,7 +78,7 @@ def test_issue242(en_tokenizer): | ||||||
|     ] |     ] | ||||||
|     doc = en_tokenizer(text) |     doc = en_tokenizer(text) | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add("FOOD", None, *patterns) |     matcher.add("FOOD", patterns) | ||||||
|     matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] |     matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] | ||||||
|     match1, match2 = matches |     match1, match2 = matches | ||||||
|     assert match1[1] == 3 |     assert match1[1] == 3 | ||||||
|  | @ -127,17 +127,13 @@ def test_issue587(en_tokenizer): | ||||||
|     """Test that Matcher doesn't segfault on particular input""" |     """Test that Matcher doesn't segfault on particular input""" | ||||||
|     doc = en_tokenizer("a b; c") |     doc = en_tokenizer("a b; c") | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add("TEST1", None, [{ORTH: "a"}, {ORTH: "b"}]) |     matcher.add("TEST1", [[{ORTH: "a"}, {ORTH: "b"}]]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|     matcher.add( |     matcher.add("TEST2", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]]) | ||||||
|         "TEST2", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}] |  | ||||||
|     ) |  | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|     matcher.add( |     matcher.add("TEST3", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]]) | ||||||
|         "TEST3", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}] |  | ||||||
|     ) |  | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
| 
 | 
 | ||||||
|  | @ -145,7 +141,7 @@ def test_issue587(en_tokenizer): | ||||||
| def test_issue588(en_vocab): | def test_issue588(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|         matcher.add("TEST", None, []) |         matcher.add("TEST", [[]]) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| @pytest.mark.xfail | @pytest.mark.xfail | ||||||
|  | @ -161,11 +157,9 @@ def test_issue590(en_vocab): | ||||||
|     doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"]) |     doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"]) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add( |     matcher.add( | ||||||
|         "ab", |         "ab", [[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]] | ||||||
|         None, |  | ||||||
|         [{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}], |  | ||||||
|     ) |     ) | ||||||
|     matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]) |     matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
| 
 | 
 | ||||||
|  | @ -221,7 +215,7 @@ def test_issue615(en_tokenizer): | ||||||
|     label = "Sport_Equipment" |     label = "Sport_Equipment" | ||||||
|     doc = en_tokenizer(text) |     doc = en_tokenizer(text) | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add(label, merge_phrases, pattern) |     matcher.add(label, [pattern], on_match=merge_phrases) | ||||||
|     matcher(doc) |     matcher(doc) | ||||||
|     entities = list(doc.ents) |     entities = list(doc.ents) | ||||||
|     assert entities != [] |     assert entities != [] | ||||||
|  | @ -339,7 +333,7 @@ def test_issue850(): | ||||||
|     vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) |     vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) | ||||||
|     matcher = Matcher(vocab) |     matcher = Matcher(vocab) | ||||||
|     pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}] |     pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}] | ||||||
|     matcher.add("FarAway", None, pattern) |     matcher.add("FarAway", [pattern]) | ||||||
|     doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) |     doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) | ||||||
|     match = matcher(doc) |     match = matcher(doc) | ||||||
|     assert len(match) == 1 |     assert len(match) == 1 | ||||||
|  | @ -353,7 +347,7 @@ def test_issue850_basic(): | ||||||
|     vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) |     vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) | ||||||
|     matcher = Matcher(vocab) |     matcher = Matcher(vocab) | ||||||
|     pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}] |     pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}] | ||||||
|     matcher.add("FarAway", None, pattern) |     matcher.add("FarAway", [pattern]) | ||||||
|     doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) |     doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) | ||||||
|     match = matcher(doc) |     match = matcher(doc) | ||||||
|     assert len(match) == 1 |     assert len(match) == 1 | ||||||
|  |  | ||||||
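The hunks above all make the same mechanical change: `matcher.add(key, on_match, *patterns)` becomes `matcher.add(key, patterns, on_match=on_match)`, with the patterns passed as one list and the callback as an optional keyword argument. A minimal sketch of that argument conversion in plain Python. `to_new_add_args` is a hypothetical helper for illustration only; it does not call spaCy and is not part of this commit.

```python
def to_new_add_args(key, on_match, *patterns):
    """Convert old-style add() arguments to the new-style layout.

    Hypothetical helper (an assumption for illustration). Returns a pair
    (args, kwargs) that could be splatted into add(key, patterns, on_match=...).
    """
    # New style: every pattern goes into a single list as the second argument.
    args = (key, list(patterns))
    # The callback becomes a keyword argument and is omitted when it was None.
    kwargs = {} if on_match is None else {"on_match": on_match}
    return args, kwargs


# Old: matcher.add("FarAway", None, pattern)
# New: matcher.add("FarAway", [pattern])
pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}]
args, kwargs = to_new_add_args("FarAway", None, pattern)
```

A single pattern ends up wrapped in an outer list, which is why the updated tests contain doubled brackets such as `[[{"orth": "hello"}]]`.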
|  | @ -111,7 +111,7 @@ def test_issue1434(): | ||||||
|     hello_world = Doc(vocab, words=["Hello", "World"]) |     hello_world = Doc(vocab, words=["Hello", "World"]) | ||||||
|     hello = Doc(vocab, words=["Hello"]) |     hello = Doc(vocab, words=["Hello"]) | ||||||
|     matcher = Matcher(vocab) |     matcher = Matcher(vocab) | ||||||
|     matcher.add("MyMatcher", None, pattern) |     matcher.add("MyMatcher", [pattern]) | ||||||
|     matches = matcher(hello_world) |     matches = matcher(hello_world) | ||||||
|     assert matches |     assert matches | ||||||
|     matches = matcher(hello) |     matches = matcher(hello) | ||||||
|  | @ -133,7 +133,7 @@ def test_issue1450(string, start, end): | ||||||
|     """Test matcher works when patterns end with * operator.""" |     """Test matcher works when patterns end with * operator.""" | ||||||
|     pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] |     pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] | ||||||
|     matcher = Matcher(Vocab()) |     matcher = Matcher(Vocab()) | ||||||
|     matcher.add("TSTEND", None, pattern) |     matcher.add("TSTEND", [pattern]) | ||||||
|     doc = Doc(Vocab(), words=string.split()) |     doc = Doc(Vocab(), words=string.split()) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     if start is None or end is None: |     if start is None or end is None: | ||||||
|  |  | ||||||
|  | @ -224,7 +224,7 @@ def test_issue1868(): | ||||||
| 
 | 
 | ||||||
| def test_issue1883(): | def test_issue1883(): | ||||||
|     matcher = Matcher(Vocab()) |     matcher = Matcher(Vocab()) | ||||||
|     matcher.add("pat1", None, [{"orth": "hello"}]) |     matcher.add("pat1", [[{"orth": "hello"}]]) | ||||||
|     doc = Doc(matcher.vocab, words=["hello"]) |     doc = Doc(matcher.vocab, words=["hello"]) | ||||||
|     assert len(matcher(doc)) == 1 |     assert len(matcher(doc)) == 1 | ||||||
|     new_matcher = copy.deepcopy(matcher) |     new_matcher = copy.deepcopy(matcher) | ||||||
|  | @ -249,7 +249,7 @@ def test_issue1915(): | ||||||
| def test_issue1945(): | def test_issue1945(): | ||||||
|     """Test regression in Matcher introduced in v2.0.6.""" |     """Test regression in Matcher introduced in v2.0.6.""" | ||||||
|     matcher = Matcher(Vocab()) |     matcher = Matcher(Vocab()) | ||||||
|     matcher.add("MWE", None, [{"orth": "a"}, {"orth": "a"}]) |     matcher.add("MWE", [[{"orth": "a"}, {"orth": "a"}]]) | ||||||
|     doc = Doc(matcher.vocab, words=["a", "a", "a"]) |     doc = Doc(matcher.vocab, words=["a", "a", "a"]) | ||||||
|     matches = matcher(doc)  # we should see two overlapping matches here |     matches = matcher(doc)  # we should see two overlapping matches here | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -285,7 +285,7 @@ def test_issue1971(en_vocab): | ||||||
|         {"ORTH": "!", "OP": "?"}, |         {"ORTH": "!", "OP": "?"}, | ||||||
|     ] |     ] | ||||||
|     Token.set_extension("optional", default=False) |     Token.set_extension("optional", default=False) | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"]) |     doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"]) | ||||||
|     # We could also assert length 1 here, but this is more conclusive, because |     # We could also assert length 1 here, but this is more conclusive, because | ||||||
|     # the real problem here is that it returns a duplicate match for a match_id |     # the real problem here is that it returns a duplicate match for a match_id | ||||||
|  | @ -299,7 +299,7 @@ def test_issue_1971_2(en_vocab): | ||||||
|     pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}] |     pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}] | ||||||
|     pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}]  # {"IN": ["EUR"]}}] |     pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}]  # {"IN": ["EUR"]}}] | ||||||
|     doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"]) |     doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"]) | ||||||
|     matcher.add("TEST1", None, pattern1, pattern2) |     matcher.add("TEST1", [pattern1, pattern2]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
| 
 | 
 | ||||||
|  | @ -310,8 +310,8 @@ def test_issue_1971_3(en_vocab): | ||||||
|     Token.set_extension("b", default=2, force=True) |     Token.set_extension("b", default=2, force=True) | ||||||
|     doc = Doc(en_vocab, words=["hello", "world"]) |     doc = Doc(en_vocab, words=["hello", "world"]) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("A", None, [{"_": {"a": 1}}]) |     matcher.add("A", [[{"_": {"a": 1}}]]) | ||||||
|     matcher.add("B", None, [{"_": {"b": 2}}]) |     matcher.add("B", [[{"_": {"b": 2}}]]) | ||||||
|     matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc)) |     matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc)) | ||||||
|     assert len(matches) == 4 |     assert len(matches) == 4 | ||||||
|     assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)]) |     assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)]) | ||||||
|  | @ -326,7 +326,7 @@ def test_issue_1971_4(en_vocab): | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     doc = Doc(en_vocab, words=["this", "is", "text"]) |     doc = Doc(en_vocab, words=["this", "is", "text"]) | ||||||
|     pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3 |     pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3 | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     # Uncommenting this caused a segmentation fault |     # Uncommenting this caused a segmentation fault | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|  |  | ||||||
|  | @ -128,7 +128,7 @@ def test_issue2464(en_vocab): | ||||||
|     """Test problem with successive ?. This is the same bug, so putting it here.""" |     """Test problem with successive ?. This is the same bug, so putting it here.""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     doc = Doc(en_vocab, words=["a", "b"]) |     doc = Doc(en_vocab, words=["a", "b"]) | ||||||
|     matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}]) |     matcher.add("4", [[{"OP": "?"}, {"OP": "?"}]]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 3 |     assert len(matches) == 3 | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -37,7 +37,7 @@ def test_issue2569(en_tokenizer): | ||||||
|     doc = en_tokenizer("It is May 15, 1993.") |     doc = en_tokenizer("It is May 15, 1993.") | ||||||
|     doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])] |     doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])] | ||||||
|     matcher = Matcher(doc.vocab) |     matcher = Matcher(doc.vocab) | ||||||
|     matcher.add("RULE", None, [{"ENT_TYPE": "DATE", "OP": "+"}]) |     matcher.add("RULE", [[{"ENT_TYPE": "DATE", "OP": "+"}]]) | ||||||
|     matched = [doc[start:end] for _, start, end in matcher(doc)] |     matched = [doc[start:end] for _, start, end in matcher(doc)] | ||||||
|     matched = sorted(matched, key=len, reverse=True) |     matched = sorted(matched, key=len, reverse=True) | ||||||
|     assert len(matched) == 10 |     assert len(matched) == 10 | ||||||
|  | @ -89,7 +89,7 @@ def test_issue2671(): | ||||||
|         {"IS_PUNCT": True, "OP": "?"}, |         {"IS_PUNCT": True, "OP": "?"}, | ||||||
|         {"LOWER": "adrenaline"}, |         {"LOWER": "adrenaline"}, | ||||||
|     ] |     ] | ||||||
|     matcher.add(pattern_id, None, pattern) |     matcher.add(pattern_id, [pattern]) | ||||||
|     doc1 = nlp("This is a high-adrenaline situation.") |     doc1 = nlp("This is a high-adrenaline situation.") | ||||||
|     doc2 = nlp("This is a high adrenaline situation.") |     doc2 = nlp("This is a high adrenaline situation.") | ||||||
|     matches1 = matcher(doc1) |     matches1 = matcher(doc1) | ||||||
|  |  | ||||||
|  | @ -52,7 +52,7 @@ def test_issue3009(en_vocab): | ||||||
|     doc = get_doc(en_vocab, words=words, tags=tags) |     doc = get_doc(en_vocab, words=words, tags=tags) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     for i, pattern in enumerate(patterns): |     for i, pattern in enumerate(patterns): | ||||||
|         matcher.add(str(i), None, pattern) |         matcher.add(str(i), [pattern]) | ||||||
|         matches = matcher(doc) |         matches = matcher(doc) | ||||||
|         assert matches |         assert matches | ||||||
| 
 | 
 | ||||||
|  | @ -116,8 +116,8 @@ def test_issue3248_1(): | ||||||
|     total number of patterns.""" |     total number of patterns.""" | ||||||
|     nlp = English() |     nlp = English() | ||||||
|     matcher = PhraseMatcher(nlp.vocab) |     matcher = PhraseMatcher(nlp.vocab) | ||||||
|     matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) |     matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")]) | ||||||
|     matcher.add("TEST2", None, nlp("d")) |     matcher.add("TEST2", [nlp("d")]) | ||||||
|     assert len(matcher) == 2 |     assert len(matcher) == 2 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -125,8 +125,8 @@ def test_issue3248_2(): | ||||||
|     """Test that the PhraseMatcher can be pickled correctly.""" |     """Test that the PhraseMatcher can be pickled correctly.""" | ||||||
|     nlp = English() |     nlp = English() | ||||||
|     matcher = PhraseMatcher(nlp.vocab) |     matcher = PhraseMatcher(nlp.vocab) | ||||||
|     matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) |     matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")]) | ||||||
|     matcher.add("TEST2", None, nlp("d")) |     matcher.add("TEST2", [nlp("d")]) | ||||||
|     data = pickle.dumps(matcher) |     data = pickle.dumps(matcher) | ||||||
|     new_matcher = pickle.loads(data) |     new_matcher = pickle.loads(data) | ||||||
|     assert len(new_matcher) == len(matcher) |     assert len(new_matcher) == len(matcher) | ||||||
|  | @ -170,7 +170,7 @@ def test_issue3328(en_vocab): | ||||||
|         [{"LOWER": {"IN": ["hello", "how"]}}], |         [{"LOWER": {"IN": ["hello", "how"]}}], | ||||||
|         [{"LOWER": {"IN": ["you", "doing"]}}], |         [{"LOWER": {"IN": ["you", "doing"]}}], | ||||||
|     ] |     ] | ||||||
|     matcher.add("TEST", None, *patterns) |     matcher.add("TEST", patterns) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 4 |     assert len(matches) == 4 | ||||||
|     matched_texts = [doc[start:end].text for _, start, end in matches] |     matched_texts = [doc[start:end].text for _, start, end in matches] | ||||||
|  | @ -183,8 +183,8 @@ def test_issue3331(en_vocab): | ||||||
|     matches, one per rule. |     matches, one per rule. | ||||||
|     """ |     """ | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"])) |     matcher.add("A", [Doc(en_vocab, words=["Barack", "Obama"])]) | ||||||
|     matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"])) |     matcher.add("B", [Doc(en_vocab, words=["Barack", "Obama"])]) | ||||||
|     doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) |     doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 2 |     assert len(matches) == 2 | ||||||
|  | @ -297,8 +297,10 @@ def test_issue3410(): | ||||||
| def test_issue3412(): | def test_issue3412(): | ||||||
|     data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") |     data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") | ||||||
|     vectors = Vectors(data=data) |     vectors = Vectors(data=data) | ||||||
|     keys, best_rows, scores = vectors.most_similar(numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")) |     keys, best_rows, scores = vectors.most_similar( | ||||||
|     assert(best_rows[0] == 2) |         numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f") | ||||||
|  |     ) | ||||||
|  |     assert best_rows[0] == 2 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_issue3447(): | def test_issue3447(): | ||||||
|  |  | ||||||
|  | @ -10,6 +10,6 @@ def test_issue3549(en_vocab): | ||||||
|     """Test that match pattern validation doesn't raise on empty errors.""" |     """Test that match pattern validation doesn't raise on empty errors.""" | ||||||
|     matcher = Matcher(en_vocab, validate=True) |     matcher = Matcher(en_vocab, validate=True) | ||||||
|     pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] |     pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] | ||||||
|     matcher.add("GOOD", None, pattern) |     matcher.add("GOOD", [pattern]) | ||||||
|     with pytest.raises(MatchPatternError): |     with pytest.raises(MatchPatternError): | ||||||
|         matcher.add("BAD", None, [{"X": "Y"}]) |         matcher.add("BAD", [[{"X": "Y"}]]) | ||||||
|  |  | ||||||
|  | @ -12,6 +12,6 @@ def test_issue3555(en_vocab): | ||||||
|     Token.set_extension("issue3555", default=None) |     Token.set_extension("issue3555", default=None) | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}] |     pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}] | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["have", "apple"]) |     doc = Doc(en_vocab, words=["have", "apple"]) | ||||||
|     matcher(doc) |     matcher(doc) | ||||||
|  |  | ||||||
|  | @ -34,8 +34,7 @@ def test_issue3611(): | ||||||
|     nlp.add_pipe(textcat, last=True) |     nlp.add_pipe(textcat, last=True) | ||||||
| 
 | 
 | ||||||
|     # training the network |     # training the network | ||||||
|     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] |     with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): | ||||||
|     with nlp.disable_pipes(*other_pipes): |  | ||||||
|         optimizer = nlp.begin_training() |         optimizer = nlp.begin_training() | ||||||
|         for i in range(3): |         for i in range(3): | ||||||
|             losses = {} |             losses = {} | ||||||
|  |  | ||||||
|  | @ -12,10 +12,10 @@ def test_issue3839(en_vocab): | ||||||
|     match_id = "PATTERN" |     match_id = "PATTERN" | ||||||
|     pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] |     pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] | ||||||
|     pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] |     pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] | ||||||
|     matcher.add(match_id, None, pattern1) |     matcher.add(match_id, [pattern1]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert matches[0][0] == en_vocab.strings[match_id] |     assert matches[0][0] == en_vocab.strings[match_id] | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add(match_id, None, pattern2) |     matcher.add(match_id, [pattern2]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert matches[0][0] == en_vocab.strings[match_id] |     assert matches[0][0] == en_vocab.strings[match_id] | ||||||
|  |  | ||||||
|  | @ -10,5 +10,5 @@ def test_issue3879(en_vocab): | ||||||
|     assert len(doc) == 5 |     assert len(doc) == 5 | ||||||
|     pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] |     pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     assert len(matcher(doc)) == 2  # fails because of a FP match 'is a test' |     assert len(matcher(doc)) == 2  # fails because of a FP match 'is a test' | ||||||
|  |  | ||||||
|  | @ -14,7 +14,7 @@ def test_issue3951(en_vocab): | ||||||
|         {"OP": "?"}, |         {"OP": "?"}, | ||||||
|         {"LOWER": "world"}, |         {"LOWER": "world"}, | ||||||
|     ] |     ] | ||||||
|     matcher.add("TEST", None, pattern) |     matcher.add("TEST", [pattern]) | ||||||
|     doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) |     doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 0 |     assert len(matches) == 0 | ||||||
|  |  | ||||||
|  | @ -9,8 +9,8 @@ def test_issue3972(en_vocab): | ||||||
|     """Test that the PhraseMatcher returns duplicates for duplicate match IDs. |     """Test that the PhraseMatcher returns duplicates for duplicate match IDs. | ||||||
|     """ |     """ | ||||||
|     matcher = PhraseMatcher(en_vocab) |     matcher = PhraseMatcher(en_vocab) | ||||||
|     matcher.add("A", None, Doc(en_vocab, words=["New", "York"])) |     matcher.add("A", [Doc(en_vocab, words=["New", "York"])]) | ||||||
|     matcher.add("B", None, Doc(en_vocab, words=["New", "York"])) |     matcher.add("B", [Doc(en_vocab, words=["New", "York"])]) | ||||||
|     doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) |     doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -11,7 +11,7 @@ def test_issue4002(en_vocab): | ||||||
|     matcher = PhraseMatcher(en_vocab, attr="NORM") |     matcher = PhraseMatcher(en_vocab, attr="NORM") | ||||||
|     pattern1 = Doc(en_vocab, words=["c", "d"]) |     pattern1 = Doc(en_vocab, words=["c", "d"]) | ||||||
|     assert [t.norm_ for t in pattern1] == ["c", "d"] |     assert [t.norm_ for t in pattern1] == ["c", "d"] | ||||||
|     matcher.add("TEST", None, pattern1) |     matcher.add("TEST", [pattern1]) | ||||||
|     doc = Doc(en_vocab, words=["a", "b", "c", "d"]) |     doc = Doc(en_vocab, words=["a", "b", "c", "d"]) | ||||||
|     assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] |     assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|  | @ -21,6 +21,6 @@ def test_issue4002(en_vocab): | ||||||
|     pattern2[0].norm_ = "c" |     pattern2[0].norm_ = "c" | ||||||
|     pattern2[1].norm_ = "d" |     pattern2[1].norm_ = "d" | ||||||
|     assert [t.norm_ for t in pattern2] == ["c", "d"] |     assert [t.norm_ for t in pattern2] == ["c", "d"] | ||||||
|     matcher.add("TEST", None, pattern2) |     matcher.add("TEST", [pattern2]) | ||||||
|     matches = matcher(doc) |     matches = matcher(doc) | ||||||
|     assert len(matches) == 1 |     assert len(matches) == 1 | ||||||
|  |  | ||||||
|  | @ -34,8 +34,7 @@ def test_issue4030(): | ||||||
|     nlp.add_pipe(textcat, last=True) |     nlp.add_pipe(textcat, last=True) | ||||||
| 
 | 
 | ||||||
|     # training the network |     # training the network | ||||||
|     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] |     with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): | ||||||
|     with nlp.disable_pipes(*other_pipes): |  | ||||||
|         optimizer = nlp.begin_training() |         optimizer = nlp.begin_training() | ||||||
|         for i in range(3): |         for i in range(3): | ||||||
|             losses = {} |             losses = {} | ||||||
|  |  | ||||||
|  | @ -8,7 +8,7 @@ from spacy.tokens import Doc | ||||||
| def test_issue4120(en_vocab): | def test_issue4120(en_vocab): | ||||||
|     """Test that matches without a final {OP: ?} token are returned.""" |     """Test that matches without a final {OP: ?} token are returned.""" | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}]) |     matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]]) | ||||||
|     doc1 = Doc(en_vocab, words=["a"]) |     doc1 = Doc(en_vocab, words=["a"]) | ||||||
|     assert len(matcher(doc1)) == 1  # works |     assert len(matcher(doc1)) == 1  # works | ||||||
| 
 | 
 | ||||||
|  | @ -16,11 +16,11 @@ def test_issue4120(en_vocab): | ||||||
|     assert len(matcher(doc2)) == 2  # fixed |     assert len(matcher(doc2)) == 2  # fixed | ||||||
| 
 | 
 | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]) |     matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]]) | ||||||
|     doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) |     doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) | ||||||
|     assert len(matcher(doc3)) == 2  # works |     assert len(matcher(doc3)) == 2  # works | ||||||
| 
 | 
 | ||||||
|     matcher = Matcher(en_vocab) |     matcher = Matcher(en_vocab) | ||||||
|     matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]) |     matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]]) | ||||||
|     doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) |     doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) | ||||||
|     assert len(matcher(doc4)) == 3  # fixed |     assert len(matcher(doc4)) == 3  # fixed | ||||||
|  |  | ||||||
							
								
								
									
										96
									
								
								spacy/tests/regression/test_issue4402.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,96 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import srsly | ||||||
|  | from spacy.gold import GoldCorpus | ||||||
|  | 
 | ||||||
|  | from spacy.lang.en import English | ||||||
|  | from spacy.tests.util import make_tempdir | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_issue4402(): | ||||||
|  |     nlp = English() | ||||||
|  |     with make_tempdir() as tmpdir: | ||||||
|  |         print("temp", tmpdir) | ||||||
|  |         json_path = tmpdir / "test4402.json" | ||||||
|  |         srsly.write_json(json_path, json_data) | ||||||
|  | 
 | ||||||
|  |         corpus = GoldCorpus(str(json_path), str(json_path)) | ||||||
|  | 
 | ||||||
|  |         train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0)) | ||||||
|  |         # assert that the data got split into 4 sentences | ||||||
|  |         assert len(train_docs) == 4 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | json_data = [ | ||||||
|  |     { | ||||||
|  |         "id": 0, | ||||||
|  |         "paragraphs": [ | ||||||
|  |             { | ||||||
|  |                 "raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.", | ||||||
|  |                 "sentences": [ | ||||||
|  |                     { | ||||||
|  |                         "tokens": [ | ||||||
|  |                             {"id": 0, "orth": "How", "ner": "O"}, | ||||||
|  |                             {"id": 1, "orth": "should", "ner": "O"}, | ||||||
|  |                             {"id": 2, "orth": "I", "ner": "O"}, | ||||||
|  |                             {"id": 3, "orth": "cook", "ner": "O"}, | ||||||
|  |                             {"id": 4, "orth": "bacon", "ner": "O"}, | ||||||
|  |                             {"id": 5, "orth": "in", "ner": "O"}, | ||||||
|  |                             {"id": 6, "orth": "an", "ner": "O"}, | ||||||
|  |                             {"id": 7, "orth": "oven", "ner": "O"}, | ||||||
|  |                             {"id": 8, "orth": "?", "ner": "O"}, | ||||||
|  |                         ], | ||||||
|  |                         "brackets": [], | ||||||
|  |                     }, | ||||||
|  |                     { | ||||||
|  |                         "tokens": [ | ||||||
|  |                             {"id": 9, "orth": "\n", "ner": "O"}, | ||||||
|  |                             {"id": 10, "orth": "I", "ner": "O"}, | ||||||
|  |                             {"id": 11, "orth": "'ve", "ner": "O"}, | ||||||
|  |                             {"id": 12, "orth": "heard", "ner": "O"}, | ||||||
|  |                             {"id": 13, "orth": "of", "ner": "O"}, | ||||||
|  |                             {"id": 14, "orth": "people", "ner": "O"}, | ||||||
|  |                             {"id": 15, "orth": "cooking", "ner": "O"}, | ||||||
|  |                             {"id": 16, "orth": "bacon", "ner": "O"}, | ||||||
|  |                             {"id": 17, "orth": "in", "ner": "O"}, | ||||||
|  |                             {"id": 18, "orth": "an", "ner": "O"}, | ||||||
|  |                             {"id": 19, "orth": "oven", "ner": "O"}, | ||||||
|  |                             {"id": 20, "orth": ".", "ner": "O"}, | ||||||
|  |                         ], | ||||||
|  |                         "brackets": [], | ||||||
|  |                     }, | ||||||
|  |                 ], | ||||||
|  |                 "cats": [ | ||||||
|  |                     {"label": "baking", "value": 1.0}, | ||||||
|  |                     {"label": "not_baking", "value": 0.0}, | ||||||
|  |                 ], | ||||||
|  |             }, | ||||||
|  |             { | ||||||
|  |                 "raw": "What is the difference between white and brown eggs?\n", | ||||||
|  |                 "sentences": [ | ||||||
|  |                     { | ||||||
|  |                         "tokens": [ | ||||||
|  |                             {"id": 0, "orth": "What", "ner": "O"}, | ||||||
|  |                             {"id": 1, "orth": "is", "ner": "O"}, | ||||||
|  |                             {"id": 2, "orth": "the", "ner": "O"}, | ||||||
|  |                             {"id": 3, "orth": "difference", "ner": "O"}, | ||||||
|  |                             {"id": 4, "orth": "between", "ner": "O"}, | ||||||
|  |                             {"id": 5, "orth": "white", "ner": "O"}, | ||||||
|  |                             {"id": 6, "orth": "and", "ner": "O"}, | ||||||
|  |                             {"id": 7, "orth": "brown", "ner": "O"}, | ||||||
|  |                             {"id": 8, "orth": "eggs", "ner": "O"}, | ||||||
|  |                             {"id": 9, "orth": "?", "ner": "O"}, | ||||||
|  |                         ], | ||||||
|  |                         "brackets": [], | ||||||
|  |                     }, | ||||||
|  |                     {"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []}, | ||||||
|  |                 ], | ||||||
|  |                 "cats": [ | ||||||
|  |                     {"label": "baking", "value": 0.0}, | ||||||
|  |                     {"label": "not_baking", "value": 1.0}, | ||||||
|  |                 ], | ||||||
|  |             }, | ||||||
|  |         ], | ||||||
|  |     } | ||||||
|  | ] | ||||||
							
								
								
									
										19
									
								
								spacy/tests/regression/test_issue4528.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,19 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | from spacy.tokens import Doc, DocBin | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_issue4528(en_vocab): | ||||||
|  |     """Test that user_data is correctly serialized in DocBin.""" | ||||||
|  |     doc = Doc(en_vocab, words=["hello", "world"]) | ||||||
|  |     doc.user_data["foo"] = "bar" | ||||||
|  |     # This is how extension attribute values are stored in the user data | ||||||
|  |     doc.user_data[("._.", "foo", None, None)] = "bar" | ||||||
|  |     doc_bin = DocBin(store_user_data=True) | ||||||
|  |     doc_bin.add(doc) | ||||||
|  |     doc_bin_bytes = doc_bin.to_bytes() | ||||||
|  |     new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes) | ||||||
|  |     new_doc = list(new_doc_bin.get_docs(en_vocab))[0] | ||||||
|  |     assert new_doc.user_data["foo"] == "bar" | ||||||
|  |     assert new_doc.user_data[("._.", "foo", None, None)] == "bar" | ||||||
							
								
								
									
										13
									
								
								spacy/tests/regression/test_issue4529.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,13 @@ | ||||||
|  | # coding: utf8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import pytest | ||||||
|  | from spacy.gold import GoldParse | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.mark.parametrize( | ||||||
|  |     "text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])] | ||||||
|  | ) | ||||||
|  | def test_gold_misaligned(en_tokenizer, text, words): | ||||||
|  |     doc = en_tokenizer(text) | ||||||
|  |     GoldParse(doc, words=words) | ||||||
|  | @ -3,7 +3,7 @@ from __future__ import unicode_literals | ||||||
| 
 | 
 | ||||||
| from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags | from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags | ||||||
| from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo | from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo | ||||||
| from spacy.gold import GoldCorpus, docs_to_json | from spacy.gold import GoldCorpus, docs_to_json, align | ||||||
| from spacy.lang.en import English | from spacy.lang.en import English | ||||||
| from spacy.tokens import Doc | from spacy.tokens import Doc | ||||||
| from .util import make_tempdir | from .util import make_tempdir | ||||||
|  | @ -90,7 +90,7 @@ def test_gold_ner_missing_tags(en_tokenizer): | ||||||
| def test_iob_to_biluo(): | def test_iob_to_biluo(): | ||||||
|     good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] |     good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] | ||||||
|     good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] |     good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] | ||||||
|     bad_iob = ["O", "O", "\"", "B-LOC", "I-LOC"] |     bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"] | ||||||
|     converted_biluo = iob_to_biluo(good_iob) |     converted_biluo = iob_to_biluo(good_iob) | ||||||
|     assert good_biluo == converted_biluo |     assert good_biluo == converted_biluo | ||||||
|     with pytest.raises(ValueError): |     with pytest.raises(ValueError): | ||||||
|  | @ -99,14 +99,23 @@ def test_iob_to_biluo(): | ||||||
| 
 | 
 | ||||||
| def test_roundtrip_docs_to_json(): | def test_roundtrip_docs_to_json(): | ||||||
|     text = "I flew to Silicon Valley via London." |     text = "I flew to Silicon Valley via London." | ||||||
|  |     tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] | ||||||
|  |     heads = [1, 1, 1, 4, 2, 1, 5, 1] | ||||||
|  |     deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] | ||||||
|  |     biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] | ||||||
|     cats = {"TRAVEL": 1.0, "BAKING": 0.0} |     cats = {"TRAVEL": 1.0, "BAKING": 0.0} | ||||||
|     nlp = English() |     nlp = English() | ||||||
|     doc = nlp(text) |     doc = nlp(text) | ||||||
|  |     for i in range(len(tags)): | ||||||
|  |         doc[i].tag_ = tags[i] | ||||||
|  |         doc[i].dep_ = deps[i] | ||||||
|  |         doc[i].head = doc[heads[i]] | ||||||
|  |     doc.ents = spans_from_biluo_tags(doc, biluo_tags) | ||||||
|     doc.cats = cats |     doc.cats = cats | ||||||
|     doc[0].is_sent_start = True |     doc.is_tagged = True | ||||||
|     for i in range(1, len(doc)): |     doc.is_parsed = True | ||||||
|         doc[i].is_sent_start = False |  | ||||||
| 
 | 
 | ||||||
|  |     # roundtrip to JSON | ||||||
|     with make_tempdir() as tmpdir: |     with make_tempdir() as tmpdir: | ||||||
|         json_file = tmpdir / "roundtrip.json" |         json_file = tmpdir / "roundtrip.json" | ||||||
|         srsly.write_json(json_file, [docs_to_json(doc)]) |         srsly.write_json(json_file, [docs_to_json(doc)]) | ||||||
|  | @ -116,7 +125,95 @@ def test_roundtrip_docs_to_json(): | ||||||
| 
 | 
 | ||||||
|     assert len(doc) == goldcorpus.count_train() |     assert len(doc) == goldcorpus.count_train() | ||||||
|     assert text == reloaded_doc.text |     assert text == reloaded_doc.text | ||||||
|  |     assert tags == goldparse.tags | ||||||
|  |     assert deps == goldparse.labels | ||||||
|  |     assert heads == goldparse.heads | ||||||
|  |     assert biluo_tags == goldparse.ner | ||||||
|     assert "TRAVEL" in goldparse.cats |     assert "TRAVEL" in goldparse.cats | ||||||
|     assert "BAKING" in goldparse.cats |     assert "BAKING" in goldparse.cats | ||||||
|     assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] |     assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] | ||||||
|     assert cats["BAKING"] == goldparse.cats["BAKING"] |     assert cats["BAKING"] == goldparse.cats["BAKING"] | ||||||
|  | 
 | ||||||
|  |     # roundtrip to JSONL train dicts | ||||||
|  |     with make_tempdir() as tmpdir: | ||||||
|  |         jsonl_file = tmpdir / "roundtrip.jsonl" | ||||||
|  |         srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) | ||||||
|  |         goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) | ||||||
|  | 
 | ||||||
|  |     reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) | ||||||
|  | 
 | ||||||
|  |     assert len(doc) == goldcorpus.count_train() | ||||||
|  |     assert text == reloaded_doc.text | ||||||
|  |     assert tags == goldparse.tags | ||||||
|  |     assert deps == goldparse.labels | ||||||
|  |     assert heads == goldparse.heads | ||||||
|  |     assert biluo_tags == goldparse.ner | ||||||
|  |     assert "TRAVEL" in goldparse.cats | ||||||
|  |     assert "BAKING" in goldparse.cats | ||||||
|  |     assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] | ||||||
|  |     assert cats["BAKING"] == goldparse.cats["BAKING"] | ||||||
|  | 
 | ||||||
|  |     # roundtrip to JSONL tuples | ||||||
|  |     with make_tempdir() as tmpdir: | ||||||
|  |         jsonl_file = tmpdir / "roundtrip.jsonl" | ||||||
|  |         # write to JSONL train dicts | ||||||
|  |         srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) | ||||||
|  |         goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) | ||||||
|  |         # load and rewrite as JSONL tuples | ||||||
|  |         srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples) | ||||||
|  |         goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) | ||||||
|  | 
 | ||||||
|  |     reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) | ||||||
|  | 
 | ||||||
|  |     assert len(doc) == goldcorpus.count_train() | ||||||
|  |     assert text == reloaded_doc.text | ||||||
|  |     assert tags == goldparse.tags | ||||||
|  |     assert deps == goldparse.labels | ||||||
|  |     assert heads == goldparse.heads | ||||||
|  |     assert biluo_tags == goldparse.ner | ||||||
|  |     assert "TRAVEL" in goldparse.cats | ||||||
|  |     assert "BAKING" in goldparse.cats | ||||||
|  |     assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] | ||||||
|  |     assert cats["BAKING"] == goldparse.cats["BAKING"] | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | # xfail while we have backwards-compatible alignment | ||||||
|  | @pytest.mark.xfail | ||||||
|  | @pytest.mark.parametrize( | ||||||
|  |     "tokens_a,tokens_b,expected", | ||||||
|  |     [ | ||||||
|  |         (["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})), | ||||||
|  |         ( | ||||||
|  |             ["a", "b", "``", "c"], | ||||||
|  |             ['ab"', "c"], | ||||||
|  |             (4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}), | ||||||
|  |         ), | ||||||
|  |         (["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})), | ||||||
|  |         ( | ||||||
|  |             ["ab", "c", "d"], | ||||||
|  |             ["a", "b", "cd"], | ||||||
|  |             (6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}), | ||||||
|  |         ), | ||||||
|  |         ( | ||||||
|  |             ["a", "b", "cd"], | ||||||
|  |             ["a", "b", "c", "d"], | ||||||
|  |             (3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}), | ||||||
|  |         ), | ||||||
|  |         ([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})), | ||||||
|  |     ], | ||||||
|  | ) | ||||||
|  | def test_align(tokens_a, tokens_b, expected): | ||||||
|  |     cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) | ||||||
|  |     assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected | ||||||
|  |     # check symmetry | ||||||
|  |     cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) | ||||||
|  |     assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def test_goldparse_startswith_space(en_tokenizer): | ||||||
|  |     text = " a" | ||||||
|  |     doc = en_tokenizer(text) | ||||||
|  |     g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0]) | ||||||
|  |     assert g.words == [" ", "a"] | ||||||
|  |     assert g.ner == [None, "U-DATE"] | ||||||
|  |     assert g.labels == [None, "ROOT"] | ||||||
|  |  | ||||||
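Note on the `test_align` cases above: the one-to-one entries of the expected `a2b` and `b2a` lists (where `-1` marks a token with no exact counterpart) can be reproduced by comparing character spans. The sketch below is a simplified illustration, not spaCy's `align` implementation. The helper names `char_spans` and `one_to_one_alignment` are hypothetical, and the sketch assumes both tokenizations concatenate to the same string, so it does not cover the quote-normalization or leading-whitespace cases, the cost value, or the multi-token mappings.

```python
def char_spans(tokens):
    """Return the (start, end) character span of each token in the joined string."""
    spans = []
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.append((start, end))
        start = end
    return spans


def one_to_one_alignment(tokens_a, tokens_b):
    """Map each token to the index of the token with the identical span, else -1."""
    spans_a = char_spans(tokens_a)
    spans_b = char_spans(tokens_b)
    index_a = {span: i for i, span in enumerate(spans_a)}
    index_b = {span: j for j, span in enumerate(spans_b)}
    a2b = [index_b.get(span, -1) for span in spans_a]
    b2a = [index_a.get(span, -1) for span in spans_b]
    return a2b, b2a


# Two of the parametrized cases from the test, restricted to a2b and b2a.
merged = one_to_one_alignment(["a", "b", "c"], ["ab", "c"])
split = one_to_one_alignment(["a", "b", "cd"], ["a", "b", "c", "d"])
```

Tokens that only partially overlap (such as `"a"` and `"ab"`) get `-1` here; in the test's expected tuples those cases are what the `a2b_multi` and `b2a_multi` dictionaries describe.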
|  | @ -95,12 +95,18 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_prefer_gpu(): | def test_prefer_gpu(): | ||||||
|     assert not prefer_gpu() |     try: | ||||||
|  |         import cupy  # noqa: F401 | ||||||
|  |     except ImportError: | ||||||
|  |         assert not prefer_gpu() | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_require_gpu(): | def test_require_gpu(): | ||||||
|     with pytest.raises(ValueError): |     try: | ||||||
|         require_gpu() |         import cupy  # noqa: F401 | ||||||
|  |     except ImportError: | ||||||
|  |         with pytest.raises(ValueError): | ||||||
|  |             require_gpu() | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| def test_create_symlink_windows( | def test_create_symlink_windows( | ||||||
|  |  | ||||||
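Note on the GPU tests above: they now only assert the CPU-only behaviour when `cupy` cannot be imported. A comparable guard can be written without triggering the import at all. This is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, not part of spaCy or its test suite.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# A standard-library module is always found; a made-up name is not.
found_json = has_module("json")
found_fake = has_module("surely_not_an_installed_module_xyz")

# In a test, the CPU-only assertion would then be guarded like this:
# if not has_module("cupy"):
#     assert not prefer_gpu()
```

The `try`/`except ImportError` form used in the diff has the advantage of also catching a `cupy` that is installed but fails to import, which a pure lookup would miss.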
							
								
								
									
										66
									
								
								spacy/tests/test_tok2vec.py
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										66
									
								
								spacy/tests/test_tok2vec.py
									
									
									
									
									
										Normal file
									
								
							|  | @ -0,0 +1,66 @@ | ||||||
|  | # coding: utf-8 | ||||||
|  | from __future__ import unicode_literals | ||||||
|  | 
 | ||||||
|  | import pytest | ||||||
|  | 
 | ||||||
|  | from spacy._ml import Tok2Vec | ||||||
|  | from spacy.vocab import Vocab | ||||||
|  | from spacy.tokens import Doc | ||||||
|  | from spacy.compat import unicode_ | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | def get_batch(batch_size): | ||||||
|  |     vocab = Vocab() | ||||||
|  |     docs = [] | ||||||
|  |     start = 0 | ||||||
|  |     for size in range(1, batch_size + 1): | ||||||
|  |         # Make the words numbers, so that they're distinct | ||||||
|  |         # across the batch, and easy to track. | ||||||
|  |         numbers = [unicode_(i) for i in range(start, start + size)] | ||||||
|  |         docs.append(Doc(vocab, words=numbers)) | ||||||
|  |         start += size | ||||||
|  |     return docs | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | # This fails in Thinc v7.3.1. Need to push patch | ||||||
|  | @pytest.mark.xfail | ||||||
|  | def test_empty_doc(): | ||||||
|  |     width = 128 | ||||||
|  |     embed_size = 2000 | ||||||
|  |     vocab = Vocab() | ||||||
|  |     doc = Doc(vocab, words=[]) | ||||||
|  |     tok2vec = Tok2Vec(width, embed_size) | ||||||
|  |     vectors, backprop = tok2vec.begin_update([doc]) | ||||||
|  |     assert len(vectors) == 1 | ||||||
|  |     assert vectors[0].shape == (0, width) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.mark.parametrize( | ||||||
|  |     "batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]] | ||||||
|  | ) | ||||||
|  | def test_tok2vec_batch_sizes(batch_size, width, embed_size): | ||||||
|  |     batch = get_batch(batch_size) | ||||||
|  |     tok2vec = Tok2Vec(width, embed_size) | ||||||
|  |     vectors, backprop = tok2vec.begin_update(batch) | ||||||
|  |     assert len(vectors) == len(batch) | ||||||
|  |     for doc_vec, doc in zip(vectors, batch): | ||||||
|  |         assert doc_vec.shape == (len(doc), width) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | @pytest.mark.parametrize( | ||||||
|  |     "tok2vec_config", | ||||||
|  |     [ | ||||||
|  |         {"width": 8, "embed_size": 100, "char_embed": False}, | ||||||
|  |         {"width": 8, "embed_size": 100, "char_embed": True}, | ||||||
|  |         {"width": 8, "embed_size": 100, "conv_depth": 6}, | ||||||
|  |         {"width": 8, "embed_size": 100, "conv_depth": 6}, | ||||||
|  |         {"width": 8, "embed_size": 100, "subword_features": False}, | ||||||
|  |     ], | ||||||
|  | ) | ||||||
|  | def test_tok2vec_configs(tok2vec_config): | ||||||
|  |     docs = get_batch(3) | ||||||
|  |     tok2vec = Tok2Vec(**tok2vec_config) | ||||||
|  |     vectors, backprop = tok2vec.begin_update(docs) | ||||||
|  |     assert len(vectors) == len(docs) | ||||||
|  |     assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"]) | ||||||
|  |     backprop(vectors) | ||||||
|  | @ -103,7 +103,8 @@ class DocBin(object): | ||||||
|             doc = Doc(vocab, words=words, spaces=spaces) |             doc = Doc(vocab, words=words, spaces=spaces) | ||||||
|             doc = doc.from_array(self.attrs, tokens) |             doc = doc.from_array(self.attrs, tokens) | ||||||
|             if self.store_user_data: |             if self.store_user_data: | ||||||
|                 doc.user_data.update(srsly.msgpack_loads(self.user_data[i])) |                 user_data = srsly.msgpack_loads(self.user_data[i], use_list=False) | ||||||
|  |                 doc.user_data.update(user_data) | ||||||
|             yield doc |             yield doc | ||||||
| 
 | 
 | ||||||
|     def merge(self, other): |     def merge(self, other): | ||||||
|  | @ -155,9 +156,9 @@ class DocBin(object): | ||||||
|         msg = srsly.msgpack_loads(zlib.decompress(bytes_data)) |         msg = srsly.msgpack_loads(zlib.decompress(bytes_data)) | ||||||
|         self.attrs = msg["attrs"] |         self.attrs = msg["attrs"] | ||||||
|         self.strings = set(msg["strings"]) |         self.strings = set(msg["strings"]) | ||||||
|         lengths = numpy.fromstring(msg["lengths"], dtype="int32") |         lengths = numpy.frombuffer(msg["lengths"], dtype="int32") | ||||||
|         flat_spaces = numpy.fromstring(msg["spaces"], dtype=bool) |         flat_spaces = numpy.frombuffer(msg["spaces"], dtype=bool) | ||||||
|         flat_tokens = numpy.fromstring(msg["tokens"], dtype="uint64") |         flat_tokens = numpy.frombuffer(msg["tokens"], dtype="uint64") | ||||||
|         shape = (flat_tokens.size // len(self.attrs), len(self.attrs)) |         shape = (flat_tokens.size // len(self.attrs), len(self.attrs)) | ||||||
|         flat_tokens = flat_tokens.reshape(shape) |         flat_tokens = flat_tokens.reshape(shape) | ||||||
|         flat_spaces = flat_spaces.reshape((flat_spaces.size, 1)) |         flat_spaces = flat_spaces.reshape((flat_spaces.size, 1)) | ||||||
|  |  | ||||||
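Note on the `use_list=False` change above: extension attribute values are stored in `user_data` under tuple keys such as `("._.", "foo", None, None)`, as the regression test for issue 4528 shows. The change presumably matters because a decoder that returns arrays as lists cannot rebuild those keys, since lists are unhashable. The sketch below illustrates that constraint in plain Python. It does not call msgpack or srsly; `restore_keys` is a hypothetical stand-in for the decoding step.

```python
def restore_keys(pairs, use_list):
    """Rebuild a dict from (key, value) pairs, turning sequence keys into
    lists (use_list=True) or tuples (use_list=False), as a decoder might."""
    container = list if use_list else tuple
    result = {}
    for key, value in pairs:
        if isinstance(key, (list, tuple)):
            key = container(key)
        result[key] = value  # raises TypeError if key is a list
    return result


pairs = [("foo", "bar"), (("._.", "foo", None, None), "bar")]

# Tuples are hashable, so the extension-attribute key survives the roundtrip.
as_tuples = restore_keys(pairs, use_list=False)

# Lists are not hashable, so the same data cannot be restored as a dict.
list_error = None
try:
    restore_keys(pairs, use_list=True)
except TypeError as err:
    list_error = str(err)
```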
|  | @ -142,6 +142,11 @@ def register_architecture(name, arch=None): | ||||||
|     return do_registration |     return do_registration | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def make_layer(arch_config): | ||||||
|  |     arch_func = get_architecture(arch_config["arch"]) | ||||||
|  |     return arch_func(arch_config["config"]) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| def get_architecture(name): | def get_architecture(name): | ||||||
|     """Get a model architecture function by name. Raises a KeyError if the |     """Get a model architecture function by name. Raises a KeyError if the | ||||||
|     architecture is not found. |     architecture is not found. | ||||||
|  | @ -242,6 +247,7 @@ def load_model_from_path(model_path, meta=False, **overrides): | ||||||
|     cls = get_lang_class(lang) |     cls = get_lang_class(lang) | ||||||
|     nlp = cls(meta=meta, **overrides) |     nlp = cls(meta=meta, **overrides) | ||||||
|     pipeline = meta.get("pipeline", []) |     pipeline = meta.get("pipeline", []) | ||||||
|  |     factories = meta.get("factories", {}) | ||||||
|     disable = overrides.get("disable", []) |     disable = overrides.get("disable", []) | ||||||
|     if pipeline is True: |     if pipeline is True: | ||||||
|         pipeline = nlp.Defaults.pipe_names |         pipeline = nlp.Defaults.pipe_names | ||||||
|  | @ -250,7 +256,8 @@ def load_model_from_path(model_path, meta=False, **overrides): | ||||||
|     for name in pipeline: |     for name in pipeline: | ||||||
|         if name not in disable: |         if name not in disable: | ||||||
|             config = meta.get("pipeline_args", {}).get(name, {}) |             config = meta.get("pipeline_args", {}).get(name, {}) | ||||||
|             component = nlp.create_pipe(name, config=config) |             factory = factories.get(name, name) | ||||||
|  |             component = nlp.create_pipe(factory, config=config) | ||||||
|             nlp.add_pipe(component, name=name) |             nlp.add_pipe(component, name=name) | ||||||
|     return nlp.from_disk(model_path) |     return nlp.from_disk(model_path) | ||||||
| 
 | 
 | ||||||
|  | @ -363,6 +370,16 @@ def is_in_jupyter(): | ||||||
|     return False |     return False | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | def get_component_name(component): | ||||||
|  |     if hasattr(component, "name"): | ||||||
|  |         return component.name | ||||||
|  |     if hasattr(component, "__name__"): | ||||||
|  |         return component.__name__ | ||||||
|  |     if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"): | ||||||
|  |         return component.__class__.__name__ | ||||||
|  |     return repr(component) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| def get_cuda_stream(require=False): | def get_cuda_stream(require=False): | ||||||
|     if CudaStream is None: |     if CudaStream is None: | ||||||
|         return None |         return None | ||||||
|  | @ -404,7 +421,7 @@ def env_opt(name, default=None): | ||||||
| 
 | 
 | ||||||
| def read_regex(path): | def read_regex(path): | ||||||
|     path = ensure_path(path) |     path = ensure_path(path) | ||||||
|     with path.open() as file_: |     with path.open(encoding="utf8") as file_: | ||||||
|         entries = file_.read().split("\n") |         entries = file_.read().split("\n") | ||||||
|     expression = "|".join( |     expression = "|".join( | ||||||
|         ["^" + re.escape(piece) for piece in entries if piece.strip()] |         ["^" + re.escape(piece) for piece in entries if piece.strip()] | ||||||
|  |  | ||||||
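Note on `get_component_name` above: the helper tries an explicit `name` attribute first, then `__name__` (which functions have), then the class name, and finally falls back to `repr`. The sketch below copies the function from the diff and runs it on three hypothetical components to show which branch each one hits; the component classes and the function are made up for illustration.

```python
def get_component_name(component):
    # Same fallback order as the helper added in this diff.
    if hasattr(component, "name"):
        return component.name
    if hasattr(component, "__name__"):
        return component.__name__
    if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
        return component.__class__.__name__
    return repr(component)


class NamedComponent(object):
    name = "my_component"  # hypothetical explicit name

    def __call__(self, doc):
        return doc


class AnonymousComponent(object):
    # No "name" attribute; instances also have no "__name__".
    def __call__(self, doc):
        return doc


def function_component(doc):
    return doc


names = [
    get_component_name(component)
    for component in (NamedComponent(), AnonymousComponent(), function_component)
]
```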
|  | @ -48,14 +48,14 @@ be installed if needed via `pip install spacy[lookups]`. Some languages provide | ||||||
| full lemmatization rules and exceptions, while other languages currently only | full lemmatization rules and exceptions, while other languages currently only | ||||||
| rely on simple lookup tables. | rely on simple lookup tables. | ||||||
| 
 | 
 | ||||||
| <Infobox title="About spaCy's custom pronoun lemma" variant="warning"> | <Infobox title="About spaCy's custom pronoun lemma for English" variant="warning"> | ||||||
| 
 | 
 | ||||||
| spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the | spaCy adds a **special case for English pronouns**: all English pronouns are | ||||||
| special token `-PRON-`. Unlike verbs and common nouns, there's no clear base | lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, | ||||||
| form of a personal pronoun. Should the lemma of "me" be "I", or should we | there's no clear base form of a personal pronoun. Should the lemma of "me" be | ||||||
| normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to | "I", or should we normalize person as well, giving "it" — or maybe "he"? | ||||||
| introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal | spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the | ||||||
| pronouns. | lemma for all personal pronouns. | ||||||
| 
 | 
 | ||||||
| </Infobox> | </Infobox> | ||||||
| 
 | 
 | ||||||
|  | @ -117,76 +117,72 @@ type. They're available as the [`Token.pos`](/api/token#attributes) and | ||||||
| 
 | 
 | ||||||
| The English part-of-speech tagger uses the | The English part-of-speech tagger uses the | ||||||
| [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn | [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn | ||||||
| Treebank tag set. We also map the tags to the simpler Google Universal POS tag | Treebank tag set. We also map the tags to the simpler Universal Dependencies v2 | ||||||
| set. | POS tag set. | ||||||
| 
 |  | ||||||
| | Tag                                 |  POS    | Morphology                                     | Description                               | |  | ||||||
| | ----------------------------------- | ------- | ---------------------------------------------- | ----------------------------------------- | |  | ||||||
| | `-LRB-`                             | `PUNCT` | `PunctType=brck PunctSide=ini`                 | left round bracket                        | |  | ||||||
| | `-RRB-`                             | `PUNCT` | `PunctType=brck PunctSide=fin`                 | right round bracket                       | |  | ||||||
| | `,`                                 | `PUNCT` | `PunctType=comm`                               | punctuation mark, comma                   | |  | ||||||
| | `:`                                 | `PUNCT` |                                                | punctuation mark, colon or ellipsis       | |  | ||||||
| | `.`                                 | `PUNCT` | `PunctType=peri`                               | punctuation mark, sentence closer         | |  | ||||||
| | `''`                                | `PUNCT` | `PunctType=quot PunctSide=fin`                 | closing quotation mark                    | |  | ||||||
| | `""`                                | `PUNCT` | `PunctType=quot PunctSide=fin`                 | closing quotation mark                    | |  | ||||||
| | <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini`                 | opening quotation mark                    | |  | ||||||
| | `#`                                 | `SYM`   | `SymType=numbersign`                           | symbol, number sign                       | |  | ||||||
| | `$`                                 | `SYM`   | `SymType=currency`                             | symbol, currency                          | |  | ||||||
| | `ADD`                               | `X`     |                                                | email                                     | |  | ||||||
| | `AFX`                               | `ADJ`   | `Hyph=yes`                                     | affix                                     | |  | ||||||
| | `BES`                               | `VERB`  |                                                | auxiliary "be"                            | |  | ||||||
| | `CC`                                | `CONJ`  | `ConjType=coor`                                | conjunction, coordinating                 | |  | ||||||
| | `CD`                                | `NUM`   | `NumType=card`                                 | cardinal number                           | |  | ||||||
| | `DT`                                | `DET`   |                                                | determiner                                | |  | ||||||
| | `EX`                                | `ADV`   | `AdvType=ex`                                   | existential there                         | |  | ||||||
| | `FW`                                | `X`     | `Foreign=yes`                                  | foreign word                              | |  | ||||||
| | `GW`                                | `X`     |                                                | additional word in multi-word expression  | |  | ||||||
| | `HVS`                               | `VERB`  |                                                | forms of "have"                           | |  | ||||||
| | `HYPH`                              | `PUNCT` | `PunctType=dash`                               | punctuation mark, hyphen                  | |  | ||||||
| | `IN`                                | `ADP`   |                                                | conjunction, subordinating or preposition | |  | ||||||
| | `JJ`                                | `ADJ`   | `Degree=pos`                                   | adjective                                 | |  | ||||||
| | `JJR`                               | `ADJ`   | `Degree=comp`                                  | adjective, comparative                    | |  | ||||||
| | `JJS`                               | `ADJ`   | `Degree=sup`                                   | adjective, superlative                    | |  | ||||||
| | `LS`                                | `PUNCT` | `NumType=ord`                                  | list item marker                          | |  | ||||||
| | `MD`                                | `VERB`  | `VerbType=mod`                                 | verb, modal auxiliary                     | |  | ||||||
| | `NFP`                               | `PUNCT` |                                                | superfluous punctuation                   | |  | ||||||
| | `NIL`                               |         |                                                | missing tag                               | |  | ||||||
| | `NN`                                | `NOUN`  | `Number=sing`                                  | noun, singular or mass                    | |  | ||||||
| | `NNP`                               | `PROPN` | `NounType=prop Number=sign`                    | noun, proper singular                     | |  | ||||||
| | `NNPS`                              | `PROPN` | `NounType=prop Number=plur`                    | noun, proper plural                       | |  | ||||||
| | `NNS`                               | `NOUN`  | `Number=plur`                                  | noun, plural                              | |  | ||||||
| | `PDT`                               | `ADJ`   | `AdjType=pdt PronType=prn`                     | predeterminer                             | |  | ||||||
| | `POS`                               | `PART`  | `Poss=yes`                                     | possessive ending                         | |  | ||||||
| | `PRP`                               | `PRON`  | `PronType=prs`                                 | pronoun, personal                         | |  | ||||||
| | `PRP$`                              | `ADJ`   | `PronType=prs Poss=yes`                        | pronoun, possessive                       | |  | ||||||
| | `RB`                                | `ADV`   | `Degree=pos`                                   | adverb                                    | |  | ||||||
| | `RBR`                               | `ADV`   | `Degree=comp`                                  | adverb, comparative                       | |  | ||||||
| | `RBS`                               | `ADV`   | `Degree=sup`                                   | adverb, superlative                       | |  | ||||||
| | `RP`                                | `PART`  |                                                | adverb, particle                          | |  | ||||||
| | `_SP`                               | `SPACE` |                                                | space                                     | |  | ||||||
| | `SYM`                               | `SYM`   |                                                | symbol                                    | |  | ||||||
| | `TO`                                | `PART`  | `PartType=inf VerbForm=inf`                    | infinitival "to"                          | |  | ||||||
| | `UH`                                | `INTJ`  |                                                | interjection                              | |  | ||||||
| | `VB`                                | `VERB`  | `VerbForm=inf`                                 | verb, base form                           | |  | ||||||
| | `VBD`                               | `VERB`  | `VerbForm=fin Tense=past`                      | verb, past tense                          | |  | ||||||
| | `VBG`                               | `VERB`  | `VerbForm=part Tense=pres Aspect=prog`         | verb, gerund or present participle        | |  | ||||||
| | `VBN`                               | `VERB`  | `VerbForm=part Tense=past Aspect=perf`         | verb, past participle                     | |  | ||||||
| | `VBP`                               | `VERB`  | `VerbForm=fin Tense=pres`                      | verb, non-3rd person singular present     | |  | ||||||
| | `VBZ`                               | `VERB`  | `VerbForm=fin Tense=pres Number=sing Person=3` | verb, 3rd person singular present         | |  | ||||||
| | `WDT`                               | `ADJ`   | `PronType=int|rel`                             | wh-determiner                             | |  | ||||||
| | `WP`                                | `NOUN`  | `PronType=int|rel`                             | wh-pronoun, personal                      | |  | ||||||
| | `WP$`                               | `ADJ`   | `Poss=yes PronType=int|rel`                    | wh-pronoun, possessive                    | |  | ||||||
| | `WRB`                               | `ADV`   | `PronType=int|rel`                             | wh-adverb                                 | |  | ||||||
| | `XX`                                | `X`     |                                                | unknown                                   | |  | ||||||
| 
 | 
 | ||||||
|  | | Tag                                   |  POS    | Morphology                              | Description                               | | ||||||
|  | | ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- | | ||||||
|  | | `$`                                   | `SYM`   |                                          | symbol, currency                          | | ||||||
|  | | <InlineCode>``</InlineCode>   | `PUNCT` | `PunctType=quot PunctSide=ini`           | opening quotation mark                    | | ||||||
|  | | `''`                                  | `PUNCT` | `PunctType=quot PunctSide=fin`           | closing quotation mark                    | | ||||||
|  | | `,`                                   | `PUNCT` | `PunctType=comm`                         | punctuation mark, comma                   | | ||||||
|  | | `-LRB-`                               | `PUNCT` | `PunctType=brck PunctSide=ini`           | left round bracket                        | | ||||||
|  | | `-RRB-`                               | `PUNCT` | `PunctType=brck PunctSide=fin`           | right round bracket                       | | ||||||
|  | | `.`                                   | `PUNCT` | `PunctType=peri`                         | punctuation mark, sentence closer         | | ||||||
|  | | `:`                                   | `PUNCT` |                                          | punctuation mark, colon or ellipsis       | | ||||||
|  | | `ADD`                                 | `X`     |                                          | email                                     | | ||||||
|  | | `AFX`                                 | `ADJ`   | `Hyph=yes`                               | affix                                     | | ||||||
|  | | `CC`                                  | `CCONJ` | `ConjType=comp`                          | conjunction, coordinating                 | | ||||||
|  | | `CD`                                  | `NUM`   | `NumType=card`                           | cardinal number                           | | ||||||
|  | | `DT`                                  | `DET`   |                                          | determiner                                | | ||||||
|  | | `EX`                                  | `PRON`  | `AdvType=ex`                             | existential there                         | | ||||||
|  | | `FW`                                  | `X`     | `Foreign=yes`                            | foreign word                              | | ||||||
|  | | `GW`                                  | `X`     |                                          | additional word in multi-word expression  | | ||||||
|  | | `HYPH`                                | `PUNCT` | `PunctType=dash`                         | punctuation mark, hyphen                  | | ||||||
|  | | `IN`                                  | `ADP`   |                                          | conjunction, subordinating or preposition | | ||||||
|  | | `JJ`                                  | `ADJ`   | `Degree=pos`                             | adjective                                 | | ||||||
|  | | `JJR`                                 | `ADJ`   | `Degree=comp`                            | adjective, comparative                    | | ||||||
|  | | `JJS`                                 | `ADJ`   | `Degree=sup`                             | adjective, superlative                    | | ||||||
|  | | `LS`                                  | `X`     | `NumType=ord`                            | list item marker                          | | ||||||
|  | | `MD`                                  | `VERB`  | `VerbType=mod`                           | verb, modal auxiliary                     | | ||||||
|  | | `NFP`                                 | `PUNCT` |                                          | superfluous punctuation                   | | ||||||
|  | | `NIL`                                 | `X`     |                                          | missing tag                               | | ||||||
|  | | `NN`                                  | `NOUN`  | `Number=sing`                            | noun, singular or mass                    | | ||||||
|  | | `NNP`                                 | `PROPN` | `NounType=prop Number=sing`              | noun, proper singular                     | | ||||||
|  | | `NNPS`                                | `PROPN` | `NounType=prop Number=plur`              | noun, proper plural                       | | ||||||
|  | | `NNS`                                 | `NOUN`  | `Number=plur`                            | noun, plural                              | | ||||||
|  | | `PDT`                                 | `DET`   |                                          | predeterminer                             | | ||||||
|  | | `POS`                                 | `PART`  | `Poss=yes`                               | possessive ending                         | | ||||||
|  | | `PRP`                                 | `PRON`  | `PronType=prs`                           | pronoun, personal                         | | ||||||
|  | | `PRP$`                                | `DET`   | `PronType=prs Poss=yes`                  | pronoun, possessive                       | | ||||||
|  | | `RB`                                  | `ADV`   | `Degree=pos`                             | adverb                                    | | ||||||
|  | | `RBR`                                 | `ADV`   | `Degree=comp`                            | adverb, comparative                       | | ||||||
|  | | `RBS`                                 | `ADV`   | `Degree=sup`                             | adverb, superlative                       | | ||||||
|  | | `RP`                                  | `ADP`   |                                          | adverb, particle                          | | ||||||
|  | | `SP`                                  | `SPACE` |                                          | space                                     | | ||||||
|  | | `SYM`                                 | `SYM`   |                                          | symbol                                    | | ||||||
|  | | `TO`                                  | `PART`  | `PartType=inf VerbForm=inf`              | infinitival "to"                          | | ||||||
|  | | `UH`                                  | `INTJ`  |                                          | interjection                              | | ||||||
|  | | `VB`                                  | `VERB`  | `VerbForm=inf`                           | verb, base form                           | | ||||||
|  | | `VBD`                                 | `VERB`  | `VerbForm=fin Tense=past`                | verb, past tense                          | | ||||||
|  | | `VBG`                                 | `VERB`  | `VerbForm=part Tense=pres Aspect=prog`   | verb, gerund or present participle        | | ||||||
|  | | `VBN`                                 | `VERB`  | `VerbForm=part Tense=past Aspect=perf`   | verb, past participle                     | | ||||||
|  | | `VBP`                                 | `VERB`  | `VerbForm=fin Tense=pres`                | verb, non-3rd person singular present     | | ||||||
|  | | `VBZ`                                 | `VERB`  | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present         | | ||||||
|  | | `WDT`                                 | `DET`   |                                          | wh-determiner                             | | ||||||
|  | | `WP`                                  | `PRON`  |                                          | wh-pronoun, personal                      | | ||||||
|  | | `WP$`                                 | `DET`   | `Poss=yes`                               | wh-pronoun, possessive                    | | ||||||
|  | | `WRB`                                 | `ADV`   |                                          | wh-adverb                                 | | ||||||
|  | | `XX`                                  | `X`     |                                          | unknown                                   | | ||||||
|  | | `_SP`                                 | `SPACE` |                                          |                                           | | ||||||
| </Accordion> | </Accordion> | ||||||
| 
 | 
 | ||||||
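As an illustration of how the rows in the updated English table can be read, the sketch below copies a handful of them into a dictionary and splits the space-separated morphology string into key/value pairs. The names `TAG_MAP_EXCERPT`, `parse_morphology` and `describe` are assumptions made for this example, not spaCy's API.

```python
# A few rows copied from the updated English table: tag -> (POS, morphology).
TAG_MAP_EXCERPT = {
    "CC": ("CCONJ", "ConjType=comp"),
    "NNP": ("PROPN", "NounType=prop Number=sing"),
    "PRP$": ("DET", "PronType=prs Poss=yes"),
    "VBD": ("VERB", "VerbForm=fin Tense=past"),
    "WDT": ("DET", ""),
}


def parse_morphology(features):
    """Split a space-separated "Key=Value" string into a dict (empty string -> {})."""
    return dict(item.split("=", 1) for item in features.split())


def describe(tag):
    """Look up a fine-grained tag and return its coarse POS and parsed features."""
    pos, morph = TAG_MAP_EXCERPT[tag]
    return pos, parse_morphology(morph)


print(describe("VBD"))
```

Tags with an empty Morphology column, such as `WDT` in the new table, simply produce an empty feature dictionary.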
| <Accordion title="German" id="pos-de"> | <Accordion title="German" id="pos-de"> | ||||||
| 
 | 
 | ||||||
| The German part-of-speech tagger uses the | The German part-of-speech tagger uses the | ||||||
| [TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) | [TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) | ||||||
| annotation scheme. We also map the tags to the simpler Google Universal POS tag | annotation scheme. We also map the tags to the simpler Universal Dependencies | ||||||
| set. | v2 POS tag set. | ||||||
| 
 | 
 | ||||||
| | Tag       |  POS    | Morphology                               | Description                                       | | | Tag       |  POS    | Morphology                               | Description                                       | | ||||||
| | --------- | ------- | ---------------------------------------- | ------------------------------------------------- | | | --------- | ------- | ---------------------------------------- | ------------------------------------------------- | | ||||||
|  | @ -194,7 +190,7 @@ set. | ||||||
| | `$,`      | `PUNCT` | `PunctType=comm`                         | comma                                             | | | `$,`      | `PUNCT` | `PunctType=comm`                         | comma                                             | | ||||||
| | `$.`      | `PUNCT` | `PunctType=peri`                         | sentence-final punctuation mark                   | | | `$.`      | `PUNCT` | `PunctType=peri`                         | sentence-final punctuation mark                   | | ||||||
| | `ADJA`    | `ADJ`   |                                          | adjective, attributive                            | | | `ADJA`    | `ADJ`   |                                          | adjective, attributive                            | | ||||||
| | `ADJD`    | `ADJ`   | `Variant=short`                          | adjective, adverbial or predicative               | | | `ADJD`    | `ADJ`   |                                          | adjective, adverbial or predicative               | | ||||||
| | `ADV`     | `ADV`   |                                          | adverb                                            | | | `ADV`     | `ADV`   |                                          | adverb                                            | | ||||||
| | `APPO`    | `ADP`   | `AdpType=post`                           | postposition                                      | | | `APPO`    | `ADP`   | `AdpType=post`                           | postposition                                      | | ||||||
| | `APPR`    | `ADP`   | `AdpType=prep`                           | preposition; circumposition left                  | | | `APPR`    | `ADP`   | `AdpType=prep`                           | preposition; circumposition left                  | | ||||||
|  | @ -204,28 +200,28 @@ set. | ||||||
| | `CARD`    | `NUM`   | `NumType=card`                           | cardinal number                                   | | | `CARD`    | `NUM`   | `NumType=card`                           | cardinal number                                   | | ||||||
| | `FM`      | `X`     | `Foreign=yes`                            | foreign language material                         | | | `FM`      | `X`     | `Foreign=yes`                            | foreign language material                         | | ||||||
| | `ITJ`     | `INTJ`  |                                          | interjection                                      | | | `ITJ`     | `INTJ`  |                                          | interjection                                      | | ||||||
| | `KOKOM`   | `CONJ`  | `ConjType=comp`                          | comparative conjunction                           | | | `KOKOM`   | `CCONJ` | `ConjType=comp`                          | comparative conjunction                           | | ||||||
| | `KON`     | `CONJ`  |                                          | coordinate conjunction                            | | | `KON`     | `CCONJ` |                                          | coordinate conjunction                            | | ||||||
| | `KOUI`    | `SCONJ` |                                          | subordinate conjunction with "zu" and infinitive  | | | `KOUI`    | `SCONJ` |                                          | subordinate conjunction with "zu" and infinitive  | | ||||||
| | `KOUS`    | `SCONJ` |                                          | subordinate conjunction with sentence             | | | `KOUS`    | `SCONJ` |                                          | subordinate conjunction with sentence             | | ||||||
| | `NE`      | `PROPN` |                                          | proper noun                                       | | | `NE`      | `PROPN` |                                          | proper noun                                       | | ||||||
| | `NNE`     | `PROPN` |                                          | proper noun                                       | |  | ||||||
| | `NN`      | `NOUN`  |                                          | noun, singular or mass                            | | | `NN`      | `NOUN`  |                                          | noun, singular or mass                            | | ||||||
| | `PROAV`   | `ADV`   | `PronType=dem`                           | pronominal adverb                                 | | | `NNE`     | `PROPN` |                                          | proper noun                                       | | ||||||
| | `PDAT`    | `DET`   | `PronType=dem`                           | attributive demonstrative pronoun                 | | | `PDAT`    | `DET`   | `PronType=dem`                           | attributive demonstrative pronoun                 | | ||||||
| | `PDS`     | `PRON`  | `PronType=dem`                           | substituting demonstrative pronoun                | | | `PDS`     | `PRON`  | `PronType=dem`                           | substituting demonstrative pronoun                | | ||||||
| | `PIAT`    | `DET`   | `PronType=ind\|neg\|tot`                 | attributive indefinite pronoun without determiner | | | `PIAT`    | `DET`   | `PronType=ind|neg|tot`                   | attributive indefinite pronoun without determiner | | ||||||
| | `PIS`     | `PRON`  | `PronType=ind\|neg\|tot`                 | substituting indefinite pronoun                   | | | `PIS`     | `PRON`  | `PronType=ind|neg|tot`                   | substituting indefinite pronoun                   | | ||||||
| | `PPER`    | `PRON`  | `PronType=prs`                           | non-reflexive personal pronoun                    | | | `PPER`    | `PRON`  | `PronType=prs`                           | non-reflexive personal pronoun                    | | ||||||
| | `PPOSAT`  | `DET`   | `Poss=yes PronType=prs`                  | attributive possessive pronoun                    | | | `PPOSAT`  | `DET`   | `Poss=yes PronType=prs`                  | attributive possessive pronoun                    | | ||||||
| | `PPOSS`   | `PRON`  | `PronType=rel`                           | substituting possessive pronoun                   | | | `PPOSS`   | `PRON`  | `Poss=yes PronType=prs`                  | substituting possessive pronoun                   | | ||||||
| | `PRELAT`  | `DET`   | `PronType=rel`                           | attributive relative pronoun                      | | | `PRELAT`  | `DET`   | `PronType=rel`                           | attributive relative pronoun                      | | ||||||
| | `PRELS`   | `PRON`  | `PronType=rel`                           | substituting relative pronoun                     | | | `PRELS`   | `PRON`  | `PronType=rel`                           | substituting relative pronoun                     | | ||||||
| | `PRF`     | `PRON`  | `PronType=prs Reflex=yes`                | reflexive personal pronoun                        | | | `PRF`     | `PRON`  | `PronType=prs Reflex=yes`                | reflexive personal pronoun                        | | ||||||
|  | | `PROAV`   | `ADV`   | `PronType=dem`                           | pronominal adverb                                 | | ||||||
| | `PTKA`    | `PART`  |                                          | particle with adjective or adverb                 | | | `PTKA`    | `PART`  |                                          | particle with adjective or adverb                 | | ||||||
| | `PTKANT`  | `PART`  | `PartType=res`                           | answer particle                                   | | | `PTKANT`  | `PART`  | `PartType=res`                           | answer particle                                   | | ||||||
| | `PTKNEG`  | `PART`  | `Negative=yes`                           | negative particle                                 | | | `PTKNEG`  | `PART`  | `Polarity=neg`                           | negative particle                                 | | ||||||
| | `PTKVZ`   | `PART`  | `PartType=vbp`                           | separable verbal particle                         | | | `PTKVZ`   | `ADP`   | `PartType=vbp`                           | separable verbal particle                         | | ||||||
| | `PTKZU`   | `PART`  | `PartType=inf`                           | "zu" before infinitive                            | | | `PTKZU`   | `PART`  | `PartType=inf`                           | "zu" before infinitive                            | | ||||||
| | `PWAT`    | `DET`   | `PronType=int`                           | attributive interrogative pronoun                 | | | `PWAT`    | `DET`   | `PronType=int`                           | attributive interrogative pronoun                 | | ||||||
| | `PWAV`    | `ADV`   | `PronType=int`                           | adverbial interrogative or relative pronoun       | | | `PWAV`    | `ADV`   | `PronType=int`                           | adverbial interrogative or relative pronoun       | | ||||||
|  | @ -234,9 +230,9 @@ set. | ||||||
| | `VAFIN`   | `AUX`   | `Mood=ind VerbForm=fin`                  | finite verb, auxiliary                            | | | `VAFIN`   | `AUX`   | `Mood=ind VerbForm=fin`                  | finite verb, auxiliary                            | | ||||||
| | `VAIMP`   | `AUX`   | `Mood=imp VerbForm=fin`                  | imperative, auxiliary                             | | | `VAIMP`   | `AUX`   | `Mood=imp VerbForm=fin`                  | imperative, auxiliary                             | | ||||||
| | `VAINF`   | `AUX`   | `VerbForm=inf`                           | infinitive, auxiliary                             | | | `VAINF`   | `AUX`   | `VerbForm=inf`                           | infinitive, auxiliary                             | | ||||||
| | `VAPP`    | `AUX`   | `Aspect=perf VerbForm=fin`               | perfect participle, auxiliary                     | | | `VAPP`    | `AUX`   | `Aspect=perf VerbForm=part`              | perfect participle, auxiliary                     | | ||||||
| | `VMFIN`   | `VERB`  | `Mood=ind VerbForm=fin VerbType=mod`     | finite verb, modal                                | | | `VMFIN`   | `VERB`  | `Mood=ind VerbForm=fin VerbType=mod`     | finite verb, modal                                | | ||||||
| | `VMINF`   | `VERB`  | `VerbForm=fin VerbType=mod`              | infinitive, modal                                 | | | `VMINF`   | `VERB`  | `VerbForm=inf VerbType=mod`              | infinitive, modal                                 | | ||||||
| | `VMPP`    | `VERB`  | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal                         | | | `VMPP`    | `VERB`  | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal                         | | ||||||
| | `VVFIN`   | `VERB`  | `Mood=ind VerbForm=fin`                  | finite verb, full                                 | | | `VVFIN`   | `VERB`  | `Mood=ind VerbForm=fin`                  | finite verb, full                                 | | ||||||
| | `VVIMP`   | `VERB`  | `Mood=imp VerbForm=fin`                  | imperative, full                                  | | | `VVIMP`   | `VERB`  | `Mood=imp VerbForm=fin`                  | imperative, full                                  | | ||||||
|  | @ -244,8 +240,7 @@ set. | ||||||
| | `VVIZU`   | `VERB`  | `VerbForm=inf`                           | infinitive with "zu", full                        | | | `VVIZU`   | `VERB`  | `VerbForm=inf`                           | infinitive with "zu", full                        | | ||||||
| | `VVPP`    | `VERB`  | `Aspect=perf VerbForm=part`              | perfect participle, full                          | | | `VVPP`    | `VERB`  | `Aspect=perf VerbForm=part`              | perfect participle, full                          | | ||||||
| | `XY`      | `X`     |                                          | non-word containing non-letter                    | | | `XY`      | `X`     |                                          | non-word containing non-letter                    | | ||||||
| | `SP`      | `SPACE` |                                          | space                                             | | | `_SP`     | `SPACE` |                                          |                                                   | | ||||||
| 
 |  | ||||||
| </Accordion> | </Accordion> | ||||||
| 
 | 
 | ||||||
| --- | --- | ||||||
|  |  | ||||||
|  | @ -155,21 +155,14 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] | ||||||
| 
 | 
 | ||||||
| ### Output file types {new="2.1"} | ### Output file types {new="2.1"} | ||||||
| 
 | 
 | ||||||
| > #### Which format should I choose? |  | ||||||
| > |  | ||||||
| > If you're not sure, go with the default `jsonl`. Newline-delimited JSON means |  | ||||||
| > that there's one JSON object per line. Unlike a regular JSON file, it can also |  | ||||||
| > be read in line-by-line and you won't have to parse the _entire file_ first. |  | ||||||
| > This makes it a very convenient format for larger corpora. |  | ||||||
| 
 |  | ||||||
| All output files generated by this command are compatible with | All output files generated by this command are compatible with | ||||||
| [`spacy train`](/api/cli#train). | [`spacy train`](/api/cli#train). | ||||||
| 
 | 
 | ||||||
| | ID      | Description                       | | | ID      | Description                | | ||||||
| | ------- | --------------------------------- | | | ------- | -------------------------- | | ||||||
| | `jsonl` | Newline-delimited JSON (default). | | | `json`  | Regular JSON (default).    | | ||||||
| | `json`  | Regular JSON.                     | | | `jsonl` | Newline-delimited JSON.    | | ||||||
| | `msg`   | Binary MessagePack format.        | | | `msg`   | Binary MessagePack format. | | ||||||
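
The practical difference between the `json` and `jsonl` output types can be shown with the standard library alone. The following is a minimal sketch, assuming a hypothetical list of records (the field names are made up and this is not spaCy's training format): regular JSON is a single document that has to be parsed in full, while newline-delimited JSON holds one object per line and can be read lazily.

```python
import io
import json

# Hypothetical records for illustration only; not spaCy's training format.
records = [
    {"id": 0, "text": "First document."},
    {"id": 1, "text": "Second document."},
]

# Regular JSON: one document that has to be parsed in full.
json_blob = json.dumps(records)

# Newline-delimited JSON: one JSON object per line.
jsonl_blob = "\n".join(json.dumps(record) for record in records)


def iter_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


from_json = json.loads(json_blob)
from_jsonl = list(iter_jsonl(io.StringIO(jsonl_blob)))
```

Because `iter_jsonl` is a generator, a large corpus could be processed one record at a time instead of being loaded into memory first.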
| 
 | 
 | ||||||
| ### Converter options | ### Converter options | ||||||
| 
 | 
 | ||||||
|  | @ -453,8 +446,10 @@ improvement. | ||||||
| 
 | 
 | ||||||
| ```bash | ```bash | ||||||
| $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | ||||||
| [--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length] | [--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] | ||||||
| [--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start] | [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] | ||||||
|  | [--min-length]  [--seed] [--n-iter] [--use-vectors] [--n-save_every] | ||||||
|  | [--init-tok2vec] [--epoch-start] | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| | Argument                                              | Type       | Description                                                                                                                                                                     | | | Argument                                              | Type       | Description                                                                                                                                                                     | | ||||||
|  | @ -464,6 +459,10 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | ||||||
| | `output_dir`                                          | positional | Directory to write models to on each epoch.                                                                                                                                     | | | `output_dir`                                          | positional | Directory to write models to on each epoch.                                                                                                                                     | | ||||||
| | `--width`, `-cw`                                      | option     | Width of CNN layers.                                                                                                                                                            | | | `--width`, `-cw`                                      | option     | Width of CNN layers.                                                                                                                                                            | | ||||||
| | `--depth`, `-cd`                                      | option     | Depth of CNN layers.                                                                                                                                                            | | | `--depth`, `-cd`                                      | option     | Depth of CNN layers.                                                                                                                                                            | | ||||||
|  | | `--cnn-window`, `-cW` <Tag variant="new">2.2.2</Tag>  | option     | Window size for CNN layers.                                                                                                                                                     | | ||||||
|  | | `--cnn-pieces`, `-cP` <Tag variant="new">2.2.2</Tag>  | option     | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish).                                                                                             | | ||||||
|  | | `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag>  | flag       | Whether to use character-based embedding.                                                                                                                                       | | ||||||
|  | | `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag>    | option     | Depth of self-attention layers.                                                                                                                                                 | | ||||||
| | `--embed-rows`, `-er`                                 | option     | Number of embedding rows.                                                                                                                                                       | | | `--embed-rows`, `-er`                                 | option     | Number of embedding rows.                                                                                                                                                       | | ||||||
| | `--loss-func`, `-L`                                   | option     | Loss function to use for the objective. Either `"L2"` or `"cosine"`.                                                                                                            | | | `--loss-func`, `-L`                                   | option     | Loss function to use for the objective. Either `"L2"` or `"cosine"`.                                                                                                            | | ||||||
| | `--dropout`, `-d`                                     | option     | Dropout rate.                                                                                                                                                                   | | | `--dropout`, `-d`                                     | option     | Dropout rate.                                                                                                                                                                   | | ||||||
|  | @ -476,7 +475,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | ||||||
| | `--n-save-every`, `-se`                               | option     | Save model every X batches.                                                                                                                                                     | | | `--n-save-every`, `-se`                               | option     | Save model every X batches.                                                                                                                                                     | | ||||||
| | `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option     | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.                                                                     | | | `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option     | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.                                                                     | | ||||||
| | `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option     | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. | | | `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option     | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. | | ||||||
| | **CREATES**                                           | weights    | The pretrained weights that can be used to initialize `spacy train`.                                                                                                           | | | **CREATES**                                           | weights    | The pretrained weights that can be used to initialize `spacy train`.                                                                                                            | | ||||||
| 
 | 
 | ||||||
| ### JSONL format for raw text {#pretrain-jsonl} | ### JSONL format for raw text {#pretrain-jsonl} | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -202,6 +202,14 @@ All labels present in the match patterns. | ||||||
| | ----------- | ----- | ------------------ | | | ----------- | ----- | ------------------ | | ||||||
| | **RETURNS** | tuple | The string labels. | | | **RETURNS** | tuple | The string labels. | | ||||||
| 
 | 
 | ||||||
|  | ## EntityRuler.ent_ids {#labels tag="property" new="2.2.2"} | ||||||
|  | 
 | ||||||
|  | All entity ids present in the match patterns `id` properties. | ||||||
|  | 
 | ||||||
|  | | Name        | Type  | Description         | | ||||||
|  | | ----------- | ----- | ------------------- | | ||||||
|  | | **RETURNS** | tuple | The string ent_ids. | | ||||||
|  | 
 | ||||||
| ## EntityRuler.patterns {#patterns tag="property"} | ## EntityRuler.patterns {#patterns tag="property"} | ||||||
| 
 | 
 | ||||||
| Get all patterns that were added to the entity ruler. | Get all patterns that were added to the entity ruler. | ||||||
|  |  | ||||||
|  | @ -323,18 +323,38 @@ you can use to undo your changes. | ||||||
| > #### Example | > #### Example | ||||||
| > | > | ||||||
| > ```python | > ```python | ||||||
| > with nlp.disable_pipes('tagger', 'parser'): | > # New API as of v2.2.2 | ||||||
|  | > with nlp.disable_pipes(["tagger", "parser"]): | ||||||
|  | >    nlp.begin_training() | ||||||
|  | > | ||||||
|  | > with nlp.disable_pipes("tagger", "parser"): | ||||||
| >     nlp.begin_training() | >     nlp.begin_training() | ||||||
| > | > | ||||||
| > disabled = nlp.disable_pipes('tagger', 'parser') | > disabled = nlp.disable_pipes("tagger", "parser") | ||||||
| > nlp.begin_training() | > nlp.begin_training() | ||||||
| > disabled.restore() | > disabled.restore() | ||||||
| > ``` | > ``` | ||||||
| 
 | 
 | ||||||
| | Name        | Type            | Description                                                                          | | | Name                                      | Type            | Description                                                                          | | ||||||
| | ----------- | --------------- | ------------------------------------------------------------------------------------ | | | ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ | | ||||||
| | `*disabled` | unicode         | Names of pipeline components to disable.                                             | | | `disabled` <Tag variant="new">2.2.2</Tag> | list            | Names of pipeline components to disable.                                             | | ||||||
| | **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | | | `*disabled`                               | unicode         | Names of pipeline components to disable.                                             | | ||||||
|  | | **RETURNS**                               | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | | ||||||
|  | 
 | ||||||
|  | <Infobox title="Changed in v2.2.2" variant="warning"> | ||||||
|  | 
 | ||||||
|  | As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of | ||||||
|  | component names as its first argument (instead of a variable number of | ||||||
|  | arguments). This is especially useful if you're generating the component names | ||||||
|  | to disable programmatically. The new syntax will become the default in the | ||||||
|  | future. | ||||||
|  | 
 | ||||||
|  | ```diff | ||||||
|  | - disabled = nlp.disable_pipes("tagger", "parser") | ||||||
|  | + disabled = nlp.disable_pipes(["tagger", "parser"]) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | </Infobox> | ||||||
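
One way to picture how a single method can accept both call styles is a small argument normalizer. This is a sketch under stated assumptions: `normalize_names` is a hypothetical helper written for illustration and is not spaCy's implementation of `Language.disable_pipes`.

```python
def normalize_names(*names):
    """Return component names as a flat list for both call styles.

    Hypothetical helper for illustration; not spaCy's implementation.
    """
    # New style: a single list or tuple passed as the first argument.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    # Old style: a variable number of string arguments.
    return list(names)


old_style = normalize_names("tagger", "parser")
new_style = normalize_names(["tagger", "parser"])

# Programmatically generated names fit the list form without unpacking.
pipeline = ["tagger", "parser", "ner"]
all_but_ner = normalize_names([name for name in pipeline if name != "ner"])
```

The list form is what makes the last call convenient: the names come out of a comprehension and are passed on as they are.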
| 
 | 
 | ||||||
| ## Language.to_disk {#to_disk tag="method" new="2"} | ## Language.to_disk {#to_disk tag="method" new="2"} | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -157,16 +157,19 @@ overwritten. | ||||||
| | `on_match`  | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | | `on_match`  | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | ||||||
| | `*patterns` | list               | Match pattern. A pattern consists of a list of dicts, where each dict describes a token.      | | | `*patterns` | list               | Match pattern. A pattern consists of a list of dicts, where each dict describes a token.      | | ||||||
| 
 | 
 | ||||||
| <Infobox title="Changed in v2.0" variant="warning"> | <Infobox title="Changed in v2.2.2" variant="warning"> | ||||||
| 
 | 
 | ||||||
| As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated | As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become | ||||||
| and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that | the default in the future. The patterns are now the second argument and a list | ||||||
| lets you add a list of patterns and a callback for a given match ID. | (instead of a variable number of arguments). The `on_match` callback becomes an | ||||||
|  | optional keyword argument. | ||||||
| 
 | 
 | ||||||
| ```diff | ```diff | ||||||
| - matcher.add_entity("GoogleNow", on_match=merge_phrases) | patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]] | ||||||
| - matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}]) | - matcher.add("GoogleNow", None, *patterns) | ||||||
| + matcher.add('GoogleNow', merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}]) | + matcher.add("GoogleNow", patterns) | ||||||
|  | - matcher.add("GoogleNow", on_match, *patterns) | ||||||
|  | + matcher.add("GoogleNow", patterns, on_match=on_match) | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| </Infobox> | </Infobox> | ||||||
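
The relationship between the two call styles can be sketched as a plain argument mapping. The helper `to_new_style` below is hypothetical and not part of spaCy; it only assumes what the note above states, namely that the old style passes `on_match` (a callable or `None`) as the second positional argument, and the new style passes a list of patterns there.

```python
def to_new_style(key, *args, on_match=None):
    """Map either call style to a (key, patterns, on_match) tuple.

    Hypothetical helper for illustration; not part of spaCy.
    """
    if args and (args[0] is None or callable(args[0])):
        # Old style: add(key, on_match, *patterns)
        return key, list(args[1:]), args[0]
    if len(args) == 1:
        # New style: add(key, patterns, on_match=on_match)
        return key, list(args[0]), on_match
    raise TypeError("expected add(key, patterns) or add(key, on_match, *patterns)")


def on_match(matcher, doc, i, matches):
    """Placeholder callback with the documented argument names."""


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

old_call = to_new_style("GoogleNow", on_match, *patterns)
new_call = to_new_style("GoogleNow", patterns, on_match=on_match)
no_callback = to_new_style("GoogleNow", None, *patterns)
```

Both calls resolve to the same key, pattern list and callback, which is the equivalence the diff above describes.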
|  |  | ||||||
|  | @ -153,6 +153,23 @@ overwritten. | ||||||
| | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | ||||||
| | `*docs`    | `Doc`              | `Doc` objects of the phrases to match.                                                        | | | `*docs`    | `Doc`              | `Doc` objects of the phrases to match.                                                        | | ||||||
| 
 | 
 | ||||||
|  | <Infobox title="Changed in v2.2.2" variant="warning"> | ||||||
|  | 
 | ||||||
|  | As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will | ||||||
|  | become the default in the future. The `Doc` patterns are now the second argument | ||||||
|  | and a list (instead of a variable number of arguments). The `on_match` callback | ||||||
|  | becomes an optional keyword argument. | ||||||
|  | 
 | ||||||
|  | ```diff | ||||||
|  | patterns = [nlp("health care reform"), nlp("healthcare reform")] | ||||||
|  | - matcher.add("HEALTH", None, *patterns) | ||||||
|  | + matcher.add("HEALTH", patterns) | ||||||
|  | - matcher.add("HEALTH", on_match, *patterns) | ||||||
|  | + matcher.add("HEALTH", patterns, on_match=on_match) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | </Infobox> | ||||||
|  | 
 | ||||||
| ## PhraseMatcher.remove {#remove tag="method" new="2.2"} | ## PhraseMatcher.remove {#remove tag="method" new="2.2"} | ||||||
| 
 | 
 | ||||||
| Remove a rule from the matcher by match ID. A `KeyError` is raised if the key | Remove a rule from the matcher by match ID. A `KeyError` is raised if the key | ||||||
|  |  | ||||||
|  | @ -1,9 +1,33 @@ | ||||||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But  | <div | ||||||
| <mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google  |     class="entities" | ||||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware, |     style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px" | ||||||
| and  |     >But | ||||||
| <mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple  |     <mark | ||||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s Siri, available on iPhones, and  |         class="entity" | ||||||
| <mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon  |         style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s Alexa software, which runs on its Echo and Dot devices, have clear |         >Google | ||||||
| leads in consumer adoption.</div> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >ORG</span | ||||||
|  |         ></mark | ||||||
|  |     >is starting from behind. The company made a late push into hardware, and | ||||||
|  |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |         >Apple | ||||||
|  |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >ORG</span | ||||||
|  |         ></mark | ||||||
|  |     >’s Siri, available on iPhones, and | ||||||
|  |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |         >Amazon | ||||||
|  |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >ORG</span | ||||||
|  |         ></mark | ||||||
|  |     >’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer | ||||||
|  |     adoption.</div | ||||||
|  | > | ||||||
|  |  | ||||||
|  | @ -2,17 +2,25 @@ | ||||||
|     class="entities" |     class="entities" | ||||||
|     style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px" |     style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px" | ||||||
| > | > | ||||||
|     🌱🌿 <mark |     🌱🌿 | ||||||
|     class="entity" |     <mark | ||||||
|     style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone" |         class="entity" | ||||||
| >🐍 <span |         style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
| style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" |         >🐍 | ||||||
| >SNEK</span |         <span | ||||||
| ></mark> ____ 🌳🌲 ____ <mark |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
| class="entity" |             >SNEK</span | ||||||
| style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone" |         ></mark | ||||||
| >👨🌾 <span |     > | ||||||
| style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" |     ____ 🌳🌲 ____ | ||||||
| >HUMAN</span |     <mark | ||||||
| ></mark>  🏘️ |         class="entity" | ||||||
|  |         style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |         >👨🌾 | ||||||
|  |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >HUMAN</span | ||||||
|  |         ></mark | ||||||
|  |     > | ||||||
|  |     🏘️ | ||||||
| </div> | </div> | ||||||
|  |  | ||||||
|  | @ -1,16 +1,37 @@ | ||||||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"> | <div | ||||||
|     <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     class="entities" | ||||||
|  |     style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px" | ||||||
|  | > | ||||||
|  |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         Apple |         Apple | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >ORG</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
|     is looking at buying |     is looking at buying | ||||||
|     <mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         U.K. |         U.K. | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >GPE</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
|     startup for |     startup for | ||||||
|     <mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         $1 billion |         $1 billion | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >MONEY</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
| </div> | </div> | ||||||
|  |  | ||||||
|  | @ -1,18 +1,39 @@ | ||||||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"> | <div | ||||||
|  |     class="entities" | ||||||
|  |     style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px" | ||||||
|  | > | ||||||
|     When |     When | ||||||
|     <mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         Sebastian Thrun |         Sebastian Thrun | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >PERSON</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
|     started working on self-driving cars at |     started working on self-driving cars at | ||||||
|     <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         Google |         Google | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >ORG</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
|     in |     in | ||||||
|     <mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> |     <mark | ||||||
|  |         class="entity" | ||||||
|  |         style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em" | ||||||
|  |     > | ||||||
|         2007 |         2007 | ||||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span> |         <span | ||||||
|  |             style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" | ||||||
|  |             >DATE</span | ||||||
|  |         > | ||||||
|     </mark> |     </mark> | ||||||
|     , few people outside of the company took him seriously. |     , few people outside of the company took him seriously. | ||||||
| </div> | </div> | ||||||
|  |  | ||||||
|  | @ -986,6 +986,37 @@ doc = nlp("Apple is opening its first big office in San Francisco.") | ||||||
| print([(ent.text, ent.label_) for ent in doc.ents]) | print([(ent.text, ent.label_) for ent in doc.ents]) | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
|  | ### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"} | ||||||
|  | 
 | ||||||
|  | The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each | ||||||
|  | pattern. Using the `id` attribute allows multiple patterns to be associated with | ||||||
|  | the same entity. | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | ### {executable="true"} | ||||||
|  | from spacy.lang.en import English | ||||||
|  | from spacy.pipeline import EntityRuler | ||||||
|  | 
 | ||||||
|  | nlp = English() | ||||||
|  | ruler = EntityRuler(nlp) | ||||||
|  | patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"}, | ||||||
|  |             {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"}, | ||||||
|  |             {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}] | ||||||
|  | ruler.add_patterns(patterns) | ||||||
|  | nlp.add_pipe(ruler) | ||||||
|  | 
 | ||||||
|  | doc1 = nlp("Apple is opening its first big office in San Francisco.") | ||||||
|  | print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents]) | ||||||
|  | 
 | ||||||
|  | doc2 = nlp("Apple is opening its first big office in San Fran.") | ||||||
|  | print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents]) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | If the `id` attribute is included in the [`EntityRuler`](/api/entityruler) | ||||||
|  | patterns, the `ent_id_` property of the matched entity is set to the `id` given | ||||||
|  | in the patterns. So in the example above it's easy to identify that "San | ||||||
|  | Francisco" and "San Fran" are both the same entity. | ||||||
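
Once each entity carries an ID, surface forms can be grouped by it. The sketch below assumes hypothetical `(text, label, ent_id)` tuples shaped like the values printed in the example above; they are illustrative data, not captured output, and the snippet does not call spaCy.

```python
from collections import defaultdict

# Hypothetical (text, label, ent_id) tuples, shaped like the values printed
# in the example above; illustrative data, not captured output.
entities = [
    ("Apple", "ORG", "apple"),
    ("San Francisco", "GPE", "san-francisco"),
    ("San Fran", "GPE", "san-francisco"),
]

# Collect every surface form that shares the same entity ID.
mentions_by_id = defaultdict(list)
for text, label, ent_id in entities:
    mentions_by_id[ent_id].append(text)
```

Grouping on the ID rather than on the text is what lets different spellings count as mentions of one entity.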
|  | 
 | ||||||
| The entity ruler is designed to integrate with spaCy's existing statistical | The entity ruler is designed to integrate with spaCy's existing statistical | ||||||
| models and enhance the named entity recognizer. If it's added **before the | models and enhance the named entity recognizer. If it's added **before the | ||||||
| `"ner"` component**, the entity recognizer will respect the existing entity | `"ner"` component**, the entity recognizer will respect the existing entity | ||||||
|  |  | ||||||
|  | @ -127,6 +127,7 @@ | ||||||
|         { "code": "sr", "name": "Serbian" }, |         { "code": "sr", "name": "Serbian" }, | ||||||
|         { "code": "sk", "name": "Slovak" }, |         { "code": "sk", "name": "Slovak" }, | ||||||
|         { "code": "sl", "name": "Slovenian" }, |         { "code": "sl", "name": "Slovenian" }, | ||||||
|  |         { "code": "lb", "name": "Luxembourgish" }, | ||||||
|         { |         { | ||||||
|             "code": "sq", |             "code": "sq", | ||||||
|             "name": "Albanian", |             "name": "Albanian", | ||||||
|  |  | ||||||