Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-01 00:17:44 +03:00)

Merge branch 'master' into spacy.io

Commit 50b117c072
							
								
								
									
.github/contributors/isaric.md (106 changes, vendored, normal file)
							|  | @ -0,0 +1,106 @@ | |||
| # spaCy contributor agreement | ||||
| 
 | ||||
| This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||
| [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||
| The SCA applies to any contribution that you make to any product or project | ||||
| managed by us (the **"project"**), and sets out the intellectual property rights | ||||
| you grant to us in the contributed materials. The term **"us"** shall mean | ||||
| [ExplosionAI GmbH](https://explosion.ai/legal). The term | ||||
| **"you"** shall mean the person or entity identified below. | ||||
| 
 | ||||
| If you agree to be bound by these terms, fill in the information requested | ||||
| below and include the filled-in version with your first pull request, under the | ||||
| folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||
| should be your GitHub username, with the extension `.md`. For example, the user | ||||
| example_user would create the file `.github/contributors/example_user.md`. | ||||
| 
 | ||||
| Read this agreement carefully before signing. These terms and conditions | ||||
| constitute a binding legal agreement. | ||||
| 
 | ||||
| ## Contributor Agreement | ||||
| 
 | ||||
| 1. The term "contribution" or "contributed materials" means any source code, | ||||
| object code, patch, tool, sample, graphic, specification, manual, | ||||
| documentation, or any other material posted or submitted by you to the project. | ||||
| 
 | ||||
| 2. With respect to any worldwide copyrights, or copyright applications and | ||||
| registrations, in your contribution: | ||||
| 
 | ||||
|     * you hereby assign to us joint ownership, and to the extent that such | ||||
|     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||
|     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||
|     royalty-free, unrestricted license to exercise all rights under those | ||||
|     copyrights. This includes, at our option, the right to sublicense these same | ||||
|     rights to third parties through multiple levels of sublicensees or other | ||||
|     licensing arrangements; | ||||
| 
 | ||||
|     * you agree that each of us can do all things in relation to your | ||||
|     contribution as if each of us were the sole owners, and if one of us makes | ||||
|     a derivative work of your contribution, the one who makes the derivative | ||||
|     work (or has it made will be the sole owner of that derivative work; | ||||
| 
 | ||||
|     * you agree that you will not assert any moral rights in your contribution | ||||
|     against us, our licensees or transferees; | ||||
| 
 | ||||
|     * you agree that we may register a copyright in your contribution and | ||||
|     exercise all ownership rights associated with it; and | ||||
| 
 | ||||
|     * you agree that neither of us has any duty to consult with, obtain the | ||||
|     consent of, pay or render an accounting to the other for any use or | ||||
|     distribution of your contribution. | ||||
| 
 | ||||
| 3. With respect to any patents you own, or that you can license without payment | ||||
| to any third party, you hereby grant to us a perpetual, irrevocable, | ||||
| non-exclusive, worldwide, no-charge, royalty-free license to: | ||||
| 
 | ||||
|     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||
|     your contribution in whole or in part, alone or in combination with or | ||||
|     included in any product, work or materials arising out of the project to | ||||
|     which your contribution was submitted, and | ||||
| 
 | ||||
|     * at our option, to sublicense these same rights to third parties through | ||||
|     multiple levels of sublicensees or other licensing arrangements. | ||||
| 
 | ||||
| 4. Except as set out above, you keep all right, title, and interest in your | ||||
| contribution. The rights that you grant to us under these terms are effective | ||||
| on the date you first submitted a contribution to us, even if your submission | ||||
| took place before the date you sign these terms. | ||||
| 
 | ||||
| 5. You covenant, represent, warrant and agree that: | ||||
| 
 | ||||
|     * Each contribution that you submit is and shall be an original work of | ||||
|     authorship and you can legally grant the rights set out in this SCA; | ||||
| 
 | ||||
|     * to the best of your knowledge, each contribution will not violate any | ||||
|     third party's copyrights, trademarks, patents, or other intellectual | ||||
|     property rights; and | ||||
| 
 | ||||
|     * each contribution shall be in compliance with U.S. export control laws and | ||||
|     other applicable export and import laws. You agree to notify us if you | ||||
|     become aware of any circumstance which would make any of the foregoing | ||||
|     representations inaccurate in any respect. We may publicly disclose your | ||||
|     participation in the project, including the fact that you have signed the SCA. | ||||
| 
 | ||||
| 6. This SCA is governed by the laws of the State of California and applicable | ||||
| U.S. Federal law. Any choice of law rules will not apply. | ||||
| 
 | ||||
| 7. Please place an “x” on one of the applicable statement below. Please do NOT | ||||
| mark both statements: | ||||
| 
 | ||||
|     * [x] I am signing on behalf of myself as an individual and no other person | ||||
|     or entity, including my employer, has or will have rights with respect to my | ||||
|     contributions. | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||
|     actual authority to contractually bind that entity. | ||||
| 
 | ||||
| ## Contributor Details | ||||
| 
 | ||||
| | Field                          | Entry                | | ||||
| |------------------------------- | -------------------- | | ||||
| | Name                           |  Ivan Šarić          | | ||||
| | Company name (if applicable)   |                      | | ||||
| | Title or role (if applicable)  |                      | | ||||
| | Date                           | 18.08.2019.          | | ||||
| | GitHub username                | isaric               | | ||||
| | Website (optional)             |                      | | ||||
							
								
								
									
.github/contributors/yanaiela.md (106 changes, vendored, normal file)
							|  | @ -0,0 +1,106 @@ | |||
| # spaCy contributor agreement | ||||
| 
 | ||||
| This spaCy Contributor Agreement (**"SCA"**) is based on the | ||||
| [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). | ||||
| The SCA applies to any contribution that you make to any product or project | ||||
| managed by us (the **"project"**), and sets out the intellectual property rights | ||||
| you grant to us in the contributed materials. The term **"us"** shall mean | ||||
| [ExplosionAI GmbH](https://explosion.ai/legal). The term | ||||
| **"you"** shall mean the person or entity identified below. | ||||
| 
 | ||||
| If you agree to be bound by these terms, fill in the information requested | ||||
| below and include the filled-in version with your first pull request, under the | ||||
| folder [`.github/contributors/`](/.github/contributors/). The name of the file | ||||
| should be your GitHub username, with the extension `.md`. For example, the user | ||||
| example_user would create the file `.github/contributors/example_user.md`. | ||||
| 
 | ||||
| Read this agreement carefully before signing. These terms and conditions | ||||
| constitute a binding legal agreement. | ||||
| 
 | ||||
| ## Contributor Agreement | ||||
| 
 | ||||
| 1. The term "contribution" or "contributed materials" means any source code, | ||||
| object code, patch, tool, sample, graphic, specification, manual, | ||||
| documentation, or any other material posted or submitted by you to the project. | ||||
| 
 | ||||
| 2. With respect to any worldwide copyrights, or copyright applications and | ||||
| registrations, in your contribution: | ||||
| 
 | ||||
|     * you hereby assign to us joint ownership, and to the extent that such | ||||
|     assignment is or becomes invalid, ineffective or unenforceable, you hereby | ||||
|     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, | ||||
|     royalty-free, unrestricted license to exercise all rights under those | ||||
|     copyrights. This includes, at our option, the right to sublicense these same | ||||
|     rights to third parties through multiple levels of sublicensees or other | ||||
|     licensing arrangements; | ||||
| 
 | ||||
|     * you agree that each of us can do all things in relation to your | ||||
|     contribution as if each of us were the sole owners, and if one of us makes | ||||
|     a derivative work of your contribution, the one who makes the derivative | ||||
|     work (or has it made will be the sole owner of that derivative work; | ||||
| 
 | ||||
|     * you agree that you will not assert any moral rights in your contribution | ||||
|     against us, our licensees or transferees; | ||||
| 
 | ||||
|     * you agree that we may register a copyright in your contribution and | ||||
|     exercise all ownership rights associated with it; and | ||||
| 
 | ||||
|     * you agree that neither of us has any duty to consult with, obtain the | ||||
|     consent of, pay or render an accounting to the other for any use or | ||||
|     distribution of your contribution. | ||||
| 
 | ||||
| 3. With respect to any patents you own, or that you can license without payment | ||||
| to any third party, you hereby grant to us a perpetual, irrevocable, | ||||
| non-exclusive, worldwide, no-charge, royalty-free license to: | ||||
| 
 | ||||
|     * make, have made, use, sell, offer to sell, import, and otherwise transfer | ||||
|     your contribution in whole or in part, alone or in combination with or | ||||
|     included in any product, work or materials arising out of the project to | ||||
|     which your contribution was submitted, and | ||||
| 
 | ||||
|     * at our option, to sublicense these same rights to third parties through | ||||
|     multiple levels of sublicensees or other licensing arrangements. | ||||
| 
 | ||||
| 4. Except as set out above, you keep all right, title, and interest in your | ||||
| contribution. The rights that you grant to us under these terms are effective | ||||
| on the date you first submitted a contribution to us, even if your submission | ||||
| took place before the date you sign these terms. | ||||
| 
 | ||||
| 5. You covenant, represent, warrant and agree that: | ||||
| 
 | ||||
|     * Each contribution that you submit is and shall be an original work of | ||||
|     authorship and you can legally grant the rights set out in this SCA; | ||||
| 
 | ||||
|     * to the best of your knowledge, each contribution will not violate any | ||||
|     third party's copyrights, trademarks, patents, or other intellectual | ||||
|     property rights; and | ||||
| 
 | ||||
|     * each contribution shall be in compliance with U.S. export control laws and | ||||
|     other applicable export and import laws. You agree to notify us if you | ||||
|     become aware of any circumstance which would make any of the foregoing | ||||
|     representations inaccurate in any respect. We may publicly disclose your | ||||
|     participation in the project, including the fact that you have signed the SCA. | ||||
| 
 | ||||
| 6. This SCA is governed by the laws of the State of California and applicable | ||||
| U.S. Federal law. Any choice of law rules will not apply. | ||||
| 
 | ||||
| 7. Please place an “x” on one of the applicable statement below. Please do NOT | ||||
| mark both statements: | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of myself as an individual and no other person | ||||
|     or entity, including my employer, has or will have rights with respect to my | ||||
|     contributions. | ||||
| 
 | ||||
|     * [ ] I am signing on behalf of my employer or a legal entity and I have the | ||||
|     actual authority to contractually bind that entity. | ||||
| 
 | ||||
| ## Contributor Details | ||||
| 
 | ||||
| | Field                          | Entry                | | ||||
| |------------------------------- | -------------------- | | ||||
| | Name                           | Yanai Elazar                     | | ||||
| | Company name (if applicable)   |                      | | ||||
| | Title or role (if applicable)  |                      | | ||||
| | Date                           | 14/8/2019                     | | ||||
| | GitHub username                | yanaiela                     | | ||||
| | Website (optional)             |  https://yanaiela.github.io                    | | ||||
							
								
								
									
.gitignore (1 change, vendored)
							|  | @ -3,6 +3,7 @@ spacy/data/ | |||
| corpora/ | ||||
| /models/ | ||||
| keys/ | ||||
| *.json.gz | ||||
| 
 | ||||
| # Website | ||||
| website/.cache/ | ||||
|  |  | |||
							
								
								
									
spacy/_ml.py (16 changes)
							|  | @ -674,14 +674,14 @@ def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg): | |||
|     with Model.define_operators({">>": chain, "**": clone}): | ||||
|         # context encoder | ||||
|         tok2vec = Tok2Vec( | ||||
|                 width=hidden_width, | ||||
|                 embed_size=embed_width, | ||||
|                 pretrained_vectors=pretrained_vectors, | ||||
|                 cnn_maxout_pieces=cnn_maxout_pieces, | ||||
|                 subword_features=True, | ||||
|                 conv_depth=conv_depth, | ||||
|                 bilstm_depth=0, | ||||
|             ) | ||||
|             width=hidden_width, | ||||
|             embed_size=embed_width, | ||||
|             pretrained_vectors=pretrained_vectors, | ||||
|             cnn_maxout_pieces=cnn_maxout_pieces, | ||||
|             subword_features=True, | ||||
|             conv_depth=conv_depth, | ||||
|             bilstm_depth=0, | ||||
|         ) | ||||
| 
 | ||||
|         model = ( | ||||
|             tok2vec | ||||
|  |  | |||
|  | @ -8,7 +8,7 @@ import sys | |||
| import srsly | ||||
| from wasabi import Printer, MESSAGES | ||||
| 
 | ||||
| from ..gold import GoldCorpus, read_json_object | ||||
| from ..gold import GoldCorpus | ||||
| from ..syntax import nonproj | ||||
| from ..util import load_model, get_lang_class | ||||
| 
 | ||||
|  | @ -95,13 +95,19 @@ def debug_data( | |||
|         corpus = GoldCorpus(train_path, dev_path) | ||||
|         try: | ||||
|             train_docs = list(corpus.train_docs(nlp)) | ||||
|             train_docs_unpreprocessed = list(corpus.train_docs_without_preprocessing(nlp)) | ||||
|             train_docs_unpreprocessed = list( | ||||
|                 corpus.train_docs_without_preprocessing(nlp) | ||||
|             ) | ||||
|         except ValueError as e: | ||||
|             loading_train_error_message = "Training data cannot be loaded: {}".format(str(e)) | ||||
|             loading_train_error_message = "Training data cannot be loaded: {}".format( | ||||
|                 str(e) | ||||
|             ) | ||||
|         try: | ||||
|             dev_docs = list(corpus.dev_docs(nlp)) | ||||
|         except ValueError as e: | ||||
|             loading_dev_error_message = "Development data cannot be loaded: {}".format(str(e)) | ||||
|             loading_dev_error_message = "Development data cannot be loaded: {}".format( | ||||
|                 str(e) | ||||
|             ) | ||||
|     if loading_train_error_message or loading_dev_error_message: | ||||
|         if loading_train_error_message: | ||||
|             msg.fail(loading_train_error_message) | ||||
|  | @ -158,11 +164,15 @@ def debug_data( | |||
|     ) | ||||
|     if gold_train_data["n_misaligned_words"] > 0: | ||||
|         msg.warn( | ||||
|             "{} misaligned tokens in the training data".format(gold_train_data["n_misaligned_words"]) | ||||
|             "{} misaligned tokens in the training data".format( | ||||
|                 gold_train_data["n_misaligned_words"] | ||||
|             ) | ||||
|         ) | ||||
|     if gold_dev_data["n_misaligned_words"] > 0: | ||||
|         msg.warn( | ||||
|             "{} misaligned tokens in the dev data".format(gold_dev_data["n_misaligned_words"]) | ||||
|             "{} misaligned tokens in the dev data".format( | ||||
|                 gold_dev_data["n_misaligned_words"] | ||||
|             ) | ||||
|         ) | ||||
|     most_common_words = gold_train_data["words"].most_common(10) | ||||
|     msg.text( | ||||
|  | @ -184,7 +194,9 @@ def debug_data( | |||
| 
 | ||||
|     if "ner" in pipeline: | ||||
|         # Get all unique NER labels present in the data | ||||
|         labels = set(label for label in gold_train_data["ner"] if label not in ("O", "-")) | ||||
|         labels = set( | ||||
|             label for label in gold_train_data["ner"] if label not in ("O", "-") | ||||
|         ) | ||||
|         label_counts = gold_train_data["ner"] | ||||
|         model_labels = _get_labels_from_model(nlp, "ner") | ||||
|         new_labels = [l for l in labels if l not in model_labels] | ||||
|  | @ -222,7 +234,9 @@ def debug_data( | |||
|             ) | ||||
| 
 | ||||
|         if gold_train_data["ws_ents"]: | ||||
|             msg.fail("{} invalid whitespace entity spans".format(gold_train_data["ws_ents"])) | ||||
|             msg.fail( | ||||
|                 "{} invalid whitespace entity spans".format(gold_train_data["ws_ents"]) | ||||
|             ) | ||||
|             has_ws_ents_error = True | ||||
| 
 | ||||
|         for label in new_labels: | ||||
|  | @ -323,33 +337,36 @@ def debug_data( | |||
|             "Found {} sentence{} with an average length of {:.1f} words.".format( | ||||
|                 gold_train_data["n_sents"], | ||||
|                 "s" if len(train_docs) > 1 else "", | ||||
|                 gold_train_data["n_words"] / gold_train_data["n_sents"] | ||||
|                 gold_train_data["n_words"] / gold_train_data["n_sents"], | ||||
|             ) | ||||
|         ) | ||||
| 
 | ||||
|         # profile labels | ||||
|         labels_train = [label for label in gold_train_data["deps"]] | ||||
|         labels_train_unpreprocessed = [label for label in gold_train_unpreprocessed_data["deps"]] | ||||
|         labels_train_unpreprocessed = [ | ||||
|             label for label in gold_train_unpreprocessed_data["deps"] | ||||
|         ] | ||||
|         labels_dev = [label for label in gold_dev_data["deps"]] | ||||
| 
 | ||||
|         if gold_train_unpreprocessed_data["n_nonproj"] > 0: | ||||
|             msg.info( | ||||
|                 "Found {} nonprojective train sentence{}".format( | ||||
|                     gold_train_unpreprocessed_data["n_nonproj"], | ||||
|                     "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "" | ||||
|                     "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "", | ||||
|                 ) | ||||
|             ) | ||||
|         if gold_dev_data["n_nonproj"] > 0: | ||||
|             msg.info( | ||||
|                 "Found {} nonprojective dev sentence{}".format( | ||||
|                     gold_dev_data["n_nonproj"], | ||||
|                     "s" if gold_dev_data["n_nonproj"] > 1 else "" | ||||
|                     "s" if gold_dev_data["n_nonproj"] > 1 else "", | ||||
|                 ) | ||||
|             ) | ||||
| 
 | ||||
|         msg.info( | ||||
|             "{} {} in train data".format( | ||||
|                 len(labels_train_unpreprocessed), "label" if len(labels_train) == 1 else "labels" | ||||
|                 len(labels_train_unpreprocessed), | ||||
|                 "label" if len(labels_train) == 1 else "labels", | ||||
|             ) | ||||
|         ) | ||||
|         msg.info( | ||||
|  | @ -373,43 +390,45 @@ def debug_data( | |||
|                 ) | ||||
|                 has_low_data_warning = True | ||||
| 
 | ||||
| 
 | ||||
|         # rare labels in projectivized train | ||||
|         rare_projectivized_labels = [] | ||||
|         for label in gold_train_data["deps"]: | ||||
|             if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label: | ||||
|                 rare_projectivized_labels.append("{}: {}".format(label, str(gold_train_data["deps"][label]))) | ||||
|                 rare_projectivized_labels.append( | ||||
|                     "{}: {}".format(label, str(gold_train_data["deps"][label])) | ||||
|                 ) | ||||
| 
 | ||||
|         if len(rare_projectivized_labels) > 0: | ||||
|                 msg.warn( | ||||
|                     "Low number of examples for {} label{} in the " | ||||
|                     "projectivized dependency trees used for training. You may " | ||||
|                     "want to projectivize labels such as punct before " | ||||
|                     "training in order to improve parser performance.".format( | ||||
|                         len(rare_projectivized_labels), | ||||
|                         "s" if len(rare_projectivized_labels) > 1 else "") | ||||
|             msg.warn( | ||||
|                 "Low number of examples for {} label{} in the " | ||||
|                 "projectivized dependency trees used for training. You may " | ||||
|                 "want to projectivize labels such as punct before " | ||||
|                 "training in order to improve parser performance.".format( | ||||
|                     len(rare_projectivized_labels), | ||||
|                     "s" if len(rare_projectivized_labels) > 1 else "", | ||||
|                 ) | ||||
|                 msg.warn( | ||||
|                     "Projectivized labels with low numbers of examples: " | ||||
|                     "{}".format("\n".join(rare_projectivized_labels)), | ||||
|                     show=verbose | ||||
|                 ) | ||||
|                 has_low_data_warning = True | ||||
|             ) | ||||
|             msg.warn( | ||||
|                 "Projectivized labels with low numbers of examples: " | ||||
|                 "{}".format("\n".join(rare_projectivized_labels)), | ||||
|                 show=verbose, | ||||
|             ) | ||||
|             has_low_data_warning = True | ||||
| 
 | ||||
|         # labels only in train | ||||
|         if set(labels_train) - set(labels_dev): | ||||
|             msg.warn( | ||||
|                 "The following labels were found only in the train data: " | ||||
|                 "{}".format(", ".join(set(labels_train) - set(labels_dev))), | ||||
|                 show=verbose | ||||
|                 show=verbose, | ||||
|             ) | ||||
| 
 | ||||
|         # labels only in dev | ||||
|         if set(labels_dev) - set(labels_train): | ||||
|             msg.warn( | ||||
|                 "The following labels were found only in the dev data: " + | ||||
|                 ", ".join(set(labels_dev) - set(labels_train)), | ||||
|                 show=verbose | ||||
|                 "The following labels were found only in the dev data: " | ||||
|                 + ", ".join(set(labels_dev) - set(labels_train)), | ||||
|                 show=verbose, | ||||
|             ) | ||||
| 
 | ||||
|         if has_low_data_warning: | ||||
|  | @ -422,8 +441,10 @@ def debug_data( | |||
|         # multiple root labels | ||||
|         if len(gold_train_unpreprocessed_data["roots"]) > 1: | ||||
|             msg.warn( | ||||
|                 "Multiple root labels ({}) ".format(", ".join(gold_train_unpreprocessed_data["roots"])) + | ||||
|                 "found in training data. spaCy's parser uses a single root " | ||||
|                 "Multiple root labels ({}) ".format( | ||||
|                     ", ".join(gold_train_unpreprocessed_data["roots"]) | ||||
|                 ) | ||||
|                 + "found in training data. spaCy's parser uses a single root " | ||||
|                 "label ROOT so this distinction will not be available." | ||||
|             ) | ||||
| 
 | ||||
|  | @ -432,14 +453,14 @@ def debug_data( | |||
|             msg.fail( | ||||
|                 "Found {} nonprojective projectivized train sentence{}".format( | ||||
|                     gold_train_data["n_nonproj"], | ||||
|                     "s" if gold_train_data["n_nonproj"] > 1 else "" | ||||
|                     "s" if gold_train_data["n_nonproj"] > 1 else "", | ||||
|                 ) | ||||
|             ) | ||||
|         if gold_train_data["n_cycles"] > 0: | ||||
|             msg.fail( | ||||
|                 "Found {} projectivized train sentence{} with cycles".format( | ||||
|                     gold_train_data["n_cycles"], | ||||
|                     "s" if gold_train_data["n_cycles"] > 1 else "" | ||||
|                     "s" if gold_train_data["n_cycles"] > 1 else "", | ||||
|                 ) | ||||
|             ) | ||||
| 
 | ||||
|  |  | |||
|  | @ -84,12 +84,12 @@ def evaluate( | |||
| def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True): | ||||
|     docs[0].user_data["title"] = model_name | ||||
|     if ents: | ||||
|         with (output_path / "entities.html").open("w") as file_: | ||||
|             html = displacy.render(docs[:limit], style="ent", page=True) | ||||
|         html = displacy.render(docs[:limit], style="ent", page=True) | ||||
|         with (output_path / "entities.html").open("w", encoding="utf8") as file_: | ||||
|             file_.write(html) | ||||
|     if deps: | ||||
|         with (output_path / "parses.html").open("w") as file_: | ||||
|             html = displacy.render( | ||||
|                 docs[:limit], style="dep", page=True, options={"compact": True} | ||||
|             ) | ||||
|         html = displacy.render( | ||||
|             docs[:limit], style="dep", page=True, options={"compact": True} | ||||
|         ) | ||||
|         with (output_path / "parses.html").open("w", encoding="utf8") as file_: | ||||
|             file_.write(html) | ||||
|  |  | |||
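The `render_parses` hunk above calls `displacy.render` before opening the output file, and opens the file with an explicit `encoding="utf8"`. The following is a minimal sketch of that pattern, not the code from the commit: `fake_render` and `write_html` are hypothetical names, and `fake_render` only stands in for `displacy.render` so the example runs without spaCy installed.

```python
import tempfile
from pathlib import Path


def fake_render(texts):
    # Hypothetical stand-in for displacy.render: returns an HTML string
    # that may contain non-ASCII characters.
    return "<html><body>" + "<br>".join(texts) + "</body></html>"


def write_html(output_path, texts):
    # Build the markup first, then open the file only for writing it.
    html = fake_render(texts)
    # An explicit encoding keeps the result independent of the platform's
    # default locale encoding (an assumption about the motivation here).
    target = output_path / "entities.html"
    with target.open("w", encoding="utf8") as file_:
        file_.write(html)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = write_html(Path(tmp), ["Ovo je rečenica.", "Nećete vjerovati"])
    written = path.read_text(encoding="utf8")
```

Without the `encoding` argument, `Path.open` falls back to the locale's preferred encoding, which on some systems cannot represent characters such as `č`.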
|  | @ -114,7 +114,7 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc): | |||
|             probs, _ = read_freqs(freqs_loc) | ||||
|         msg.good("Counted frequencies") | ||||
|     else: | ||||
|         probs, _ = ({}, DEFAULT_OOV_PROB) | ||||
|         probs, _ = ({}, DEFAULT_OOV_PROB)  # noqa: F841 | ||||
|     if clusters_loc: | ||||
|         with msg.loading("Reading clusters..."): | ||||
|             clusters = read_clusters(clusters_loc) | ||||
|  |  | |||
|  | @ -247,6 +247,15 @@ class EntityRenderer(object): | |||
|         self.direction = DEFAULT_DIR | ||||
|         self.lang = DEFAULT_LANG | ||||
| 
 | ||||
|         template = options.get("template") | ||||
|         if template: | ||||
|             self.ent_template = template | ||||
|         else: | ||||
|             if self.direction == "rtl": | ||||
|                 self.ent_template = TPL_ENT_RTL | ||||
|             else: | ||||
|                 self.ent_template = TPL_ENT | ||||
| 
 | ||||
|     def render(self, parsed, page=False, minify=False): | ||||
|         """Render complete markup. | ||||
| 
 | ||||
|  | @ -284,6 +293,7 @@ class EntityRenderer(object): | |||
|             label = span["label"] | ||||
|             start = span["start"] | ||||
|             end = span["end"] | ||||
|             additional_params = span.get("params", {}) | ||||
|             entity = escape_html(text[start:end]) | ||||
|             fragments = text[offset:start].split("\n") | ||||
|             for i, fragment in enumerate(fragments): | ||||
|  | @ -293,10 +303,8 @@ class EntityRenderer(object): | |||
|             if self.ents is None or label.upper() in self.ents: | ||||
|                 color = self.colors.get(label.upper(), self.default_color) | ||||
|                 ent_settings = {"label": label, "text": entity, "bg": color} | ||||
|                 if self.direction == "rtl": | ||||
|                     markup += TPL_ENT_RTL.format(**ent_settings) | ||||
|                 else: | ||||
|                     markup += TPL_ENT.format(**ent_settings) | ||||
|                 ent_settings.update(additional_params) | ||||
|                 markup += self.ent_template.format(**ent_settings) | ||||
|             else: | ||||
|                 markup += entity | ||||
|             offset = end | ||||
|  |  | |||
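The `EntityRenderer` hunks above add a `template` option and merge a per-span `params` dict into the settings passed to the template. The following is a rough sketch of that merge-and-format step in isolation. `CUSTOM_TEMPLATE`, `render_entity`, the `kb_id` placeholder and its value are all made up for illustration; spaCy's real `TPL_ENT` markup differs, and HTML escaping is left out for brevity.

```python
# Hypothetical template; it is not spaCy's TPL_ENT.
CUSTOM_TEMPLATE = (
    '<mark style="background: {bg}" data-kb="{kb_id}">'
    "{text} <b>{label}</b></mark>"
)

DEFAULT_COLOR = "#ddd"


def render_entity(text, span, template, colors=None):
    colors = colors or {}
    label = span["label"]
    entity = text[span["start"]:span["end"]]
    ent_settings = {
        "label": label,
        "text": entity,
        "bg": colors.get(label.upper(), DEFAULT_COLOR),
    }
    # Extra per-span values fill additional placeholders in the template,
    # mirroring ent_settings.update(additional_params) in the diff.
    ent_settings.update(span.get("params", {}))
    return template.format(**ent_settings)


text = "Zagreb je udaljen od Ljubljane svega 150 km."
span = {
    "start": 0,
    "end": 6,
    "label": "GPE",
    "params": {"kb_id": "example-id"},  # made-up value
}
markup = render_entity(text, span, CUSTOM_TEMPLATE, colors={"GPE": "#feca74"})
```

Because `str.format` raises `KeyError` for a placeholder with no matching key, a template that introduces extra placeholders needs every rendered span to supply them through `params`.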
|  | @ -429,6 +429,7 @@ class Errors(object): | |||
|     E155 = ("The `nlp` object should have access to pre-trained word vectors, cf. " | ||||
|             "https://spacy.io/usage/models#languages.") | ||||
| 
 | ||||
| 
 | ||||
| @add_codes | ||||
| class TempErrors(object): | ||||
|     T003 = ("Resizing pre-trained Tagger models is not currently supported.") | ||||
|  |  | |||
							
								
								
									
spacy/lang/hr/examples.py (18 changes, normal file)
							|  | @ -0,0 +1,18 @@ | |||
| # coding: utf8 | ||||
| from __future__ import unicode_literals | ||||
| 
 | ||||
| """ | ||||
| Example sentences to test spaCy and its language models. | ||||
| 
 | ||||
| >>> from spacy.lang.hr.examples import sentences | ||||
| >>> docs = nlp.pipe(sentences) | ||||
| """ | ||||
| 
 | ||||
| sentences = [ | ||||
|     "Ovo je rečenica.", | ||||
|     "Kako se popravlja auto?", | ||||
|     "Zagreb je udaljen od Ljubljane svega 150 km.", | ||||
|     "Nećete vjerovati što se dogodilo na ovogodišnjem festivalu!", | ||||
|     "Budućnost Apple je upitna nakon dugotrajnog pada vrijednosti dionica firme.", | ||||
|     "Trgovina oružjem predstavlja prijetnju za globalni mir.", | ||||
| ] | ||||
|  | @ -1,10 +1,8 @@ | |||
| # encoding: utf8 | ||||
| from __future__ import unicode_literals, print_function | ||||
| 
 | ||||
| import re | ||||
| import sys | ||||
| 
 | ||||
| 
 | ||||
| from .stop_words import STOP_WORDS | ||||
| from .tag_map import TAG_MAP | ||||
| from ...attrs import LANG | ||||
|  | @ -32,7 +30,7 @@ else: | |||
|     from typing import NamedTuple | ||||
| 
 | ||||
|     class Morpheme(NamedTuple): | ||||
|          | ||||
| 
 | ||||
|         surface = str("") | ||||
|         lemma = str("") | ||||
|         tag = str("") | ||||
|  |  | |||
|  | @ -109,7 +109,7 @@ for orth in [ | |||
| 
 | ||||
| 
 | ||||
| emoticons = set( | ||||
|     """ | ||||
|     r""" | ||||
| :) | ||||
| :-) | ||||
| :)) | ||||
|  |  | |||
|  | @ -8,6 +8,7 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS | |||
| from .stop_words import STOP_WORDS | ||||
| from .tag_map import TAG_MAP | ||||
| 
 | ||||
| 
 | ||||
| class ChineseDefaults(Language.Defaults): | ||||
|     lex_attr_getters = dict(Language.Defaults.lex_attr_getters) | ||||
|     lex_attr_getters[LANG] = lambda text: "zh" | ||||
|  | @ -45,4 +46,4 @@ class Chinese(Language): | |||
|             return Doc(self.vocab, words=words, spaces=spaces) | ||||
| 
 | ||||
| 
 | ||||
| __all__ = ["Chinese"] | ||||
| __all__ = ["Chinese"] | ||||
|  |  | |||
|  | @ -1,8 +1,8 @@ | |||
| # coding: utf8 | ||||
| from __future__ import unicode_literals | ||||
| 
 | ||||
| from ...symbols import POS, PUNCT, SYM, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB | ||||
| from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX | ||||
| from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB | ||||
| from ...symbols import NOUN, PART, INTJ, PRON | ||||
| 
 | ||||
| # The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. | ||||
| # We also map the tags to the simpler Google Universal POS tag set. | ||||
|  | @ -43,5 +43,5 @@ TAG_MAP = { | |||
|     "JJ": {POS: ADJ}, | ||||
|     "P": {POS: ADP}, | ||||
|     "PN": {POS: PRON}, | ||||
|     "PU": {POS: PUNCT} | ||||
| } | ||||
|     "PU": {POS: PUNCT}, | ||||
| } | ||||
|  |  | |||
|  | @ -160,14 +160,15 @@ class Scorer(object): | |||
|                     cand_deps.add((gold_i, gold_head, token.dep_.lower())) | ||||
|         if "-" not in [token[-1] for token in gold.orig_annot]: | ||||
|             # Find all NER labels in gold and doc | ||||
|             ent_labels = set([x[0] for x in gold_ents] | ||||
|                     + [k.label_ for k in doc.ents]) | ||||
|             ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents]) | ||||
|             # Set up all labels for per type scoring and prepare gold per type | ||||
|             gold_per_ents = {ent_label: set() for ent_label in ent_labels} | ||||
|             for ent_label in ent_labels: | ||||
|                 if ent_label not in self.ner_per_ents: | ||||
|                     self.ner_per_ents[ent_label] = PRFScore() | ||||
|                 gold_per_ents[ent_label].update([x for x in gold_ents if x[0] == ent_label]) | ||||
|                 gold_per_ents[ent_label].update( | ||||
|                     [x for x in gold_ents if x[0] == ent_label] | ||||
|                 ) | ||||
|             # Find all candidate labels, for all and per type | ||||
|             cand_ents = set() | ||||
|             cand_per_ents = {ent_label: set() for ent_label in ent_labels} | ||||
|  |  | |||
|  | @ -1,7 +1,6 @@ | |||
| # coding: utf8 | ||||
| from __future__ import unicode_literals | ||||
| 
 | ||||
| import pytest | ||||
| from spacy.matcher import PhraseMatcher | ||||
| from spacy.tokens import Doc | ||||
| 
 | ||||
|  |  | |||
|  | @ -3,12 +3,13 @@ from __future__ import unicode_literals | |||
| 
 | ||||
| from ..util import get_doc | ||||
| 
 | ||||
| 
 | ||||
| def test_issue4104(en_vocab): | ||||
|     """Test that English lookup lemmatization of spun & dry are correct | ||||
|     expected mapping = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'} | ||||
| 	""" | ||||
|     text = 'dry spun spun-dry' | ||||
|     """ | ||||
|     text = "dry spun spun-dry" | ||||
|     doc = get_doc(en_vocab, [t for t in text.split(" ")]) | ||||
|     # using a simple list to preserve order | ||||
|     expected = ['dry', 'spin', 'spin-dry'] | ||||
|     expected = ["dry", "spin", "spin-dry"] | ||||
|     assert [token.lemma_ for token in doc] == expected | ||||
|  |  | |||
|  | @ -6,6 +6,7 @@ from spacy.gold import spans_from_biluo_tags, GoldParse | |||
| from spacy.tokens import Doc | ||||
| import pytest | ||||
| 
 | ||||
| 
 | ||||
| def test_gold_biluo_U(en_vocab): | ||||
|     words = ["I", "flew", "to", "London", "."] | ||||
|     spaces = [True, True, True, False, True] | ||||
|  | @ -32,14 +33,18 @@ def test_gold_biluo_BIL(en_vocab): | |||
|     tags = biluo_tags_from_offsets(doc, entities) | ||||
|     assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] | ||||
| 
 | ||||
| 
 | ||||
| def test_gold_biluo_overlap(en_vocab): | ||||
|     words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] | ||||
|     spaces = [True, True, True, True, True, False, True] | ||||
|     doc = Doc(en_vocab, words=words, spaces=spaces) | ||||
|     entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), | ||||
|     (len("I flew to "), len("I flew to San Francisco"), "LOC")] | ||||
|     entities = [ | ||||
|         (len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), | ||||
|         (len("I flew to "), len("I flew to San Francisco"), "LOC"), | ||||
|     ] | ||||
|     with pytest.raises(ValueError): | ||||
|         tags = biluo_tags_from_offsets(doc, entities) | ||||
|         biluo_tags_from_offsets(doc, entities) | ||||
| 
 | ||||
| 
 | ||||
| def test_gold_biluo_misalign(en_vocab): | ||||
|     words = ["I", "flew", "to", "San", "Francisco", "Valley."] | ||||
|  |  | |||
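The BILUO tests above check that a multi-token entity becomes `B-`/`I-`/`L-` tags and that overlapping offsets raise `ValueError`. As a rough illustration of that behaviour only, here is a minimal pure-Python sketch. The helper `biluo_tags_sketch` is hypothetical and is not spaCy's `biluo_tags_from_offsets`; in particular it simply skips entities that do not align with token boundaries, which is a simplification made for this sketch.

```python
def biluo_tags_sketch(words, spaces, entities):
    """Hypothetical helper: map (start_char, end_char, label) offsets to BILUO tags."""
    # Compute the character span of every token from the words and trailing spaces.
    starts, ends = [], []
    pos = 0
    for word, space in zip(words, spaces):
        starts.append(pos)
        pos += len(word)
        ends.append(pos)
        if space:
            pos += 1
    start_to_token = {s: i for i, s in enumerate(starts)}
    end_to_token = {e: i for i, e in enumerate(ends)}

    tags = ["O"] * len(words)
    taken = set()
    for start_char, end_char, label in entities:
        first = start_to_token.get(start_char)
        last = end_to_token.get(end_char)
        if first is None or last is None:
            continue  # misaligned offsets: skipped in this sketch (an assumption)
        span = range(first, last + 1)
        if taken.intersection(span):
            raise ValueError("Overlapping entities are not allowed")
        taken.update(span)
        if first == last:
            tags[first] = "U-" + label  # single-token entity
        else:
            tags[first] = "B-" + label  # beginning
            for i in range(first + 1, last):
                tags[i] = "I-" + label  # inside
            tags[last] = "L-" + label   # last
    return tags


words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
spaces = [True, True, True, True, True, False, True]
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
print(biluo_tags_sketch(words, spaces, entities))
```

With the words and offsets from `test_gold_biluo_BIL`, the sketch yields the same tag sequence that the test asserts; adding the second, shorter `LOC` span from `test_gold_biluo_overlap` makes it raise `ValueError`.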
|  | @ -7,67 +7,62 @@ from spacy.scorer import Scorer | |||
| from .util import get_doc | ||||
| 
 | ||||
| test_ner_cardinal = [ | ||||
|     [ | ||||
|         "100 - 200", | ||||
|         { | ||||
|             "entities": [ | ||||
|                 [0, 3, "CARDINAL"], | ||||
|                 [6, 9, "CARDINAL"] | ||||
|             ] | ||||
|         } | ||||
|     ] | ||||
|     ["100 - 200", {"entities": [[0, 3, "CARDINAL"], [6, 9, "CARDINAL"]]}] | ||||
| ] | ||||
| 
 | ||||
| test_ner_apple = [ | ||||
|     [ | ||||
|         "Apple is looking at buying U.K. startup for $1 billion", | ||||
|         { | ||||
|             "entities": [ | ||||
|                 (0, 5, "ORG"), | ||||
|                 (27, 31, "GPE"), | ||||
|                 (44, 54, "MONEY"), | ||||
|             ] | ||||
|         } | ||||
|         {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]}, | ||||
|     ] | ||||
| ] | ||||
| 
 | ||||
| 
 | ||||
| def test_ner_per_type(en_vocab): | ||||
|     # Gold and Doc are identical | ||||
|     scorer = Scorer() | ||||
|     for input_, annot in test_ner_cardinal: | ||||
|         doc = get_doc(en_vocab, words = input_.split(' '), ents = [[0, 1, 'CARDINAL'], [2, 3, 'CARDINAL']]) | ||||
|         gold = GoldParse(doc, entities = annot['entities']) | ||||
|         doc = get_doc( | ||||
|             en_vocab, | ||||
|             words=input_.split(" "), | ||||
|             ents=[[0, 1, "CARDINAL"], [2, 3, "CARDINAL"]], | ||||
|         ) | ||||
|         gold = GoldParse(doc, entities=annot["entities"]) | ||||
|         scorer.score(doc, gold) | ||||
|     results = scorer.scores | ||||
| 
 | ||||
|     assert results['ents_p'] == 100 | ||||
|     assert results['ents_f'] == 100 | ||||
|     assert results['ents_r'] == 100 | ||||
|     assert results['ents_per_type']['CARDINAL']['p'] == 100 | ||||
|     assert results['ents_per_type']['CARDINAL']['f'] == 100 | ||||
|     assert results['ents_per_type']['CARDINAL']['r'] == 100 | ||||
|     assert results["ents_p"] == 100 | ||||
|     assert results["ents_f"] == 100 | ||||
|     assert results["ents_r"] == 100 | ||||
|     assert results["ents_per_type"]["CARDINAL"]["p"] == 100 | ||||
|     assert results["ents_per_type"]["CARDINAL"]["f"] == 100 | ||||
|     assert results["ents_per_type"]["CARDINAL"]["r"] == 100 | ||||
| 
 | ||||
|     # Doc has one missing and one extra entity | ||||
|     # Entity type MONEY is not present in Doc | ||||
|     scorer = Scorer() | ||||
|     for input_, annot in test_ner_apple: | ||||
|         doc = get_doc(en_vocab, words = input_.split(' '), ents = [[0, 1, 'ORG'], [5, 6, 'GPE'], [6, 7, 'ORG']]) | ||||
|         gold = GoldParse(doc, entities = annot['entities']) | ||||
|         doc = get_doc( | ||||
|             en_vocab, | ||||
|             words=input_.split(" "), | ||||
|             ents=[[0, 1, "ORG"], [5, 6, "GPE"], [6, 7, "ORG"]], | ||||
|         ) | ||||
|         gold = GoldParse(doc, entities=annot["entities"]) | ||||
|         scorer.score(doc, gold) | ||||
|     results = scorer.scores | ||||
| 
 | ||||
|     assert results['ents_p'] == approx(66.66666) | ||||
|     assert results['ents_r'] == approx(66.66666) | ||||
|     assert results['ents_f'] == approx(66.66666) | ||||
|     assert 'GPE' in results['ents_per_type'] | ||||
|     assert 'MONEY' in results['ents_per_type'] | ||||
|     assert 'ORG' in results['ents_per_type'] | ||||
|     assert results['ents_per_type']['GPE']['p'] == 100 | ||||
|     assert results['ents_per_type']['GPE']['r'] == 100 | ||||
|     assert results['ents_per_type']['GPE']['f'] == 100 | ||||
|     assert results['ents_per_type']['MONEY']['p'] == 0 | ||||
|     assert results['ents_per_type']['MONEY']['r'] == 0 | ||||
|     assert results['ents_per_type']['MONEY']['f'] == 0 | ||||
|     assert results['ents_per_type']['ORG']['p'] == 50 | ||||
|     assert results['ents_per_type']['ORG']['r'] == 100 | ||||
|     assert results['ents_per_type']['ORG']['f'] == approx(66.66666) | ||||
|     assert results["ents_p"] == approx(66.66666) | ||||
|     assert results["ents_r"] == approx(66.66666) | ||||
|     assert results["ents_f"] == approx(66.66666) | ||||
|     assert "GPE" in results["ents_per_type"] | ||||
|     assert "MONEY" in results["ents_per_type"] | ||||
|     assert "ORG" in results["ents_per_type"] | ||||
|     assert results["ents_per_type"]["GPE"]["p"] == 100 | ||||
|     assert results["ents_per_type"]["GPE"]["r"] == 100 | ||||
|     assert results["ents_per_type"]["GPE"]["f"] == 100 | ||||
|     assert results["ents_per_type"]["MONEY"]["p"] == 0 | ||||
|     assert results["ents_per_type"]["MONEY"]["r"] == 0 | ||||
|     assert results["ents_per_type"]["MONEY"]["f"] == 0 | ||||
|     assert results["ents_per_type"]["ORG"]["p"] == 50 | ||||
|     assert results["ents_per_type"]["ORG"]["r"] == 100 | ||||
|     assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666) | ||||
|  |  | |||
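The expected values in `test_ner_per_type` follow from counting true positives, false positives and false negatives per label. The sketch below reproduces that arithmetic in plain Python for the "Apple" example. It is not spaCy's `Scorer`: the functions `prf` and `score_entities` and the `(label, start_token, end_token)` tuple layout are assumptions made for illustration.

```python
def prf(tp, fp, fn):
    """Return precision, recall and F-score as percentages."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def score_entities(gold, cand):
    """Hypothetical helper: score candidate entities against gold, overall and per label."""
    gold, cand = set(gold), set(cand)
    labels = {ent[0] for ent in gold | cand}
    per_type = {}
    for label in labels:
        g = {ent for ent in gold if ent[0] == label}
        c = {ent for ent in cand if ent[0] == label}
        per_type[label] = prf(len(g & c), len(c - g), len(g - c))
    overall = prf(len(gold & cand), len(cand - gold), len(gold - cand))
    return overall, per_type


# Token spans for "Apple is looking at buying U.K. startup for $1 billion",
# split on single spaces (tokens 0..9).
gold = [("ORG", 0, 1), ("GPE", 5, 6), ("MONEY", 8, 10)]
cand = [("ORG", 0, 1), ("GPE", 5, 6), ("ORG", 6, 7)]  # one extra ORG, MONEY missing

overall, per_type = score_entities(gold, cand)
print(overall)
print(per_type["ORG"])
```

Two of the three predicted entities are correct and two of the three gold entities are found, so overall precision, recall and F-score are each about 66.67. For `ORG`, one of the two predictions is correct (precision 50) and the single gold `ORG` is found (recall 100). `MONEY` appears only in the gold data, so all three of its scores are 0.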
|  | @ -1,18 +0,0 @@ | |||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But  | ||||
| <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware, | ||||
| and  | ||||
| <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s  | ||||
| <mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Siri  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>, available on  | ||||
| <mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">iPhones  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>, and  | ||||
| <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s  | ||||
| <mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Alexa  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>software, which runs on its  | ||||
| <mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Echo  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>and  | ||||
| <mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Dot  | ||||
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>devices, have clear leads in consumer adoption.</div> | ||||
							
								
								
									
16  website/docs/images/displacy-ent1.html  (Normal file)
							|  | @ -0,0 +1,16 @@ | |||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"> | ||||
|     <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         Apple | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span> | ||||
|     </mark> | ||||
|     is looking at buying | ||||
|     <mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         U.K. | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span> | ||||
|     </mark> | ||||
|     startup for | ||||
|     <mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         $1 billion | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span> | ||||
|     </mark> | ||||
| </div> | ||||
							
								
								
									
18  website/docs/images/displacy-ent2.html  (Normal file)
							|  | @ -0,0 +1,18 @@ | |||
| <div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"> | ||||
|     When | ||||
|     <mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         Sebastian Thrun | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span> | ||||
|     </mark> | ||||
|     started working on self-driving cars at | ||||
|     <mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         Google | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span> | ||||
|     </mark> | ||||
|     in | ||||
|     <mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"> | ||||
|         2007 | ||||
|         <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span> | ||||
|     </mark> | ||||
|     , few people outside of the company took him seriously. | ||||
| </div> | ||||
|  | @ -32,7 +32,7 @@ for ent in doc.ents: | |||
| Using spaCy's built-in [displaCy visualizer](/usage/visualizers), here's what | ||||
| our example sentence and its named entities look like: | ||||
| 
 | ||||
| import DisplaCyEntHtml from 'images/displacy-ent.html'; import { Iframe } from | ||||
| import DisplaCyEntHtml from 'images/displacy-ent1.html'; import { Iframe } from | ||||
| 'components/embed' | ||||
| 
 | ||||
| <Iframe title="displaCy visualization of entities" html={DisplaCyEntHtml} height={450} /> | ||||
| <Iframe title="displaCy visualization of entities" html={DisplaCyEntHtml} height={100} /> | ||||
|  |  | |||
|  | @ -564,19 +564,16 @@ For more details and examples, see the | |||
| import spacy | ||||
| from spacy import displacy | ||||
| 
 | ||||
| text = """But Google is starting from behind. The company made a late push | ||||
| into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa | ||||
| software, which runs on its Echo and Dot devices, have clear leads in | ||||
| consumer adoption.""" | ||||
| text = u"When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously." | ||||
| 
 | ||||
| nlp = spacy.load("custom_ner_model") | ||||
| nlp = spacy.load("en_core_web_sm") | ||||
| doc = nlp(text) | ||||
| displacy.serve(doc, style="ent") | ||||
| ``` | ||||
| 
 | ||||
| import DisplacyEntHtml from 'images/displacy-ent.html' | ||||
| import DisplacyEntHtml from 'images/displacy-ent2.html' | ||||
| 
 | ||||
| <Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={275} /> | ||||
| <Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} /> | ||||
| 
 | ||||
| ## Tokenization {#tokenization} | ||||
| 
 | ||||
|  |  | |||
|  | @ -117,19 +117,16 @@ text. | |||
| import spacy | ||||
| from spacy import displacy | ||||
| 
 | ||||
| text = """But Google is starting from behind. The company made a late push | ||||
| into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa | ||||
| software, which runs on its Echo and Dot devices, have clear leads in | ||||
| consumer adoption.""" | ||||
| text = u"When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously." | ||||
| 
 | ||||
| nlp = spacy.load("custom_ner_model") | ||||
| nlp = spacy.load("en_core_web_sm") | ||||
| doc = nlp(text) | ||||
| displacy.serve(doc, style="ent") | ||||
| ``` | ||||
| 
 | ||||
| import DisplacyEntHtml from 'images/displacy-ent.html' | ||||
| import DisplacyEntHtml from 'images/displacy-ent2.html' | ||||
| 
 | ||||
| <Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={275} /> | ||||
| <Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} /> | ||||
| 
 | ||||
| The entity visualizer lets you customize the following `options`: | ||||
| 
 | ||||
|  | @ -204,11 +201,14 @@ doc2 = nlp(LONG_NEWS_ARTICLE) | |||
| displacy.render(doc2, style="ent") | ||||
| ``` | ||||
| 
 | ||||
| > #### Enabling or disabling Jupyter mode | ||||
| > | ||||
| > To explicitly enable or disable "Jupyter mode", you can use the jupyter` | ||||
| > keyword argument – e.g. to return raw HTML in a notebook, or to force Jupyter | ||||
| > rendering if auto-detection fails. | ||||
| <Infobox variant="warning" title="Important note"> | ||||
| 
 | ||||
| To explicitly enable or disable "Jupyter mode", you can use the `jupyter` | ||||
| keyword argument – e.g. to return raw HTML in a notebook, or to force Jupyter | ||||
| rendering if auto-detection fails. | ||||
| 
 | ||||
| </Infobox> | ||||
| 
 | ||||
| 
 | ||||
|  | ||||
| 
 | ||||
|  | @ -284,7 +284,7 @@ nlp = spacy.load("en_core_web_sm") | |||
| sentences = [u"This is an example.", u"This is another one."] | ||||
| for sent in sentences: | ||||
|     doc = nlp(sent) | ||||
|     svg = displacy.render(doc, style="dep") | ||||
|     svg = displacy.render(doc, style="dep", jupyter=False) | ||||
|     file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg" | ||||
|     output_path = Path("/images/" + file_name) | ||||
|     output_path.open("w", encoding="utf-8").write(svg) | ||||
|  |  | |||