diff --git a/.github/contributors/isaric.md b/.github/contributors/isaric.md new file mode 100644 index 000000000..698eb1d07 --- /dev/null +++ b/.github/contributors/isaric.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5.
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Ivan Šarić | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 18.08.2019. | +| GitHub username | isaric | +| Website (optional) | | diff --git a/.github/contributors/yanaiela.md b/.github/contributors/yanaiela.md new file mode 100644 index 000000000..ee76318c3 --- /dev/null +++ b/.github/contributors/yanaiela.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5.
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ ] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Yanai Elazar | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 14/8/2019 | +| GitHub username | yanaiela | +| Website (optional) | https://yanaiela.github.io | \ No newline at end of file diff --git a/.gitignore b/.gitignore index 35d431d48..c4ad59fc7 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,7 @@ spacy/data/ corpora/ /models/ keys/ +*.json.gz # Website website/.cache/ diff --git a/spacy/_ml.py b/spacy/_ml.py index 1e8c0f27b..0411c4bd4 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -674,14 +674,14 @@ def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg): with Model.define_operators({">>": chain, "**": clone}): # context encoder tok2vec = Tok2Vec( - width=hidden_width, - embed_size=embed_width, - pretrained_vectors=pretrained_vectors, - cnn_maxout_pieces=cnn_maxout_pieces, - subword_features=True, - conv_depth=conv_depth, - bilstm_depth=0, - ) + width=hidden_width, + embed_size=embed_width, + pretrained_vectors=pretrained_vectors, + cnn_maxout_pieces=cnn_maxout_pieces, + subword_features=True, + conv_depth=conv_depth, + bilstm_depth=0, + ) model = ( tok2vec diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index 656cd640a..0a9a0f7ef 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -8,7 +8,7 @@ import sys import srsly from wasabi import Printer, MESSAGES -from ..gold import GoldCorpus, read_json_object +from ..gold import GoldCorpus from ..syntax import nonproj from ..util import load_model, get_lang_class @@ -95,13 +95,19 @@ def debug_data( corpus = GoldCorpus(train_path, dev_path) try: train_docs = list(corpus.train_docs(nlp)) - train_docs_unpreprocessed = list(corpus.train_docs_without_preprocessing(nlp)) + train_docs_unpreprocessed = list( + corpus.train_docs_without_preprocessing(nlp) + ) except ValueError as e: - loading_train_error_message = "Training 
data cannot be loaded: {}".format(str(e)) + loading_train_error_message = "Training data cannot be loaded: {}".format( + str(e) + ) try: dev_docs = list(corpus.dev_docs(nlp)) except ValueError as e: - loading_dev_error_message = "Development data cannot be loaded: {}".format(str(e)) + loading_dev_error_message = "Development data cannot be loaded: {}".format( + str(e) + ) if loading_train_error_message or loading_dev_error_message: if loading_train_error_message: msg.fail(loading_train_error_message) @@ -158,11 +164,15 @@ def debug_data( ) if gold_train_data["n_misaligned_words"] > 0: msg.warn( - "{} misaligned tokens in the training data".format(gold_train_data["n_misaligned_words"]) + "{} misaligned tokens in the training data".format( + gold_train_data["n_misaligned_words"] + ) ) if gold_dev_data["n_misaligned_words"] > 0: msg.warn( - "{} misaligned tokens in the dev data".format(gold_dev_data["n_misaligned_words"]) + "{} misaligned tokens in the dev data".format( + gold_dev_data["n_misaligned_words"] + ) ) most_common_words = gold_train_data["words"].most_common(10) msg.text( @@ -184,7 +194,9 @@ def debug_data( if "ner" in pipeline: # Get all unique NER labels present in the data - labels = set(label for label in gold_train_data["ner"] if label not in ("O", "-")) + labels = set( + label for label in gold_train_data["ner"] if label not in ("O", "-") + ) label_counts = gold_train_data["ner"] model_labels = _get_labels_from_model(nlp, "ner") new_labels = [l for l in labels if l not in model_labels] @@ -222,7 +234,9 @@ def debug_data( ) if gold_train_data["ws_ents"]: - msg.fail("{} invalid whitespace entity spans".format(gold_train_data["ws_ents"])) + msg.fail( + "{} invalid whitespace entity spans".format(gold_train_data["ws_ents"]) + ) has_ws_ents_error = True for label in new_labels: @@ -323,33 +337,36 @@ def debug_data( "Found {} sentence{} with an average length of {:.1f} words.".format( gold_train_data["n_sents"], "s" if len(train_docs) > 1 else "", - 
gold_train_data["n_words"] / gold_train_data["n_sents"] + gold_train_data["n_words"] / gold_train_data["n_sents"], ) ) # profile labels labels_train = [label for label in gold_train_data["deps"]] - labels_train_unpreprocessed = [label for label in gold_train_unpreprocessed_data["deps"]] + labels_train_unpreprocessed = [ + label for label in gold_train_unpreprocessed_data["deps"] + ] labels_dev = [label for label in gold_dev_data["deps"]] if gold_train_unpreprocessed_data["n_nonproj"] > 0: msg.info( "Found {} nonprojective train sentence{}".format( gold_train_unpreprocessed_data["n_nonproj"], - "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "" + "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "", ) ) if gold_dev_data["n_nonproj"] > 0: msg.info( "Found {} nonprojective dev sentence{}".format( gold_dev_data["n_nonproj"], - "s" if gold_dev_data["n_nonproj"] > 1 else "" + "s" if gold_dev_data["n_nonproj"] > 1 else "", ) ) msg.info( "{} {} in train data".format( - len(labels_train_unpreprocessed), "label" if len(labels_train) == 1 else "labels" + len(labels_train_unpreprocessed), + "label" if len(labels_train) == 1 else "labels", ) ) msg.info( @@ -373,43 +390,45 @@ def debug_data( ) has_low_data_warning = True - # rare labels in projectivized train rare_projectivized_labels = [] for label in gold_train_data["deps"]: if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label: - rare_projectivized_labels.append("{}: {}".format(label, str(gold_train_data["deps"][label]))) + rare_projectivized_labels.append( + "{}: {}".format(label, str(gold_train_data["deps"][label])) + ) if len(rare_projectivized_labels) > 0: - msg.warn( - "Low number of examples for {} label{} in the " - "projectivized dependency trees used for training. 
You may " - "want to projectivize labels such as punct before " - "training in order to improve parser performance.".format( - len(rare_projectivized_labels), - "s" if len(rare_projectivized_labels) > 1 else "") + msg.warn( + "Low number of examples for {} label{} in the " + "projectivized dependency trees used for training. You may " + "want to projectivize labels such as punct before " + "training in order to improve parser performance.".format( + len(rare_projectivized_labels), + "s" if len(rare_projectivized_labels) > 1 else "", ) - msg.warn( - "Projectivized labels with low numbers of examples: " - "{}".format("\n".join(rare_projectivized_labels)), - show=verbose - ) - has_low_data_warning = True + ) + msg.warn( + "Projectivized labels with low numbers of examples: " + "{}".format("\n".join(rare_projectivized_labels)), + show=verbose, + ) + has_low_data_warning = True # labels only in train if set(labels_train) - set(labels_dev): msg.warn( "The following labels were found only in the train data: " "{}".format(", ".join(set(labels_train) - set(labels_dev))), - show=verbose + show=verbose, ) # labels only in dev if set(labels_dev) - set(labels_train): msg.warn( - "The following labels were found only in the dev data: " + - ", ".join(set(labels_dev) - set(labels_train)), - show=verbose + "The following labels were found only in the dev data: " + + ", ".join(set(labels_dev) - set(labels_train)), + show=verbose, ) if has_low_data_warning: @@ -422,8 +441,10 @@ def debug_data( # multiple root labels if len(gold_train_unpreprocessed_data["roots"]) > 1: msg.warn( - "Multiple root labels ({}) ".format(", ".join(gold_train_unpreprocessed_data["roots"])) + - "found in training data. spaCy's parser uses a single root " + "Multiple root labels ({}) ".format( + ", ".join(gold_train_unpreprocessed_data["roots"]) + ) + + "found in training data. spaCy's parser uses a single root " "label ROOT so this distinction will not be available." 
) @@ -432,14 +453,14 @@ def debug_data( msg.fail( "Found {} nonprojective projectivized train sentence{}".format( gold_train_data["n_nonproj"], - "s" if gold_train_data["n_nonproj"] > 1 else "" + "s" if gold_train_data["n_nonproj"] > 1 else "", ) ) if gold_train_data["n_cycles"] > 0: msg.fail( "Found {} projectivized train sentence{} with cycles".format( gold_train_data["n_cycles"], - "s" if gold_train_data["n_cycles"] > 1 else "" + "s" if gold_train_data["n_cycles"] > 1 else "", ) ) diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index 468698e2f..0a57ef2da 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -84,12 +84,12 @@ def evaluate( def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True): docs[0].user_data["title"] = model_name if ents: - with (output_path / "entities.html").open("w") as file_: - html = displacy.render(docs[:limit], style="ent", page=True) + html = displacy.render(docs[:limit], style="ent", page=True) + with (output_path / "entities.html").open("w", encoding="utf8") as file_: file_.write(html) if deps: - with (output_path / "parses.html").open("w") as file_: - html = displacy.render( - docs[:limit], style="dep", page=True, options={"compact": True} - ) + html = displacy.render( + docs[:limit], style="dep", page=True, options={"compact": True} + ) + with (output_path / "parses.html").open("w", encoding="utf8") as file_: file_.write(html) diff --git a/spacy/cli/init_model.py b/spacy/cli/init_model.py index f3b60e7fa..93d37d4c9 100644 --- a/spacy/cli/init_model.py +++ b/spacy/cli/init_model.py @@ -114,7 +114,7 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc): probs, _ = read_freqs(freqs_loc) msg.good("Counted frequencies") else: - probs, _ = ({}, DEFAULT_OOV_PROB) + probs, _ = ({}, DEFAULT_OOV_PROB) # noqa: F841 if clusters_loc: with msg.loading("Reading clusters..."): clusters = read_clusters(clusters_loc) diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 
c2a903a56..8e980d203 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -247,6 +247,15 @@ class EntityRenderer(object): self.direction = DEFAULT_DIR self.lang = DEFAULT_LANG + template = options.get("template") + if template: + self.ent_template = template + else: + if self.direction == "rtl": + self.ent_template = TPL_ENT_RTL + else: + self.ent_template = TPL_ENT + def render(self, parsed, page=False, minify=False): """Render complete markup. @@ -284,6 +293,7 @@ class EntityRenderer(object): label = span["label"] start = span["start"] end = span["end"] + additional_params = span.get("params", {}) entity = escape_html(text[start:end]) fragments = text[offset:start].split("\n") for i, fragment in enumerate(fragments): @@ -293,10 +303,8 @@ class EntityRenderer(object): if self.ents is None or label.upper() in self.ents: color = self.colors.get(label.upper(), self.default_color) ent_settings = {"label": label, "text": entity, "bg": color} - if self.direction == "rtl": - markup += TPL_ENT_RTL.format(**ent_settings) - else: - markup += TPL_ENT.format(**ent_settings) + ent_settings.update(additional_params) + markup += self.ent_template.format(**ent_settings) else: markup += entity offset = end diff --git a/spacy/errors.py b/spacy/errors.py index 0a4875d96..d23ad66e8 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -429,6 +429,7 @@ class Errors(object): E155 = ("The `nlp` object should have access to pre-trained word vectors, cf. " "https://spacy.io/usage/models#languages.") + @add_codes class TempErrors(object): T003 = ("Resizing pre-trained Tagger models is not currently supported.") diff --git a/spacy/lang/hr/examples.py b/spacy/lang/hr/examples.py new file mode 100644 index 000000000..dc52ce4f0 --- /dev/null +++ b/spacy/lang/hr/examples.py @@ -0,0 +1,18 @@ +# coding: utf8 +from __future__ import unicode_literals + +""" +Example sentences to test spaCy and its language models. 
+ +>>> from spacy.lang.hr.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + +sentences = [ + "Ovo je rečenica.", + "Kako se popravlja auto?", + "Zagreb je udaljen od Ljubljane svega 150 km.", + "Nećete vjerovati što se dogodilo na ovogodišnjem festivalu!", + "Budućnost Apple je upitna nakon dugotrajnog pada vrijednosti dionica firme.", + "Trgovina oružjem predstavlja prijetnju za globalni mir.", +] diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py index 3ea72a67f..6dc6456e5 100644 --- a/spacy/lang/ko/__init__.py +++ b/spacy/lang/ko/__init__.py @@ -1,10 +1,8 @@ # encoding: utf8 from __future__ import unicode_literals, print_function -import re import sys - from .stop_words import STOP_WORDS from .tag_map import TAG_MAP from ...attrs import LANG @@ -32,7 +30,7 @@ else: from typing import NamedTuple class Morpheme(NamedTuple): - + surface = str("") lemma = str("") tag = str("") diff --git a/spacy/lang/tokenizer_exceptions.py b/spacy/lang/tokenizer_exceptions.py index 90141e81a..4d5ff4423 100644 --- a/spacy/lang/tokenizer_exceptions.py +++ b/spacy/lang/tokenizer_exceptions.py @@ -109,7 +109,7 @@ for orth in [ emoticons = set( - """ + r""" :) :-) :)) diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index b1ee5105c..91daea099 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -8,6 +8,7 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS from .stop_words import STOP_WORDS from .tag_map import TAG_MAP + class ChineseDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters[LANG] = lambda text: "zh" @@ -45,4 +46,4 @@ class Chinese(Language): return Doc(self.vocab, words=words, spaces=spaces) -__all__ = ["Chinese"] \ No newline at end of file +__all__ = ["Chinese"] diff --git a/spacy/lang/zh/tag_map.py b/spacy/lang/zh/tag_map.py index 6aa988a98..8d2f99d01 100644 --- a/spacy/lang/zh/tag_map.py +++ b/spacy/lang/zh/tag_map.py @@ -1,8 +1,8 @@ # coding: utf8 
from __future__ import unicode_literals -from ...symbols import POS, PUNCT, SYM, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX +from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB +from ...symbols import NOUN, PART, INTJ, PRON # The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. # We also map the tags to the simpler Google Universal POS tag set. @@ -43,5 +43,5 @@ TAG_MAP = { "JJ": {POS: ADJ}, "P": {POS: ADP}, "PN": {POS: PRON}, - "PU": {POS: PUNCT} -} \ No newline at end of file + "PU": {POS: PUNCT}, +} diff --git a/spacy/scorer.py b/spacy/scorer.py index 1362e9b4d..4032cc4dd 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -160,14 +160,15 @@ class Scorer(object): cand_deps.add((gold_i, gold_head, token.dep_.lower())) if "-" not in [token[-1] for token in gold.orig_annot]: # Find all NER labels in gold and doc - ent_labels = set([x[0] for x in gold_ents] - + [k.label_ for k in doc.ents]) + ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents]) # Set up all labels for per type scoring and prepare gold per type gold_per_ents = {ent_label: set() for ent_label in ent_labels} for ent_label in ent_labels: if ent_label not in self.ner_per_ents: self.ner_per_ents[ent_label] = PRFScore() - gold_per_ents[ent_label].update([x for x in gold_ents if x[0] == ent_label]) + gold_per_ents[ent_label].update( + [x for x in gold_ents if x[0] == ent_label] + ) # Find all candidate labels, for all and per type cand_ents = set() cand_per_ents = {ent_label: set() for ent_label in ent_labels} diff --git a/spacy/tests/regression/test_issue4002.py b/spacy/tests/regression/test_issue4002.py index d5d7bc86c..37e054b3e 100644 --- a/spacy/tests/regression/test_issue4002.py +++ b/spacy/tests/regression/test_issue4002.py @@ -1,7 +1,6 @@ # coding: utf8 from __future__ import unicode_literals -import pytest from spacy.matcher import 
PhraseMatcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue4104.py b/spacy/tests/regression/test_issue4104.py index b7c6af773..10ae2d360 100644 --- a/spacy/tests/regression/test_issue4104.py +++ b/spacy/tests/regression/test_issue4104.py @@ -3,12 +3,13 @@ from __future__ import unicode_literals from ..util import get_doc + def test_issue4104(en_vocab): """Test that English lookup lemmatization of spun & dry are correct expected mapping = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'} - """ - text = 'dry spun spun-dry' + """ + text = "dry spun spun-dry" doc = get_doc(en_vocab, [t for t in text.split(" ")]) # using a simple list to preserve order - expected = ['dry', 'spin', 'spin-dry'] + expected = ["dry", "spin", "spin-dry"] assert [token.lemma_ for token in doc] == expected diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py index ac08716b7..860540be2 100644 --- a/spacy/tests/test_gold.py +++ b/spacy/tests/test_gold.py @@ -6,6 +6,7 @@ from spacy.gold import spans_from_biluo_tags, GoldParse from spacy.tokens import Doc import pytest + def test_gold_biluo_U(en_vocab): words = ["I", "flew", "to", "London", "."] spaces = [True, True, True, False, True] @@ -32,14 +33,18 @@ def test_gold_biluo_BIL(en_vocab): tags = biluo_tags_from_offsets(doc, entities) assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] + def test_gold_biluo_overlap(en_vocab): words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] spaces = [True, True, True, True, True, False, True] doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), - (len("I flew to "), len("I flew to San Francisco"), "LOC")] + entities = [ + (len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), + (len("I flew to "), len("I flew to San Francisco"), "LOC"), + ] with pytest.raises(ValueError): - tags = biluo_tags_from_offsets(doc, entities) + biluo_tags_from_offsets(doc, 
entities) + def test_gold_biluo_misalign(en_vocab): words = ["I", "flew", "to", "San", "Francisco", "Valley."] diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index a88aef368..a747d3adb 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -7,67 +7,62 @@ from spacy.scorer import Scorer from .util import get_doc test_ner_cardinal = [ - [ - "100 - 200", - { - "entities": [ - [0, 3, "CARDINAL"], - [6, 9, "CARDINAL"] - ] - } - ] + ["100 - 200", {"entities": [[0, 3, "CARDINAL"], [6, 9, "CARDINAL"]]}] ] test_ner_apple = [ [ "Apple is looking at buying U.K. startup for $1 billion", - { - "entities": [ - (0, 5, "ORG"), - (27, 31, "GPE"), - (44, 54, "MONEY"), - ] - } + {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]}, ] ] + def test_ner_per_type(en_vocab): # Gold and Doc are identical scorer = Scorer() for input_, annot in test_ner_cardinal: - doc = get_doc(en_vocab, words = input_.split(' '), ents = [[0, 1, 'CARDINAL'], [2, 3, 'CARDINAL']]) - gold = GoldParse(doc, entities = annot['entities']) + doc = get_doc( + en_vocab, + words=input_.split(" "), + ents=[[0, 1, "CARDINAL"], [2, 3, "CARDINAL"]], + ) + gold = GoldParse(doc, entities=annot["entities"]) scorer.score(doc, gold) results = scorer.scores - assert results['ents_p'] == 100 - assert results['ents_f'] == 100 - assert results['ents_r'] == 100 - assert results['ents_per_type']['CARDINAL']['p'] == 100 - assert results['ents_per_type']['CARDINAL']['f'] == 100 - assert results['ents_per_type']['CARDINAL']['r'] == 100 + assert results["ents_p"] == 100 + assert results["ents_f"] == 100 + assert results["ents_r"] == 100 + assert results["ents_per_type"]["CARDINAL"]["p"] == 100 + assert results["ents_per_type"]["CARDINAL"]["f"] == 100 + assert results["ents_per_type"]["CARDINAL"]["r"] == 100 # Doc has one missing and one extra entity # Entity type MONEY is not present in Doc scorer = Scorer() for input_, annot in test_ner_apple: - doc = get_doc(en_vocab, words = 
input_.split(' '), ents = [[0, 1, 'ORG'], [5, 6, 'GPE'], [6, 7, 'ORG']]) - gold = GoldParse(doc, entities = annot['entities']) + doc = get_doc( + en_vocab, + words=input_.split(" "), + ents=[[0, 1, "ORG"], [5, 6, "GPE"], [6, 7, "ORG"]], + ) + gold = GoldParse(doc, entities=annot["entities"]) scorer.score(doc, gold) results = scorer.scores - assert results['ents_p'] == approx(66.66666) - assert results['ents_r'] == approx(66.66666) - assert results['ents_f'] == approx(66.66666) - assert 'GPE' in results['ents_per_type'] - assert 'MONEY' in results['ents_per_type'] - assert 'ORG' in results['ents_per_type'] - assert results['ents_per_type']['GPE']['p'] == 100 - assert results['ents_per_type']['GPE']['r'] == 100 - assert results['ents_per_type']['GPE']['f'] == 100 - assert results['ents_per_type']['MONEY']['p'] == 0 - assert results['ents_per_type']['MONEY']['r'] == 0 - assert results['ents_per_type']['MONEY']['f'] == 0 - assert results['ents_per_type']['ORG']['p'] == 50 - assert results['ents_per_type']['ORG']['r'] == 100 - assert results['ents_per_type']['ORG']['f'] == approx(66.66666) + assert results["ents_p"] == approx(66.66666) + assert results["ents_r"] == approx(66.66666) + assert results["ents_f"] == approx(66.66666) + assert "GPE" in results["ents_per_type"] + assert "MONEY" in results["ents_per_type"] + assert "ORG" in results["ents_per_type"] + assert results["ents_per_type"]["GPE"]["p"] == 100 + assert results["ents_per_type"]["GPE"]["r"] == 100 + assert results["ents_per_type"]["GPE"]["f"] == 100 + assert results["ents_per_type"]["MONEY"]["p"] == 0 + assert results["ents_per_type"]["MONEY"]["r"] == 0 + assert results["ents_per_type"]["MONEY"]["f"] == 0 + assert results["ents_per_type"]["ORG"]["p"] == 50 + assert results["ents_per_type"]["ORG"]["r"] == 100 + assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666) diff --git a/website/docs/images/displacy-ent.html b/website/docs/images/displacy-ent.html deleted file mode 100644 index 
4432bfd45..000000000 --- a/website/docs/images/displacy-ent.html +++ /dev/null @@ -1,18 +0,0 @@ -
But -Google -ORGis starting from behind. The company made a late push into hardware, -and -Apple -ORG’s -Siri -PRODUCT, available on -iPhones -PRODUCT, and -Amazon -ORG’s -Alexa -PRODUCTsoftware, which runs on its -Echo -PRODUCTand -Dot -PRODUCTdevices, have clear leads in consumer adoption.
diff --git a/website/docs/images/displacy-ent1.html b/website/docs/images/displacy-ent1.html new file mode 100644 index 000000000..6e3de2675 --- /dev/null +++ b/website/docs/images/displacy-ent1.html @@ -0,0 +1,16 @@ +
+ + Apple + ORG + + is looking at buying + + U.K. + GPE + + startup for + + $1 billion + MONEY + +
diff --git a/website/docs/images/displacy-ent2.html b/website/docs/images/displacy-ent2.html new file mode 100644 index 000000000..e72640b51 --- /dev/null +++ b/website/docs/images/displacy-ent2.html @@ -0,0 +1,18 @@ +
+ When + + Sebastian Thrun + PERSON + + started working on self-driving cars at + + Google + ORG + + in + + 2007 + DATE + + , few people outside of the company took him seriously. +
diff --git a/website/docs/usage/101/_named-entities.md b/website/docs/usage/101/_named-entities.md index e6bdfdf32..54db6dbe8 100644 --- a/website/docs/usage/101/_named-entities.md +++ b/website/docs/usage/101/_named-entities.md @@ -32,7 +32,7 @@ for ent in doc.ents: Using spaCy's built-in [displaCy visualizer](/usage/visualizers), here's what our example sentence and its named entities look like: -import DisplaCyEntHtml from 'images/displacy-ent.html'; import { Iframe } from +import DisplaCyEntHtml from 'images/displacy-ent1.html'; import { Iframe } from 'components/embed' -