Merge branch 'master' into spacy.io

2026-01-14 04:19:07 +03:00 · 2019-05-24 14:06:47 +02:00 · 1572490d57
commit 1572490d57
parent 3cbbc4afcb 7634812172
7 changed files with 314 additions and 61 deletions
--- a/.github/contributors/ujwal-narayan.md
+++ b/.github/contributors/ujwal-narayan.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |      Ujwal Narayan   |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |   17/05/2019         |
+| GitHub username                |   ujwal-narayan      |
+| Website (optional)             |                      |
--- a/spacy/lang/kn/stop_words.py
+++ b/spacy/lang/kn/stop_words.py
@@ -4,67 +4,87 @@ from __future__ import unicode_literals

 STOP_WORDS = set(
    """
-ಈ
-ಮತ್ತು
-ಹಾಗೂ
-ಅವರು
-ಅವರ
-ಬಗ್ಗೆ
-ಎಂಬ
-ಆದರೆ
-ಅವರನ್ನು
-ಆದರೆ
-ತಮ್ಮ
-ಒಂದು
-ಎಂದರು
-ಮೇಲೆ
-ಹೇಳಿದರು
-ಸೇರಿದಂತೆ
-ಬಳಿಕ
-ಆ
-ಯಾವುದೇ
-ಅವರಿಗೆ
-ನಡೆದ
-ಕುರಿತು
-ಇದು
-ಅವರು
-ಕಳೆದ
-ಇದೇ
-ತಿಳಿಸಿದರು
-ಹೀಗಾಗಿ
-ಕೂಡ
-ತನ್ನ
-ತಿಳಿಸಿದ್ದಾರೆ
-ನಾನು
-ಹೇಳಿದ್ದಾರೆ
-ಈಗ
-ಎಲ್ಲ
-ನನ್ನ
-ನಮ್ಮ
-ಈಗಾಗಲೇ
-ಇದಕ್ಕೆ
 ಹಲವು
-ಇದೆ
-ಮತ್ತೆ
-ಮಾಡುವ
-ನೀಡಿದರು
-ನಾವು
-ನೀಡಿದ
-ಇದರಿಂದ
+ಮೂಲಕ
+ಹಾಗೂ
 ಅದು
-ಇದನ್ನು
 ನೀಡಿದ್ದಾರೆ
+ಯಾವ
+ಎಂದರು
+ಅವರು
+ಈಗ
+ಎಂಬ
+ಹಾಗಾಗಿ
+ಅಷ್ಟೇ
+ನಾವು
+ಇದೇ
+ಹೇಳಿ
+ತಮ್ಮ
+ಹೀಗೆ
+ನಮ್ಮ
+ಬೇರೆ
+ನೀಡಿದರು
+ಮತ್ತೆ
+ಇದು
+ಈ
+ನೀವು
+ನಾನು
+ಇತ್ತು
+ಎಲ್ಲಾ
+ಯಾವುದೇ
+ನಡೆದ
 ಅದನ್ನು
-ಇಲ್ಲಿ
-ಆಗ
-ಬಂದಿದೆ.
-ಅದೇ
-ಇರುವ
-ಅಲ್ಲದೆ
-ಕೆಲವು
+ಎಂದರೆ
 ನೀಡಿದೆ
+ಹೀಗಾಗಿ
+ಜೊತೆಗೆ
+ಇದರಿಂದ
+ನನಗೆ
+ಅಲ್ಲದೆ
+ಎಷ್ಟು
 ಇದರ
+ಇಲ್ಲ
+ಕಳೆದ
+ತುಂಬಾ
+ಈಗಾಗಲೇ
+ಮಾಡಿ
+ಅದಕ್ಕೆ
+ಬಗ್ಗೆ
+ಅವರ
+ಇದನ್ನು
+ಆ
+ಇದೆ
+ಹೆಚ್ಚು
 ಇನ್ನು
+ಎಲ್ಲ
+ಇರುವ
+ಅವರಿಗೆ
+ನಿಮ್ಮ
+ಏನು
+ಕೂಡ
+ಇಲ್ಲಿ
+ನನ್ನನ್ನು
+ಕೆಲವು
+ಮಾತ್ರ
+ಬಳಿಕ
+ಅಂತ
+ತನ್ನ
+ಆಗ
+ಅಥವಾ
+ಅಲ್ಲ
+ಕೇವಲ
+ಆದರೆ
+ಮತ್ತು
+ಇನ್ನೂ
+ಅದೇ
+ಆಗಿ
+ಅವರನ್ನು
+ಹೇಳಿದ್ದಾರೆ
 ನಡೆದಿದೆ
+ಇದಕ್ಕೆ
+ಎಂಬುದು
+ಎಂದು
+ನನ್ನ
+ಮೇಲೆ
 """.split()
 )
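The removed list above contains a few repeated entries (for example, `ಆದರೆ` and `ಅವರು` each appear twice). Because the string is split on whitespace and passed to `set()`, such repeats collapse into a single entry. A minimal standalone sketch of that pattern follows; the three-line word list is shortened for illustration only, and `is_stop` is a hypothetical helper, not a spaCy function.

```python
# Shortened word list for illustration; the real file lists many more entries.
words = """
ಮತ್ತು
ಆದರೆ
ಆದರೆ
""".split()

# set() removes the duplicate, so repeats in the source text are harmless.
STOP_WORDS = set(words)


def is_stop(token):
    """Hypothetical helper: True if the token is in the stop word set."""
    return token in STOP_WORDS


print(len(words), len(STOP_WORDS))  # 3 2
print(is_stop("ಮತ್ತು"))  # True
print(is_stop("ಕನ್ನಡ"))  # False
```

Membership tests on a set take constant time on average, which is presumably why the list is stored this way rather than as a list.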
--- a/spacy/language.py
+++ b/spacy/language.py
@@ -417,7 +417,9 @@ class Language(object):
        golds (iterable): A batch of `GoldParse` objects.
        drop (float): The dropout rate.
        sgd (callable): An optimizer.
-        RETURNS (dict): Results from the update.
+        losses (dict): Dictionary to update with the loss, keyed by component.
+        component_cfg (dict): Config parameters for specific pipeline
+            components, keyed by component name.

        DOCS: https://spacy.io/api/language#update
        """
@@ -598,6 +600,19 @@ class Language(object):
    def evaluate(
        self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
    ):
+        """Evaluate a model's pipeline components.
+
+        docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects.
+        verbose (bool): Print debugging information.
+        batch_size (int): Batch size to use.
+        scorer (Scorer): Optional `Scorer` to use. If not passed in, a new one
+            will be created.
+        component_cfg (dict): An optional dictionary with extra keyword
+            arguments for specific components.
+        RETURNS (Scorer): The scorer containing the evaluation results.
+
+        DOCS: https://spacy.io/api/language#evaluate
+        """
        if scorer is None:
            scorer = Scorer()
        if component_cfg is None:
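The `evaluate` docstring above says a new `Scorer` is created when none is passed in, and that the scorer is returned. The sketch below illustrates only that optional-argument pattern, under stated assumptions: `ToyScorer` and this `evaluate` function are hypothetical stand-ins that compare plain strings, not spaCy's `Scorer` or `Language.evaluate`.

```python
class ToyScorer:
    """Hypothetical stand-in for a scorer: counts exact-match predictions."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def score(self, predicted, gold):
        # Update the running counts from a single prediction/gold pair.
        self.total += 1
        if predicted == gold:
            self.correct += 1

    @property
    def acc(self):
        # Accuracy as a percentage; 0.0 if nothing has been scored yet.
        return 100.0 * self.correct / self.total if self.total else 0.0


def evaluate(pairs, scorer=None):
    """Score every pair, creating a scorer only if none was supplied."""
    if scorer is None:
        scorer = ToyScorer()
    for predicted, gold in pairs:
        scorer.score(predicted, gold)
    return scorer


first = evaluate([("NOUN", "NOUN"), ("VERB", "NOUN")])
print(first.acc)  # 50.0

# Passing the existing scorer accumulates results across calls.
combined = evaluate([("ADJ", "ADJ"), ("ADV", "ADV")], scorer=first)
print(combined.acc)  # 75.0
```

Using `None` as the default, rather than constructing the scorer in the signature, avoids sharing one mutable object between unrelated calls.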
--- a/spacy/scorer.py
+++ b/spacy/scorer.py
@@ -35,7 +35,17 @@ class PRFScore(object):


 class Scorer(object):
+    """Compute evaluation scores."""
+
    def __init__(self, eval_punct=False):
+        """Initialize the Scorer.
+
+        eval_punct (bool): Evaluate the dependency attachments to and from
+            punctuation.
+        RETURNS (Scorer): The newly created object.
+
+        DOCS: https://spacy.io/api/scorer#init
+        """
        self.tokens = PRFScore()
        self.sbd = PRFScore()
        self.unlabelled = PRFScore()
@@ -46,34 +56,46 @@ class Scorer(object):

    @property
    def tags_acc(self):
+        """RETURNS (float): Part-of-speech tag accuracy (fine grained tags,
+            i.e. `Token.tag`).
+        """
        return self.tags.fscore * 100

    @property
    def token_acc(self):
+        """RETURNS (float): Tokenization accuracy."""
        return self.tokens.precision * 100

    @property
    def uas(self):
+        """RETURNS (float): Unlabelled dependency score."""
        return self.unlabelled.fscore * 100

    @property
    def las(self):
+        """RETURNS (float): Labelled dependency score."""
        return self.labelled.fscore * 100

    @property
    def ents_p(self):
+        """RETURNS (float): Named entity accuracy (precision)."""
        return self.ner.precision * 100

    @property
    def ents_r(self):
+        """RETURNS (float): Named entity accuracy (recall)."""
        return self.ner.recall * 100

    @property
    def ents_f(self):
+        """RETURNS (float): Named entity accuracy (F-score)."""
        return self.ner.fscore * 100

    @property
    def scores(self):
+        """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
+            `ents_r`, `ents_f`, `tags_acc` and `token_acc`.
+        """
        return {
            "uas": self.uas,
            "las": self.las,
@@ -84,9 +106,20 @@ class Scorer(object):
            "token_acc": self.token_acc,
        }

-    def score(self, tokens, gold, verbose=False, punct_labels=("p", "punct")):
-        if len(tokens) != len(gold):
-            gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
+    def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
+        """Update the evaluation scores from a single Doc / GoldParse pair.
+
+        doc (Doc): The predicted annotations.
+        gold (GoldParse): The correct annotations.
+        verbose (bool): Print debugging information.
+        punct_labels (tuple): Dependency labels for punctuation. Used to
+            evaluate dependency attachments to punctuation if `eval_punct` is
+            `True`.
+
+        DOCS: https://spacy.io/api/scorer#score
+        """
+        if len(doc) != len(gold):
+            gold = GoldParse.from_annot_tuples(doc, zip(*gold.orig_annot))
        gold_deps = set()
        gold_tags = set()
        gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
@@ -96,7 +129,7 @@ class Scorer(object):
                gold_deps.add((id_, head, dep.lower()))
        cand_deps = set()
        cand_tags = set()
-        for token in tokens:
+        for token in doc:
            if token.orth_.isspace():
                continue
            gold_i = gold.cand_to_gold[token.i]
@@ -116,7 +149,7 @@ class Scorer(object):
                    cand_deps.add((gold_i, gold_head, token.dep_.lower()))
        if "-" not in [token[-1] for token in gold.orig_annot]:
            cand_ents = set()
-            for ent in tokens.ents:
+            for ent in doc.ents:
                first = gold.cand_to_gold[ent.start]
                last = gold.cand_to_gold[ent.end - 1]
                if first is None or last is None:
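The `score` method above collects gold and candidate dependencies as sets of `(id, head, label)` tuples, and the `uas` and `las` properties report an F-score scaled by 100. The body of `PRFScore` is not shown in this diff, so the following is only a sketch of how set-based precision, recall and F-score could be computed; `SimplePRF` and the toy dependency tuples are assumptions made for illustration, not spaCy's implementation.

```python
class SimplePRF:
    """Hypothetical precision/recall/F-score accumulator over sets."""

    def __init__(self):
        self.tp = 0  # true positives
        self.fp = 0  # false positives
        self.fn = 0  # false negatives

    def score_set(self, cand, gold):
        self.tp += len(cand & gold)
        self.fp += len(cand - gold)
        self.fn += len(gold - cand)

    @property
    def precision(self):
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self):
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def fscore(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy data: (token index, head index, dependency label).
gold_deps = {(0, 1, "nsubj"), (1, 1, "root"), (2, 1, "dobj")}
cand_deps = {(0, 1, "nsubj"), (1, 1, "root"), (2, 1, "nmod")}

labelled = SimplePRF()
labelled.score_set(cand_deps, gold_deps)

# The unlabelled score ignores the label and compares only (index, head).
unlabelled = SimplePRF()
unlabelled.score_set({d[:2] for d in cand_deps}, {d[:2] for d in gold_deps})

las = labelled.fscore * 100
uas = unlabelled.fscore * 100
print(round(uas, 2), round(las, 2))  # 100.0 66.67
```

In this toy case every head is attached correctly but one label is wrong, so the unlabelled score is perfect while the labelled score drops.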
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -119,8 +119,28 @@ Update the models in the pipeline.
 | `golds`                                      | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
 | `drop`                                       | float    | The dropout rate.                                                                                                                                                                                                   |
 | `sgd`                                        | callable | An optimizer.                                                                                                                                                                                                       |
+| `losses`                                     | dict     | Dictionary to update with the loss, keyed by pipeline component.                                                                                                                                                    |
 | `component_cfg` <Tag variant="new">2.1</Tag> | dict     | Config parameters for specific pipeline components, keyed by component name.                                                                                                                                        |

+## Language.evaluate {#evaluate tag="method"}
+
+Evaluate a model's pipeline components.
+
+> #### Example
+>
+> ```python
+> scorer = nlp.evaluate(docs_golds, verbose=True)
+> print(scorer.scores)
+> ```
+
+| Name                                         | Type     | Description                                                                           |
+| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------- |
+| `docs_golds`                                 | iterable | Tuples of `Doc` and `GoldParse` objects.                                              |
+| `verbose`                                    | bool     | Print debugging information.                                                          |
+| `batch_size`                                 | int      | The batch size to use.                                                                |
+| `scorer`                                     | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | dict     | Config parameters for specific pipeline components, keyed by component name.          |
+
 ## Language.begin_training {#begin_training tag="method"}

 Allocate models, pre-process training data and acquire an optimizer.
--- a/website/docs/api/scorer.md
+++ b/website/docs/api/scorer.md
@@ -0,0 +1,58 @@
+---
+title: Scorer
+teaser: Compute evaluation scores
+tag: class
+source: spacy/scorer.py
+---
+
+The `Scorer` computes and stores evaluation scores. It's typically created by
+[`Language.evaluate`](/api/language#evaluate).
+
+## Scorer.\_\_init\_\_ {#init tag="method"}
+
+Create a new `Scorer`.
+
+> #### Example
+>
+> ```python
+> from spacy.scorer import Scorer
+>
+> scorer = Scorer()
+> ```
+
+| Name         | Type     | Description                                                  |
+| ------------ | -------- | ------------------------------------------------------------ |
+| `eval_punct` | bool     | Evaluate the dependency attachments to and from punctuation. |
+| **RETURNS**  | `Scorer` | The newly created object.                                    |
+
+## Scorer.score {#score tag="method"}
+
+Update the evaluation scores from a single [`Doc`](/api/doc) /
+[`GoldParse`](/api/goldparse) pair.
+
+> #### Example
+>
+> ```python
+> scorer = Scorer()
+> scorer.score(doc, gold)
+> ```
+
+| Name           | Type        | Description                                                                                                          |
+| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- |
+| `doc`          | `Doc`       | The predicted annotations.                                                                                           |
+| `gold`         | `GoldParse` | The correct annotations.                                                                                             |
+| `verbose`      | bool        | Print debugging information.                                                                                         |
+| `punct_labels` | tuple       | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
+
+## Properties
+
+| Name        | Type  | Description                                                                                  |
+| ----------- | ----- | -------------------------------------------------------------------------------------------- |
+| `token_acc` | float | Tokenization accuracy.                                                                       |
+| `tags_acc`  | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`).                           |
+| `uas`       | float | Unlabelled dependency score.                                                                 |
+| `las`       | float | Labelled dependency score.                                                                   |
+| `ents_p`    | float | Named entity accuracy (precision).                                                           |
+| `ents_r`    | float | Named entity accuracy (recall).                                                              |
+| `ents_f`    | float | Named entity accuracy (F-score).                                                             |
+| `scores`    | dict  | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `tags_acc` and `token_acc`. |
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -90,7 +90,8 @@
                    { "text": "StringStore", "url": "/api/stringstore" },
                    { "text": "Vectors", "url": "/api/vectors" },
                    { "text": "GoldParse", "url": "/api/goldparse" },
-                    { "text": "GoldCorpus", "url": "/api/goldcorpus" }
+                    { "text": "GoldCorpus", "url": "/api/goldcorpus" },
+                    { "text": "Scorer", "url": "/api/scorer" }
                ]
            },
            {