Merge branch 'master' into spacy.io

Ines Montani 2019-11-23 17:52:01 +01:00
commit 4b61750985
12 changed files with 293 additions and 162 deletions

.github/contributors/mmaybeno.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Matt Maybeno |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-11-19 |
| GitHub username | mmaybeno |
| Website (optional) | |

@@ -73,7 +73,7 @@ cuda100 =
cupy-cuda100>=5.0.0b4
# Language tokenizers with external dependencies
ja =
mecab-python3==0.7
fugashi>=0.1.3
ko =
natto-py==0.9.0
th =

@@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, PRON
from ...symbols import NOUN, PROPN, PART, INTJ, PRON, AUX
TAG_MAP = {
@@ -4249,4 +4249,20 @@ TAG_MAP = {
"Voice": "Act",
"Case": "Nom|Gen|Dat|Acc|Voc",
},
'ADJ': {POS: ADJ},
'ADP': {POS: ADP},
'ADV': {POS: ADV},
'AtDf': {POS: DET},
'AUX': {POS: AUX},
'CCONJ': {POS: CCONJ},
'DET': {POS: DET},
'NOUN': {POS: NOUN},
'NUM': {POS: NUM},
'PART': {POS: PART},
'PRON': {POS: PRON},
'PROPN': {POS: PROPN},
'SCONJ': {POS: SCONJ},
'SYM': {POS: SYM},
'VERB': {POS: VERB},
'X': {POS: X},
}
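
The hunk above extends a language's `TAG_MAP` so that plain Universal Dependencies labels (and the treebank-specific `AtDf`) resolve to coarse POS symbols. A minimal sketch of how such a mapping is consumed, assuming this hunk is the Greek tag map (`spacy/lang/el/tag_map.py` is an assumed path); the lookup below is illustrative only:

```python
from spacy.symbols import DET, POS
from spacy.lang.el.tag_map import TAG_MAP  # assumed module path

# The new entries let the tagger map "AtDf" (definite article) to DET.
assert TAG_MAP["AtDf"][POS] == DET
```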

@@ -305,6 +305,9 @@ TAG_MAP = {
"VERB__VerbForm=Ger": {"morph": "VerbForm=Ger", POS: VERB},
"VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: VERB},
"X___": {"morph": "_", POS: X},
"___PunctType=Quot": {POS: PUNCT},
"___VerbForm=Inf": {POS: VERB},
"___Number=Sing|Person=2|PronType=Prs": {POS: PRON},
"_SP": {"morph": "_", POS: SPACE},
}
# fmt: on

@@ -12,21 +12,23 @@ from ...tokens import Doc
from ...compat import copy_reg
from ...util import DummyTokenizer
# Handling for multiple spaces in a row is somewhat awkward; this simplifies
# the flow by creating a dummy with the same interface.
DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
DummySpace = DummyNode(' ', ' ', DummyNodeFeatures(' '))
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
def try_mecab_import():
"""Mecab is required for Japanese support, so check for it.
def try_fugashi_import():
"""Fugashi is required for Japanese support, so check for it.
If it's not available, blow up and explain how to fix it."""
try:
import MeCab
import fugashi
return MeCab
return fugashi
except ImportError:
raise ImportError(
"Japanese support requires MeCab: "
"https://github.com/SamuraiT/mecab-python3"
"Japanese support requires Fugashi: "
"https://github.com/polm/fugashi"
)
@@ -39,7 +41,7 @@ def resolve_pos(token):
"""
# this is only used for consecutive ascii spaces
if token.pos == "空白":
if token.surface == " ":
return "空白"
# TODO: This is a first take. The rules here are crude approximations.
@@ -53,55 +55,45 @@ def resolve_pos(token):
return token.pos + ",ADJ"
return token.pos
def get_words_and_spaces(tokenizer, text):
"""Get the individual tokens that make up the sentence and handle white space.
Japanese doesn't usually use white space, and MeCab's handling of it for
multiple spaces in a row is somewhat awkward.
"""
tokens = tokenizer.parseToNodeList(text)
def detailed_tokens(tokenizer, text):
"""Format Mecab output into a nice data structure, based on Janome."""
node = tokenizer.parseToNode(text)
node = node.next # first node is beginning of sentence and empty, skip it
words = []
spaces = []
while node.posid != 0:
surface = node.surface
base = surface # a default value. Updated if available later.
parts = node.feature.split(",")
pos = ",".join(parts[0:4])
if len(parts) > 7:
# this information is only available for words in the tokenizer
# dictionary
base = parts[7]
words.append(ShortUnitWord(surface, base, pos))
# The way MeCab stores spaces is that the rlength of the next token is
# the length of that token plus any preceding whitespace, **in bytes**.
# also note that this is only for half-width / ascii spaces. Full width
# spaces just become tokens.
scount = node.next.rlength - node.next.length
spaces.append(bool(scount))
while scount > 1:
words.append(ShortUnitWord(" ", " ", "空白"))
for token in tokens:
# If there's more than one space, spaces after the first become tokens
for ii in range(len(token.white_space) - 1):
words.append(DummySpace)
spaces.append(False)
scount -= 1
node = node.next
words.append(token)
spaces.append(bool(token.white_space))
return words, spaces
class JapaneseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.tokenizer = try_mecab_import().Tagger()
self.tokenizer.parseToNode("") # see #2901
self.tokenizer = try_fugashi_import().Tagger()
self.tokenizer.parseToNodeList("") # see #2901
def __call__(self, text):
dtokens, spaces = detailed_tokens(self.tokenizer, text)
dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
words = [x.surface for x in dtokens]
doc = Doc(self.vocab, words=words, spaces=spaces)
mecab_tags = []
unidic_tags = []
for token, dtoken in zip(doc, dtokens):
mecab_tags.append(dtoken.pos)
unidic_tags.append(dtoken.pos)
token.tag_ = resolve_pos(dtoken)
token.lemma_ = dtoken.lemma
doc.user_data["mecab_tags"] = mecab_tags
# if there's no lemma info (it's an unk) just use the surface
token.lemma_ = dtoken.feature.lemma or dtoken.surface
doc.user_data["unidic_tags"] = unidic_tags
return doc
@@ -131,5 +123,4 @@ def pickle_japanese(instance):
copy_reg.pickle(Japanese, pickle_japanese)
__all__ = ["Japanese"]
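
Taken together, the changes in this file switch Japanese tokenization from MeCab's `parseToNode()` to fugashi's `parseToNodeList()`, handle ASCII spaces via the `DummySpace` node, and store the coarse UniDic tags under `doc.user_data["unidic_tags"]` instead of `"mecab_tags"`. A minimal usage sketch, assuming fugashi and a UniDic dictionary are installed (the sample sentence is arbitrary):

```python
import spacy

nlp = spacy.blank("ja")
# An ASCII space between words exercises the whitespace handling above.
doc = nlp("これは 文章です。")
print([(t.text, t.tag_, t.lemma_, t.whitespace_) for t in doc])
print(doc.user_data["unidic_tags"])  # coarse UniDic tags collected per token
```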

@@ -5039,5 +5039,19 @@ TAG_MAP = {
"punc": {POS: PUNCT},
"v-pcp|M|P": {POS: VERB},
"v-pcp|M|S": {POS: VERB},
"ADJ": {POS: ADJ},
"AUX": {POS: AUX},
"CCONJ": {POS: CCONJ},
"DET": {POS: DET},
"INTJ": {POS: INTJ},
"NUM": {POS: NUM},
"PART": {POS: PART},
"PRON": {POS: PRON},
"PUNCT": {POS: PUNCT},
"SCONJ": {POS: SCONJ},
"SYM": {POS: SYM},
"VERB": {POS: VERB},
"X": {POS: X},
"adv": {POS: ADV},
"_SP": {POS: SPACE},
}

@@ -125,7 +125,7 @@ def it_tokenizer():
@pytest.fixture(scope="session")
def ja_tokenizer():
pytest.importorskip("MeCab")
pytest.importorskip("fugashi")
return get_lang_class("ja").Defaults.create_tokenizer()
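
The fixture now skips the Japanese tests when fugashi (rather than MeCab) is missing. A hedged sketch of how such a session-scoped fixture is consumed; the test body is illustrative and not part of this diff:

```python
def test_ja_tokenizer_smoke(ja_tokenizer):
    # Skipped automatically on machines without fugashi installed.
    doc = ja_tokenizer("日本語だよ")
    assert len(doc) > 0
```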

@@ -3,7 +3,6 @@
from __future__ import unicode_literals
from libc.string cimport memcpy
import numpy
import srsly
from collections import OrderedDict
from thinc.neural.util import get_array_module
@@ -361,7 +360,8 @@ cdef class Vocab:
minn = len(word)
if maxn is None:
maxn = len(word)
vectors = numpy.zeros((self.vectors_length,), dtype="f")
xp = get_array_module(self.vectors.data)
vectors = xp.zeros((self.vectors_length,), dtype="f")
# Fasttext's ngram computation taken from
# https://github.com/facebookresearch/fastText
ngrams_size = 0;
@@ -381,7 +381,7 @@ cdef class Vocab:
j = j + 1
if (n >= minn and not (n == 1 and (i == 0 or j == len(word)))):
if self.strings[ngram] in self.vectors.key2row:
vectors = numpy.add(self.vectors[self.strings[ngram]],vectors)
vectors = xp.add(self.vectors[self.strings[ngram]], vectors)
ngrams_size += 1
n = n + 1
if ngrams_size > 0:
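
The hunk above swaps direct `numpy` calls for the module returned by thinc's `get_array_module`, so the FastText-style ngram fallback also works when the vectors table is a CuPy array on the GPU. A standalone sketch of that pattern (the helper name `sum_rows` is made up):

```python
import numpy
from thinc.neural.util import get_array_module

def sum_rows(data):
    # xp is numpy for numpy arrays and cupy for cupy arrays, so the same
    # code runs against CPU and GPU vector tables.
    xp = get_array_module(data)
    total = xp.zeros((data.shape[1],), dtype="f")
    for row in data:
        total = xp.add(row, total)
    return total

print(sum_rows(numpy.arange(6, dtype="f").reshape(2, 3)))  # [3. 5. 7.]
```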

@@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation.
## Attributes {#attributes}
| Name | Type | Description |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `vocab` | `Vocab` | The lexeme's vocabulary. |
| `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. |
@@ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation.
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `lower` | int | Lowercase form of the word. |
| `lower_` | unicode | Lowercase form of the word. |
| `shape` | int | Transform of the word's string, to show orthographic features. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. |
| `shape` | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |

@@ -409,7 +409,7 @@ The L2 norm of the token's vector representation.
## Attributes {#attributes}
| Name | Type | Description |
| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. |
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
| `text` | unicode | Verbatim text content. |
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| `shape_` | unicode | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
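
For illustration, a short sketch of the shape transform described in the rows above (assumes a blank English pipeline; the output strings are what the truncation rule should produce):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple released iOS 13.2.3 in 2019")
print([(t.text, t.shape_) for t in doc])
# e.g. ("Apple", "Xxxxx") and ("2019", "dddd"); "released" comes out as "xxxx"
# because runs of the same character class are truncated after length 4.
```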

@@ -638,7 +638,7 @@ punctuation depending on the
The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
anything about the length. However, you can use the `SHAPE` flag, with each `d`
representing a digit:
representing a digit (up to 4 digits / characters):
```python
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
@@ -654,7 +654,7 @@ match the most common formats of
```python
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}]
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
```
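
Because `SHAPE` values are truncated after four repeated characters, a six-digit token still has the shape `"dddd"`, which is why the updated pattern pairs the shape with an explicit `LENGTH` of 6. A hedged usage sketch with the `Matcher` (the phone number is made up):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
           {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
matcher.add("PHONE_DE", None, pattern)

doc = nlp("You can reach us at +49 (0341) 123456.")
print([doc[start:end].text for _, start, end in matcher(doc)])
```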
Depending on the formats your application needs to match, creating an extensive

@@ -155,7 +155,8 @@
"name": "Japanese",
"dependencies": [
{ "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
{ "name": "Mecab", "url": "https://github.com/taku910/mecab" }
{ "name": "Mecab", "url": "https://github.com/taku910/mecab" },
{ "name": "fugashi", "url": "https://github.com/polm/fugashi" }
],
"example": "これは文章です。",
"has_examples": true