Merge branch 'master' into spacy.io

Ines Montani 2019-11-18 12:36:04 +01:00
commit 2a38fd00bd


@@ -715,7 +715,7 @@ assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]

The special case rules have precedence over the punctuation splitting:

```python
nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```

@@ -725,40 +725,52 @@ spaCy introduces a novel tokenization algorithm, that gives a better balance

between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.

Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(text, special_cases, prefix_search, suffix_search,
                          infix_finditer, token_match):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Split off prefixes and suffixes, re-checking the special cases
            # after each split so that rules like "don't" still match.
            while prefix_search(substring) or suffix_search(substring):
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ''
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            # Handle what's left: a special case, a token match, infixes, or
            # a plain single token.
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif token_match(substring):
                tokens.append(substring)
                substring = ''
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ''
            elif substring:
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens
```

@@ -767,16 +779,18 @@ def tokenizer_pseudo_code(text, special_cases,

The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Check whether we have an explicitly defined rule for this substring. If we
   do, use it.
3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to
   #2, so that special cases always get priority.
4. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
5. If we can't consume a prefix or a suffix, look for a special case.
6. Next, look for a token match.
7. Look for "infixes" — stuff like hyphens etc. and split the substring into
   tokens on all infixes.
8. Once we can't consume any more of the string, handle it as a single token.
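
To see these steps play out on the "(don't)!" example from above, here is a
small usage sketch of the pseudo-code. The special case, prefix, suffix and
infix patterns below are toy assumptions chosen for illustration, not spaCy's
actual defaults, and the call assumes the `tokenizer_pseudo_code` function
defined above:

```python
import re

# Toy resources for illustration only (not spaCy's real defaults)
special_cases = {"don't": ["do", "n't"]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"'!?]$''')
infix_re = re.compile(r'''[-~]''')

tokens = tokenizer_pseudo_code(
    "(don't)!",
    special_cases,
    prefix_re.search,
    suffix_re.search,
    infix_re.finditer,
    token_match=lambda s: None,  # no whole-token matches in this sketch
)
print(tokens)  # ['(', 'do', "n't", ')', '!']
```

The open bracket is split off as a prefix, the exclamation mark and the close
bracket as suffixes, and the remaining "don't" is handled by the special case,
just as described above.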

### Customizing spaCy's Tokenizer class {#native-tokenizers}

@@ -791,9 +805,10 @@ domain. There are five things you would need to define:
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should
   never be split, overriding the infix rules. Useful for things like URLs or
   numbers. Note that prefixes and suffixes will be split off before
   `token_match` is applied.

You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its
@@ -805,21 +820,23 @@ import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc])  # ['hello', '-', 'world.', ':)']
```

If you need to subclass the tokenizer instead, the relevant methods to
@@ -838,15 +855,16 @@ only be applied at the **end of a token**, so your expression should end with a

</Infobox>

#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes.
The default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults` and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
rules. spaCy ships with utility functions to help you compile the regular
expressions, for example
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

```python
@@ -855,8 +873,15 @@ suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

Similarly, you can remove a character from the default suffixes:

```python
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\[")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

The `Tokenizer.suffix_search` attribute should be a function which takes a
unicode string and returns a **regex match object** or `None`. Usually we use
the `.search` attribute of a compiled regex object, but you can use some other
@@ -866,12 +891,62 @@ function that behaves the same way.
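
As a minimal sketch of that, the hypothetical wrapper below delegates to the
compiled default suffix regex but refuses to split suffixes off strings ending
in "^)". The exception rule is made up purely for illustration:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

def custom_suffix_search(string):
    # Made-up exception: leave emoticon-like strings such as "(^_^)" intact,
    # otherwise behave exactly like the compiled default suffix regex.
    if string.endswith("^)"):
        return None
    return suffix_re.search(string)

nlp.tokenizer.suffix_search = custom_suffix_search
```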

If you're using a statistical model, writing to the `nlp.Defaults` or
`English.Defaults` directly won't work, since the regular expressions are read
from the model and will be compiled when you load it. If you modify
`nlp.Defaults`, you'll only see the effect if you call
[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`.
If you want to modify the tokenizer loaded from a statistical model, you should
modify `nlp.tokenizer` directly.

</Infobox>

The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen
between letters as an infix. If you do not want the tokenizer to split on
hyphens between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):
```python
### {executable="true"}
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
# modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # EDIT: commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother-in-law']
```
For an overview of the default regular expressions, see
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py)
and language-specific definitions such as
[`lang/de/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/de/punctuation.py)
for German.

### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}

The tokenizer is the first component of the processing pipeline and the only one