Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current tokenizer behavior. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications.
parent 5adcb352e9
commit 62e00fd9da
@ -715,7 +715,7 @@ assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]
The special case rules have precedence over the punctuation splitting:

```python
nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```

@ -725,40 +725,52 @@ spaCy introduces a novel tokenization algorithm, that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(text, special_cases, prefix_search, suffix_search,
                          infix_finditer, token_match):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Split off prefixes and suffixes until a special case matches
            # or no further prefix/suffix can be found
            while prefix_search(substring) or suffix_search(substring):
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ''
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif token_match(substring):
                # token_match keeps the remaining substring as a single token
                tokens.append(substring)
                substring = ''
            elif list(infix_finditer(substring)):
                # Split the remaining substring on all infix matches
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ''
            elif substring:
                tokens.append(substring)
                substring = ''
        # Suffixes were split off right-to-left, so restore their order
        tokens.extend(reversed(suffixes))
    return tokens
```

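To see how the control flow plays out, the pseudo-code above can be run directly. The rules below are simplified stand-ins invented for this sketch (plain strings as special case entries and a small, made-up set of prefix, suffix and infix patterns), not spaCy's real defaults:

```python
import re

special_cases = {"don't": ["do", "n't"]}      # toy special case, plain strings
prefix_re = re.compile(r'''^[\[\("']''')      # made-up prefix characters
suffix_re = re.compile(r'''[\]\)"'!?.,]$''')  # made-up suffix characters
infix_re = re.compile(r'''[-~]''')
url_re = re.compile(r'''^https?://''')        # stands in for token_match

print(tokenizer_pseudo_code("(don't)!", special_cases, prefix_re.search,
                            suffix_re.search, infix_re.finditer, url_re.match))
# ['(', 'do', "n't", ')', '!']
```

This reproduces the walk-through above: the open bracket is split off as a prefix, the exclamation mark and the close bracket as suffixes, and the remaining "don't" is resolved by the special case.
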
@ -767,16 +779,18 @@ def tokenizer_pseudo_code(text, special_cases,

The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Check whether we have an explicitly defined rule for this substring. If we
   do, use it.
3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to
   #2, so that special cases always get priority.
4. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
5. If we can't consume a prefix or a suffix, look for a special case.
6. Next, look for a token match.
7. Look for "infixes" — stuff like hyphens etc. and split the substring into
   tokens on all infixes.
8. Once we can't consume any more of the string, handle it as a single token.

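The same example can be checked against spaCy's English defaults. Assuming a blank English pipeline, the bracketed contraction from the walk-through above should come out like this:

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline with the default English tokenizer
print([t.text for t in nlp("(don't)!")])
# expected: ['(', 'do', "n't", ')', '!']
```
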
### Customizing spaCy's Tokenizer class {#native-tokenizers}
@ -791,9 +805,10 @@ domain. There are five things you would need to define:
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should
   never be split, overriding the infix rules. Useful for things like URLs or
   numbers. Note that prefixes and suffixes will be split off before
   `token_match` is applied.

You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its

@ -805,21 +820,23 @@ import re
```python
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc]) # ['hello', '-', 'world.', ':)']
```
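
Since prefixes and suffixes are split off before `token_match` is applied, a URL wrapped in quotes should still survive as a single token with the custom tokenizer defined above. The input string and the expected output here are illustrative, not taken from the spaCy docs:

```python
doc = nlp('"https://example.com" is a URL')
print([t.text for t in doc])
# expected: ['"', 'https://example.com', '"', 'is', 'a', 'URL']
```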

If you need to subclass the tokenizer instead, the relevant methods to

@ -838,15 +855,16 @@ only be applied at the **end of a token**, so your expression should end with a

</Infobox>

#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes.
The default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults`, and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
rules. spaCy ships with utility functions to help you compile the regular
expressions – for example,
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

@ -855,8 +873,15 @@ suffix_regex = spacy.util.compile_suffix_regex(suffixes)
```python
# "suffixes" is the tuple of default suffix rules (nlp.Defaults.suffixes),
# extended or trimmed as needed
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

For an overview of the default regular expressions, see
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py).
Similarly, you can remove a character from the default suffixes:

```python
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\[")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

The `Tokenizer.suffix_search` attribute should be a function which takes a
unicode string and returns a **regex match object** or `None`. Usually we use
the `.search` attribute of a compiled regex object, but you can use some other
function that behaves the same way.

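For instance, a plain Python function can wrap the compiled default regex and veto certain matches. The exception list below is hypothetical and only illustrates the contract: string in, match object or `None` out.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
default_suffix_search = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes).search

NO_SUFFIX_SPLIT = {"Python3.", "v2.2."}  # hypothetical strings to leave untouched

def custom_suffix_search(text):
    # Same contract as re_obj.search: return a match object or None
    if text in NO_SUFFIX_SPLIT:
        return None
    return default_suffix_search(text)

nlp.tokenizer.suffix_search = custom_suffix_search
```
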
@ -866,12 +891,62 @@ function that behaves the same way.

If you're using a statistical model, writing to the `nlp.Defaults` or
`English.Defaults` directly won't work, since the regular expressions are read
from the model and will be compiled when you load it. If you modify
`nlp.Defaults`, you'll only see the effect if you call
[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`.
If you want to modify the tokenizer loaded from a statistical model, you should
modify `nlp.tokenizer` directly.

</Infobox>

The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen
between letters as an infix. If you do not want the tokenizer to split on
hyphens between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):

```python
### {executable="true"}
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']

# modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # EDIT: commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother-in-law']
```

For an overview of the default regular expressions, see
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py)
and language-specific definitions such as
[`lang/de/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/de/punctuation.py)
for German.

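To get a feel for what these rule sets contain, you can print a few of the pattern strings the tokenizer is compiled from; the exact entries depend on your spaCy version and language:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(list(nlp.Defaults.prefixes)[:5])   # regex pattern strings for prefixes
print(list(nlp.Defaults.suffixes)[:5])   # ... for suffixes
print(list(nlp.Defaults.infixes)[:3])    # ... for infixes
```
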
### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}
The tokenizer is the first component of the processing pipeline and the only one