Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-05 14:10:34 +03:00)
Document Tokenizer(token_match) and clarify tokenizer_pseudo_code
Closes #835. In the `tokenizer_pseudo_code` I put the `special_cases` kwarg before `find_prefix` because this now matches the order the args are used in the pseudocode, and it also matches spaCy's actual code.
This commit is contained in:
parent 2f8d535f65
commit b6ebedd09c
@@ -87,8 +87,8 @@ p
     | algorithm in Python, optimized for readability rather than performance:

 +code.
-    def tokenizer_pseudo_code(text, find_prefix, find_suffix,
-                              find_infixes, special_cases):
+    def tokenizer_pseudo_code(text, special_cases,
+                              find_prefix, find_suffix, find_infixes):
         tokens = []
         for substring in text.split(' '):
             suffixes = []
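The new argument order is easier to appreciate with the rest of the loop in view. Below is a rough, self-contained sketch of such a whitespace-then-affix tokenizer, for illustration only (it is not the exact pseudo-code from the docs page; the helpers are assumed to return a match length or `None`, and `find_infixes` a list of regex match objects). It shows why `special_cases` now leads the signature: it is the first thing consulted for every substring.

    def tokenizer_pseudo_code(text, special_cases,
                              find_prefix, find_suffix, find_infixes):
        tokens = []
        for substring in text.split(' '):
            suffixes = []
            while substring:
                if substring in special_cases:
                    # Special cases win over all affix rules, hence the
                    # argument order: special_cases is consulted first.
                    tokens.extend(special_cases[substring])
                    substring = ''
                elif find_prefix(substring) is not None:
                    split = find_prefix(substring)
                    tokens.append(substring[:split])
                    substring = substring[split:]
                elif find_suffix(substring) is not None:
                    split = find_suffix(substring)
                    suffixes.append(substring[-split:])
                    substring = substring[:-split]
                elif find_infixes(substring):
                    offset = 0
                    for match in find_infixes(substring):
                        if match.start() > offset:
                            tokens.append(substring[offset:match.start()])
                        tokens.append(substring[match.start():match.end()])
                        offset = match.end()
                    substring = substring[offset:]
                else:
                    tokens.append(substring)
                    substring = ''
            tokens.extend(reversed(suffixes))
        return tokens

    # Example with trivially simple (hypothetical) helpers:
    # tokenizer_pseudo_code("don't stop", {"don't": ["do", "n't"]},
    #                       lambda s: None, lambda s: None, lambda s: [])
    # -> ['do', "n't", 'stop']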
@@ -140,7 +140,7 @@ p
 p
     | Let's imagine you wanted to create a tokenizer for a new language. There
-    | are four things you would need to define:
+    | are five things you would need to define:

 +list("numbers")
     +item
@@ -162,6 +162,11 @@ p
         | A function #[code infixes_finditer], to handle non-whitespace
         | separators, such as hyphens etc.

+    +item
+        | (Optional) A boolean function #[code token_match] matching strings
+        | that should never be split, overriding the previous rules.
+        | Useful for things like URLs or numbers.
+
 p
     | You shouldn't usually need to create a #[code Tokenizer] subclass.
     | Standard usage is to use #[code re.compile()] to build a regular
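The new list item describes `token_match` as a boolean-style predicate. A minimal sketch of what such a callable could look like (hypothetical, not part of this commit; any callable that returns something truthy for strings that must stay whole would do):

    import re

    # A bound regex method works as a token_match ...
    simple_url_re = re.compile(r'''^https?://''')
    url_match = simple_url_re.match

    # ... and so does a plain function (hypothetical helper, not spaCy API):
    def keep_whole(substring):
        return substring.startswith('@') or simple_url_re.match(substring)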
@@ -175,11 +180,15 @@ p
     prefix_re = re.compile(r'''[\[\("']''')
     suffix_re = re.compile(r'''[\]\)"']''')
     infix_re = re.compile(r'''[-~]''')
+    simple_url_re = re.compile(r'''^https?://''')
     def create_tokenizer(nlp):
-        return Tokenizer(nlp.vocab, rules={},
+        return Tokenizer(nlp.vocab,
+                         rules={},
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
-                         infix_finditer=infix_re.finditer)
+                         infix_finditer=infix_re.finditer,
+                         token_match=simple_url_re.match
+                         )

     nlp = spacy.load('en', create_make_doc=create_tokenizer)
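Put together, the documented example can be exercised roughly as follows. This is a sketch against the spaCy 1.x API shown on this page (`spacy.load('en', create_make_doc=...)`); in later spaCy versions you would typically assign the custom tokenizer to `nlp.tokenizer` instead, and the sample sentence here is made up:

    import re
    import spacy
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']''')
    infix_re = re.compile(r'''[-~]''')
    simple_url_re = re.compile(r'''^https?://''')

    def create_tokenizer(nlp):
        return Tokenizer(nlp.vocab,
                         rules={},
                         prefix_search=prefix_re.search,
                         suffix_search=suffix_re.search,
                         infix_finditer=infix_re.finditer,
                         token_match=simple_url_re.match)

    nlp = spacy.load('en', create_make_doc=create_tokenizer)  # spaCy 1.x style, as on this page
    doc = nlp(u"Visit https://spacy.io (it's great)")         # made-up sample text
    print([t.text for t in doc])
    # token_match is documented above as overriding the prefix/suffix/infix
    # rules, so the URL should come through as a single token.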