	Add docs on adding to existing tokenizer rules [ci skip]
parent 1ea1bc98e7
commit 403b9cd58b
@@ -812,6 +812,40 @@ only be applied at the **end of a token**, so your expression should end with a
 
 </Infobox>
 
+#### Adding to existing rule sets {#native-tokenizer-additions}
+
+In many situations, you don't necessarily need entirely custom rules. Sometimes
+you just want to add another character to the prefixes, suffixes or infixes. The
+default prefix, suffix and infix rules are available via the `nlp` object's
+`Defaults` and the [`Tokenizer.suffix_search`](/api/tokenizer#attributes)
+attribute is writable, so you can overwrite it with a compiled regular
+expression object using the modified default rules. spaCy ships with utility
+functions to help you compile the regular expressions – for example,
+[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):
+
+```python
+suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
+suffix_regex = spacy.util.compile_suffix_regex(suffixes)
+nlp.tokenizer.suffix_search = suffix_regex.search
+```
+
+For an overview of the default regular expressions, see
+[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py).
+The `Tokenizer.suffix_search` attribute should be a function which takes a
+unicode string and returns a **regex match object** or `None`. Usually we use
+the `.search` attribute of a compiled regex object, but you can use some other
+function that behaves the same way.
+
+<Infobox title="Important note" variant="warning">
+
+If you're using a statistical model, writing to the `nlp.Defaults` or
+`English.Defaults` directly won't work, since the regular expressions are read
+from the model and will be compiled when you load it. You'll only see the effect
+if you call [`spacy.blank`](/api/top-level#spacy.blank) or
+`Defaults.create_tokenizer()`.
+
+</Infobox>
+
 ### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}
 
 The tokenizer is the first component of the processing pipeline and the only one
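The added docs note that `Tokenizer.suffix_search` only needs to be a callable
that takes a string and returns a regex match object or `None`. A minimal
sketch of such a function (the `custom_suffix_search` name and the blank
English pipeline are illustrative, not part of the commit):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
# Compile the default suffix rules plus the extra trailing-hyphen pattern
# used in the example above.
suffix_regex = compile_suffix_regex(nlp.Defaults.suffixes + (r"-+$",))

def custom_suffix_search(text):
    # Same contract as a compiled regex's .search: return a match object
    # if a suffix should be split off, otherwise None.
    return suffix_regex.search(text)

nlp.tokenizer.suffix_search = custom_suffix_search
```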
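The warning infobox can be read as: changes to the class-level `Defaults` only
affect tokenizers created afterwards. A rough sketch under that assumption
(v2.x API; the model name and the commented-out lines are hypothetical, not
from the commit):

```python
import spacy
from spacy.lang.en import English

# Add the extra suffix rule on the class-level defaults *before* a
# tokenizer is created from them.
English.Defaults.suffixes = English.Defaults.suffixes + (r"-+$",)

# A blank pipeline created afterwards picks up the modified rules ...
nlp = spacy.blank("en")

# ... whereas a loaded statistical model keeps the rules stored with the
# model, unless you attach a freshly created tokenizer:
# nlp = spacy.load("en_core_web_sm")
# nlp.tokenizer = English.Defaults.create_tokenizer(nlp)
```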