	Add docs on adding to existing tokenizer rules [ci skip]
parent 1ea1bc98e7
commit 403b9cd58b
@@ -812,6 +812,40 @@ only be applied at the **end of a token**, so your expression should end with a
 
 </Infobox>
 
+#### Adding to existing rule sets {#native-tokenizer-additions}
+
+In many situations, you don't necessarily need entirely custom rules. Sometimes
+you just want to add another character to the prefixes, suffixes or infixes. The
+default prefix, suffix and infix rules are available via the `nlp` object's
+`Defaults` and the [`Tokenizer.suffix_search`](/api/tokenizer#attributes)
+attribute is writable, so you can overwrite it with a compiled regular
+expression object using the modified default rules. spaCy ships with utility
+functions to help you compile the regular expressions – for example,
+[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):
+
+```python
+suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
+suffix_regex = spacy.util.compile_suffix_regex(suffixes)
+nlp.tokenizer.suffix_search = suffix_regex.search
+```
+
+For an overview of the default regular expressions, see
+[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py).
+The `Tokenizer.suffix_search` attribute should be a function which takes a
+unicode string and returns a **regex match object** or `None`. Usually we use
+the `.search` attribute of a compiled regex object, but you can use some other
+function that behaves the same way.
+
+<Infobox title="Important note" variant="warning">
+
+If you're using a statistical model, writing to the `nlp.Defaults` or
+`English.Defaults` directly won't work, since the regular expressions are read
+from the model and will be compiled when you load it. You'll only see the effect
+if you call [`spacy.blank`](/api/top-level#spacy.blank) or
+`Defaults.create_tokenizer()`.
+
+</Infobox>
+
 ### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}
 
 The tokenizer is the first component of the processing pipeline and the only one
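The added docs note that `Tokenizer.suffix_search` only needs to be a callable
that takes a string and returns a regex match object or `None`. A minimal
sketch of such a function (the `custom_suffix_search` name and the blank
English pipeline are illustrative, not part of the commit):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
# Compile the default suffix rules plus the extra trailing-hyphen pattern
# used in the example above.
suffix_regex = compile_suffix_regex(nlp.Defaults.suffixes + (r"-+$",))

def custom_suffix_search(text):
    # Same contract as a compiled regex's .search: return a match object
    # if a suffix should be split off, otherwise None.
    return suffix_regex.search(text)

nlp.tokenizer.suffix_search = custom_suffix_search
```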
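The warning infobox can be read as: changes to the class-level `Defaults` only
affect tokenizers created afterwards. A rough sketch under that assumption
(v2.x API; the model name and the commented-out lines are hypothetical, not
from the commit):

```python
import spacy
from spacy.lang.en import English

# Add the extra suffix rule on the class-level defaults *before* a
# tokenizer is created from them.
English.Defaults.suffixes = English.Defaults.suffixes + (r"-+$",)

# A blank pipeline created afterwards picks up the modified rules ...
nlp = spacy.blank("en")

# ... whereas a loaded statistical model keeps the rules stored with the
# model, unless you attach a freshly created tokenizer:
# nlp = spacy.load("en_core_web_sm")
# nlp.tokenizer = English.Defaults.create_tokenizer(nlp)
```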