* Add tokenizer option to allow Matcher handling for all rules
Add tokenizer option `with_faster_rules_heuristics` that determines
whether the special cases applied by the internal `Matcher` are filtered
by whether they contain affixes or space. If `True` (default), the rules
are filtered to prioritize speed over rare edge cases. If `False`, all
rules are included in the final `Matcher`-based pass over the doc.
* Reset all caches when reloading special cases
* Revert "Reset all caches when reloading special cases"
This reverts commit 4ef6bd171d.
* Initialize max_length properly
* Add new tag to API docs
* Rename to faster heuristics
Segment text, and create Doc objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
how spaCy's tokenizer works.
The tokenizer is typically created automatically when a
Language subclass is initialized and it reads its settings
like punctuation and special case rules from the
Language.Defaults provided by the language subclass.
Tokenizer.__init__
Create a Tokenizer to create Doc objects given unicode text. For examples of
how to construct a custom tokenizer with different tokenization rules, see the
usage documentation.
Example
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer
Name
Description
vocab
A storage container for lexical types. Vocab
rules
Exceptions and special-cases for the tokenizer. Optional[Dict[str, List[Dict[int, str]]]]
prefix_search
A function matching the signature of re.compile(string).search to match prefixes. Optional[Callable[[str], Optional[Match]]]
suffix_search
A function matching the signature of re.compile(string).search to match suffixes. Optional[Callable[[str], Optional[Match]]]
infix_finditer
A function matching the signature of re.compile(string).finditer to find infixes. Optional[Callable[[str], Iterator[Match]]]
token_match
A function matching the signature of re.compile(string).match to find token matches. Optional[Callable[[str], Optional[Match]]]
url_match
A function matching the signature of re.compile(string).match to find token matches after considering prefixes and suffixes. Optional[Callable[[str], Optional[Match]]]
faster_heuristics 3.3.0
Whether to restrict the final Matcher-based pass for rules to those containing affixes or space. Defaults to True. bool
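The search, finditer and match arguments in the table above are plain callables, so the bound methods of compiled regular expressions fit them directly. The following is a minimal sketch using only the standard library; the patterns are illustrative assumptions and are much simpler than the defaults a Language subclass provides.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not spaCy's defaults).
prefix_re = re.compile(r"^[\(\[\{]")
suffix_re = re.compile(r"[\)\]\}.,!?]$")
infix_re = re.compile(r"[-~]")
url_re = re.compile(r"^https?://\S+$")

# Bound methods with the signatures named in the table above.
prefix_search = prefix_re.search    # str -> Optional[Match]
suffix_search = suffix_re.search    # str -> Optional[Match]
infix_finditer = infix_re.finditer  # str -> Iterator[Match]
token_match = url_re.match          # str -> Optional[Match]

print(prefix_search("(hello").group())                    # (
print(suffix_search("hello)").group())                    # )
print([m.group() for m in infix_finditer("well-known")])  # ['-']
print(token_match("hello") is None)                       # True
```

Callables like these could then be passed by keyword when constructing the tokenizer, for example Tokenizer(nlp.vocab, prefix_search=prefix_search, suffix_search=suffix_search, infix_finditer=infix_finditer, token_match=token_match).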
Tokenizer.__call__
Tokenize a string.
Example
tokens = tokenizer("This is a sentence")
assert len(tokens) == 4
Name
Description
string
The string to tokenize. str
RETURNS
A container for linguistic annotations. Doc
Tokenizer.pipe
Tokenize a stream of texts.
Example
texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
Name
Description
texts
A sequence of unicode texts. Iterable[str]
batch_size
The number of texts to accumulate in an internal buffer. Defaults to 1000. int
YIELDS
The tokenized Doc objects, in order. Doc
Tokenizer.find_infix
Find internal split points of the string.
Name
Description
string
The string to split. str
RETURNS
A list of re.MatchObject objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens. List[Match]
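To illustrate how the returned matches mark internal split points, here is a small standard-library sketch. The infix pattern and the helper split_on_infixes are hypothetical and exist only for this example; they are not part of the Tokenizer API.

```python
import re

# Hypothetical infix pattern: hyphens and tildes inside a string.
infix_finditer = re.compile(r"[-~]").finditer


def split_on_infixes(string):
    """Split a string at each infix match, keeping the infixes as pieces."""
    pieces = []
    start = 0
    for match in infix_finditer(string):
        # Text between the previous separator and this one.
        if match.start() > start:
            pieces.append(string[start:match.start()])
        # The separator itself becomes its own piece.
        pieces.append(match.group())
        start = match.end()
    # Trailing text after the last separator.
    if start < len(string):
        pieces.append(string[start:])
    return pieces


print(split_on_infixes("state-of-the-art"))
# ['state', '-', 'of', '-', 'the', '-', 'art']
```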
Tokenizer.find_prefix
Find the length of a prefix that should be segmented from the string, or None
if no prefix rules match.
Name
Description
string
The string to segment. str
RETURNS
The length of the prefix if present, otherwise None. Optional[int]
Tokenizer.find_suffix
Find the length of a suffix that should be segmented from the string, or None
if no suffix rules match.
Name
Description
string
The string to segment. str
RETURNS
The length of the suffix if present, otherwise None. Optional[int]
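The length-or-None contract described for find_prefix and find_suffix can be sketched with plain regular expressions. This is an approximation under assumed patterns, written as free functions for illustration; it is not the library's implementation.

```python
import re
from typing import Optional

# Assumed, simplified affix patterns for this sketch.
prefix_search = re.compile(r"^[\(\[\{]").search
suffix_search = re.compile(r"[\)\]\}.,!?]$").search


def find_prefix(string: str) -> Optional[int]:
    """Return the length of the matched prefix, or None if nothing matches."""
    match = prefix_search(string)
    return match.end() - match.start() if match is not None else None


def find_suffix(string: str) -> Optional[int]:
    """Return the length of the matched suffix, or None if nothing matches."""
    match = suffix_search(string)
    return match.end() - match.start() if match is not None else None


text = "(hello)"
prefix_len = find_prefix(text)
suffix_len = find_suffix(text)
print(prefix_len, suffix_len)                      # 1 1
print(text[prefix_len:len(text) - suffix_len])     # hello
print(find_prefix("hello"))                        # None
```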
Tokenizer.add_special_case
Add a special-case tokenization rule. This mechanism is also used to add custom
tokenizer exceptions to the language data. See the usage guides on
language data and
tokenizer special cases for more
details and examples.
A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated. Iterable[Dict[int, str]]
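The requirement that the ORTH values concatenate to exactly the original string can be checked before registering a rule. The sketch below uses plain string keys as stand-ins for the attribute IDs, and the helper orth_matches is hypothetical; both are assumptions made for illustration.

```python
# Stand-in attribute keys for this sketch (assumption: plain strings
# instead of the integer attribute IDs shown in the type annotation).
ORTH = "ORTH"
NORM = "NORM"


def orth_matches(string, token_attrs):
    """Check that the ORTH values join back to exactly the original string."""
    return "".join(attrs[ORTH] for attrs in token_attrs) == string


valid_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
invalid_case = [{ORTH: "do"}, {ORTH: "not"}]

print(orth_matches("don't", valid_case))    # True
print(orth_matches("don't", invalid_case))  # False
```

Only ORTH has to reproduce the text; other attributes such as NORM are free to differ from the surface form.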
Tokenizer.explain
Tokenize a string with a slow debugging tokenizer that provides information
about which tokenizer rule or pattern was matched for each token. The tokens
produced are identical to Tokenizer.__call__ except for whitespace tokens.
prefix_search
A function to find segment boundaries from the start of a string. Returns the length of the segment, or None. Optional[Callable[[str], Optional[Match]]]
suffix_search
A function to find segment boundaries from the end of a string. Returns the length of the segment, or None. Optional[Callable[[str], Optional[Match]]]
infix_finditer
A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of re.MatchObject objects. Optional[Callable[[str], Iterator[Match]]]
token_match
A function matching the signature of re.compile(string).match to find token matches. Returns an re.MatchObject or None. Optional[Callable[[str], Optional[Match]]]
rules
A dictionary of tokenizer exceptions and special cases. Optional[Dict[str, List[Dict[int, str]]]]
Serialization fields
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the exclude argument.