* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
Segment text, and create Doc objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
how spaCy's tokenizer works.
Tokenizer.__init__
Create a Tokenizer, to create Doc objects given unicode text. For examples
of how to construct a custom tokenizer with different tokenization rules, see
the
usage documentation.
Example
```python
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.Defaults.create_tokenizer(nlp)
```
| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | A storage container for lexical types. |
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
| `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. |
| **RETURNS** | `Tokenizer` | The newly constructed object. |
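As an illustration of how these arguments fit together, here is a minimal sketch of a custom tokenizer built from simplified, made-up patterns (not the real English defaults) that uses `token_match` to keep URLs intact:

```python
import re
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
# Simplified example patterns - not spaCy's actual English defaults
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
url_re = re.compile(r'''^https?://\S+$''')

custom_tokenizer = Tokenizer(nlp.vocab,
                             prefix_search=prefix_re.search,
                             suffix_search=suffix_re.search,
                             infix_finditer=infix_re.finditer,
                             token_match=url_re.match)
doc = custom_tokenizer("See https://example.com (docs)")
print([t.text for t in doc])
```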
Tokenizer.__call__
Tokenize a string.
Example
```python
tokens = tokenizer("This is a sentence")
assert len(tokens) == 4
```
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to tokenize. |
| **RETURNS** | `Doc` | A container for linguistic annotations. |
Tokenizer.pipe
Tokenize a stream of texts.
Example
```python
texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
```
| Name | Type | Description |
| --- | --- | --- |
| `texts` | - | A sequence of unicode texts. |
| `batch_size` | int | The number of texts to accumulate in an internal buffer. Defaults to 1000. |
| **YIELDS** | `Doc` | A sequence of `Doc` objects, in order. |
Tokenizer.find_infix
Find internal split points of the string.
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to split. |
| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
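For instance, assuming the default English punctuation rules (which treat a hyphen between letters as an infix), the returned match objects can be inspected directly:

```python
from spacy.lang.en import English

nlp = English()
# With the default English rules, the hyphen in "non-trivial" is an infix
for match in nlp.tokenizer.find_infix("non-trivial"):
    print(match.start(), match.end(), match.group())
```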
Tokenizer.find_prefix
Find the length of a prefix that should be segmented from the string, or None
if no prefix rules match.
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to segment. |
| **RETURNS** | int / None | The length of the prefix if present, otherwise `None`. |
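For instance, assuming the default English punctuation rules, an opening parenthesis is matched as a one-character prefix:

```python
from spacy.lang.en import English

nlp = English()
# "(" is matched as a prefix of length 1 by the default English rules
assert nlp.tokenizer.find_prefix("(hello") == 1
```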
Tokenizer.find_suffix
Find the length of a suffix that should be segmented from the string, or None
if no suffix rules match.
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to segment. |
| **RETURNS** | int / None | The length of the suffix if present, otherwise `None`. |
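Similarly, assuming the default English punctuation rules, a closing parenthesis is matched as a one-character suffix:

```python
from spacy.lang.en import English

nlp = English()
# ")" is matched as a suffix of length 1 by the default English rules
assert nlp.tokenizer.find_suffix("hello)") == 1
```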
Tokenizer.add_special_case
Add a special-case tokenization rule. This mechanism is also used to add custom
tokenizer exceptions to the language data. See the usage guide on
adding languages and
linguistic features for more details
and examples.
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to specially tokenize. |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
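For example, a special case can split a single string into several tokens. The sketch below uses only the `ORTH` attribute; the `ORTH` values concatenate back to the original string:

```python
from spacy.attrs import ORTH
from spacy.lang.en import English

nlp = English()
# Split "gimme" into two tokens; the ORTH values concatenate to "gimme"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
doc = nlp("gimme that")
assert [t.text for t in doc] == ["gim", "me", "that"]
```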
Tokenizer.explain
Tokenize a string with a slow debugging tokenizer that provides information
about which tokenizer rule or pattern was matched for each token. The tokens
produced are identical to Tokenizer.__call__ except for whitespace tokens.
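For example, the sketch below prints the `(pattern, token)` tuples described above; the exact pattern labels (such as `PREFIX` or `SPECIAL-1`) depend on which rule or pattern matched:

```python
from spacy.lang.en import English

nlp = English()
tok_exp = nlp.tokenizer.explain("(don't)")
for pattern, token_text in tok_exp:
    print(token_text, "\t", pattern)
# Expected output along the lines of:
# (     PREFIX
# do    SPECIAL-1
# n't   SPECIAL-2
# )     SUFFIX
```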
Attributes

| Name | Type | Description |
| --- | --- | --- |
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
| `token_match` | - | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
Serialization fields
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the exclude argument.
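For example, a minimal sketch that skips the shared vocab (assuming `vocab` as the field name, as with other serializable objects):

```python
from spacy.lang.en import English

nlp = English()
# Serialize the tokenizer, excluding the shared vocab data
data = nlp.tokenizer.to_bytes(exclude=["vocab"])
```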