title: Tokenizer
teaser: Segment text into words, punctuation marks, etc.
tag: class
source: spacy/tokenizer.pyx

Default config

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

Segment text, and create Doc objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy's tokenizer works. The tokenizer is typically created automatically when a Language subclass is initialized and it reads its settings like punctuation and special case rules from the Language.Defaults provided by the language subclass.
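
The tokenizer entry point can also be swapped out through the config system. As a rough sketch (the registry name "whitespace_tokenizer.v1" and the WhitespaceTokenizer class below are only illustrative, not part of spaCy), a custom tokenizer factory can be registered and then referenced from [nlp.tokenizer] in place of "spacy.Tokenizer.v1":

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    # Illustrative tokenizer that only splits on single spaces
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

@spacy.registry.tokenizers("whitespace_tokenizer.v1")  # illustrative name
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer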

Tokenizer.__init__

Create a Tokenizer to create Doc objects given unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.

Example

# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

| Name | Description |
| --- | --- |
| `vocab` | A storage container for lexical types. `Vocab` |
| `rules` | Exceptions and special-cases for the tokenizer. `Optional[Dict[str, List[Dict[int, str]]]]` |
| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. `Optional[Callable[[str], Optional[Match]]]` |
| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. `Optional[Callable[[str], Optional[Match]]]` |
| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. `Optional[Callable[[str], Iterator[Match]]]` |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. `Optional[Callable[[str], Optional[Match]]]` |
| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. `Optional[Callable[[str], Optional[Match]]]` |
| `faster_heuristics` (new in v3.3.0) | Whether to restrict the final `Matcher`-based pass for rules to those containing affixes or space. Defaults to `True`. `bool` |
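
For a sketch of a tokenizer that reuses the language defaults but overrides url_match (the simple_url_re pattern below is only illustrative):

import re
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

nlp = English()
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)
# Illustrative pattern only: keep simple http(s) URLs as single tokens
simple_url_re = re.compile(r"^https?://\S+$")
tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    url_match=simple_url_re.match,
)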

Tokenizer.__call__

Tokenize a string.

Example

tokens = tokenizer("This is a sentence")
assert len(tokens) == 4

| Name | Description |
| --- | --- |
| `string` | The string to tokenize. `str` |
| **RETURNS** | A container for linguistic annotations. `Doc` |

Tokenizer.pipe

Tokenize a stream of texts.

Example

texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass

| Name | Description |
| --- | --- |
| `texts` | A sequence of unicode texts. `Iterable[str]` |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. `int` |
| **YIELDS** | The tokenized `Doc` objects, in order. `Doc` |

Tokenizer.find_infix

Find internal split points of the string.

| Name | Description |
| --- | --- |
| `string` | The string to split. `str` |
| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. `List[Match]` |
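
As a rough illustration, assuming the default English punctuation rules (where a hyphen between letters counts as an infix):

from spacy.lang.en import English

nlp = English()
# find_infix applies infix_finditer to the string and returns the matches
matches = nlp.tokenizer.find_infix("hard-working")
assert [m.group() for m in matches] == ["-"]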

Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

| Name | Description |
| --- | --- |
| `string` | The string to segment. `str` |
| **RETURNS** | The length of the prefix if present, otherwise `None`. `Optional[int]` |
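
For example, assuming the default English rules, where an opening parenthesis is a prefix:

from spacy.lang.en import English

nlp = English()
assert nlp.tokenizer.find_prefix("(Hello") == 1    # the "(" prefix is one character long
assert nlp.tokenizer.find_prefix("Hello") is None  # no prefix rule matches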

Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

| Name | Description |
| --- | --- |
| `string` | The string to segment. `str` |
| **RETURNS** | The length of the suffix if present, otherwise `None`. `Optional[int]` |
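
Similarly, assuming the default English rules, where a closing parenthesis is a suffix:

from spacy.lang.en import English

nlp = English()
assert nlp.tokenizer.find_suffix("Hello)") == 1    # the ")" suffix is one character long
assert nlp.tokenizer.find_suffix("Hello") is None  # no suffix rule matches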

Tokenizer.add_special_case

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on the language data and tokenizer special cases for more details and examples.

Example

from spacy.attrs import ORTH, NORM
case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
tokenizer.add_special_case("don't", case)

| Name | Description |
| --- | --- |
| `string` | The string to specially tokenize. `str` |
| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. `Iterable[Dict[int, str]]` |
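
To see the effect of a special case in isolation, here is a sketch that uses a blank Tokenizer (no affix rules or exceptions), so the split comes entirely from the added rule:

from spacy.attrs import ORTH, NORM
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
tokenizer = Tokenizer(nlp.vocab)  # no rules of any kind yet
assert [t.text for t in tokenizer("don't")] == ["don't"]

tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}])
doc = tokenizer("don't")
assert [t.text for t in doc] == ["do", "n't"]
assert doc[1].norm_ == "not"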

Tokenizer.explain

Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The tokens produced are identical to Tokenizer.__call__ except for whitespace tokens.

Example

tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]

| Name | Description |
| --- | --- |
| `string` | The string to tokenize with the debugging tokenizer. `str` |
| **RETURNS** | A list of `(pattern_string, token_string)` tuples. `List[Tuple[str, str]]` |

Tokenizer.to_disk

Serialize the tokenizer to disk.

Example

tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")

| Name | Description |
| --- | --- |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. `Union[str, Path]` |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. `Iterable[str]` |

Tokenizer.from_disk

Load the tokenizer from disk. Modifies the object in place and returns it.

Example

tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")

| Name | Description |
| --- | --- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. `Union[str, Path]` |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. `Iterable[str]` |
| **RETURNS** | The modified `Tokenizer` object. `Tokenizer` |

Tokenizer.to_bytes

Serialize the tokenizer to a bytestring.

Example

tokenizer = Tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()

| Name | Description |
| --- | --- |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. `Iterable[str]` |
| **RETURNS** | The serialized form of the `Tokenizer` object. `bytes` |

Tokenizer.from_bytes

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example

tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)

| Name | Description |
| --- | --- |
| `bytes_data` | The data to load from. `bytes` |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. `Iterable[str]` |
| **RETURNS** | The `Tokenizer` object. `Tokenizer` |

Attributes

| Name | Description |
| --- | --- |
| `vocab` | The vocab object of the parent `Doc`. `Vocab` |
| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. `Optional[Callable[[str], Optional[Match]]]` |
| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. `Optional[Callable[[str], Optional[Match]]]` |
| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. `Optional[Callable[[str], Iterator[Match]]]` |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. `Optional[Callable[[str], Optional[Match]]]` |
| `rules` | A dictionary of tokenizer exceptions and special cases. `Optional[Dict[str, List[Dict[int, str]]]]` |
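
These attributes can be inspected or replaced on an existing tokenizer to customize it in place. A small sketch (the handle pattern is purely illustrative):

import re
from spacy.lang.en import English

nlp = English()
print(len(nlp.tokenizer.rules))  # number of special cases from the English defaults

# Keep Twitter-style handles as single tokens by overriding token_match
handle_re = re.compile(r"^@\w+$")
nlp.tokenizer.token_match = handle_re.match
assert [t.text for t in nlp("@spacy_io rocks")] == ["@spacy_io", "rocks"]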

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
tokenizer.from_disk("./data", exclude=["token_match"])

| Name | Description |
| --- | --- |
| `vocab` | The shared `Vocab`. |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |