1
1
mirror of https://github.com/explosion/spaCy.git synced 2025-01-16 20:46:24 +03:00
spaCy/website/docs/api/tokenizer.md
adrianeboyd 2c876eb672 Add tokenizer explain() debugging method ()
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it to handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00

13 KiB

title teaser tag source
Tokenizer Segment text into words, punctuations marks etc. class spacy/tokenizer.pyx

Segment text, and create Doc objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy's tokenizer works.

Tokenizer.__init__

Create a Tokenizer, to create Doc objects given unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.

Example

# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.Defaults.create_tokenizer(nlp)
Name Type Description
vocab Vocab A storage container for lexical types.
rules dict Exceptions and special-cases for the tokenizer.
prefix_search callable A function matching the signature of re.compile(string).search to match prefixes.
suffix_search callable A function matching the signature of re.compile(string).search to match suffixes.
infix_finditer callable A function matching the signature of re.compile(string).finditer to find infixes.
token_match callable A function matching the signature of `re.compile(string).match to find token matches.
RETURNS Tokenizer The newly constructed object.

Tokenizer.__call__

Tokenize a string.

Example

tokens = tokenizer("This is a sentence")
assert len(tokens) == 4
Name Type Description
string unicode The string to tokenize.
RETURNS Doc A container for linguistic annotations.

Tokenizer.pipe

Tokenize a stream of texts.

Example

texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
Name Type Description
texts - A sequence of unicode texts.
batch_size int The number of texts to accumulate in an internal buffer. Defaults to 1000.
YIELDS Doc A sequence of Doc objects, in order.

Tokenizer.find_infix

Find internal split points of the string.

Name Type Description
string unicode The string to split.
RETURNS list A list of re.MatchObject objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens.

Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

Name Type Description
string unicode The string to segment.
RETURNS int The length of the prefix if present, otherwise None.

Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

Name Type Description
string unicode The string to segment.
RETURNS int / None The length of the suffix if present, otherwise None.

Tokenizer.add_special_case

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on adding languages and linguistic features for more details and examples.

Example

from spacy.attrs import ORTH, NORM
case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
tokenizer.add_special_case("don't", case)
Name Type Description
string unicode The string to specially tokenize.
token_attrs iterable A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated.

Tokenizer.explain

Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The tokens produced are identical to Tokenizer.__call__ except for whitespace tokens.

Example

tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
Name Type Description
string unicode The string to tokenize with the debugging tokenizer
RETURNS list A list of (pattern_string, token_string) tuples

Tokenizer.to_disk

Serialize the tokenizer to disk.

Example

tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")
Name Type Description
path unicode / Path A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects.
exclude list String names of serialization fields to exclude.

Tokenizer.from_disk

Load the tokenizer from disk. Modifies the object in place and returns it.

Example

tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")
Name Type Description
path unicode / Path A path to a directory. Paths may be either strings or Path-like objects.
exclude list String names of serialization fields to exclude.
RETURNS Tokenizer The modified Tokenizer object.

Tokenizer.to_bytes

Example

tokenizer = tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()

Serialize the tokenizer to a bytestring.

Name Type Description
exclude list String names of serialization fields to exclude.
RETURNS bytes The serialized form of the Tokenizer object.

Tokenizer.from_bytes

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example

tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)
Name Type Description
bytes_data bytes The data to load from.
exclude list String names of serialization fields to exclude.
RETURNS Tokenizer The Tokenizer object.

Attributes

Name Type Description
vocab Vocab The vocab object of the parent Doc.
prefix_search - A function to find segment boundaries from the start of a string. Returns the length of the segment, or None.
suffix_search - A function to find segment boundaries from the end of a string. Returns the length of the segment, or None.
infix_finditer - A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of re.MatchObject objects.
token_match - A function matching the signature of re.compile(string).match to find token matches. Returns an re.MatchObjectorNone.
rules dict A dictionary of tokenizer exceptions and special cases.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
tokenizer.from_disk("./data", exclude=["token_match"])
Name Description
vocab The shared Vocab.
prefix_search The prefix rules.
suffix_search The suffix rules.
infix_finditer The infix rules.
token_match The token match expression.
exceptions The tokenizer exception rules.