
---
title: Tokenizer
teaser: Segment text into words, punctuation marks etc.
tag: class
source: spacy/tokenizer.pyx
---

Segment text, and create `Doc` objects with the discovered segment boundaries.

## Tokenizer.\_\_init\_\_

Create a `Tokenizer` to produce `Doc` objects given unicode text.

Example:

```python
# Construction 1
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
tokenizer = English().Defaults.create_tokenizer(nlp)
```

| Name             | Type        | Description                                                                             |
| ---------------- | ----------- | --------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab`     | A storage container for lexical types.                                                   |
| `rules`          | dict        | Exceptions and special-cases for the tokenizer.                                          |
| `prefix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match prefixes.     |
| `suffix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match suffixes.     |
| `infix_finditer` | callable    | A function matching the signature of `re.compile(string).finditer` to find infixes.     |
| `token_match`    | callable    | A boolean function matching strings to be recognized as tokens.                          |
| **RETURNS**      | `Tokenizer` | The newly constructed object.                                                            |
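
To see how the rule arguments fit together, here is a hedged construction sketch. The regular expressions are illustrative assumptions, not spaCy's bundled language defaults:

```python
import re
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
# Illustrative patterns, not the shipped language defaults
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')

tokenizer = Tokenizer(nlp.vocab, rules={},
                      prefix_search=prefix_re.search,
                      suffix_search=suffix_re.search,
                      infix_finditer=infix_re.finditer)
doc = tokenizer(u'("Well-known")')
```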

## Tokenizer.\_\_call\_\_

Tokenize a string.

Example:

```python
tokens = tokenizer(u"This is a sentence")
assert len(tokens) == 4
```

| Name        | Type    | Description                              |
| ----------- | ------- | ---------------------------------------- |
| `string`    | unicode | The string to tokenize.                  |
| **RETURNS** | `Doc`   | A container for linguistic annotations.  |
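
Because the return value is an ordinary `Doc`, the individual tokens can be inspected directly; a short sketch:

```python
tokens = tokenizer(u"This is a sentence")
# Each item is a Token; its .text attribute holds the underlying string
assert [t.text for t in tokens] == ["This", "is", "a", "sentence"]
```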

## Tokenizer.pipe

Tokenize a stream of texts.

Example:

```python
texts = [u"One document.", u"...", u"Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
```

| Name         | Type  | Description                                             |
| ------------ | ----- | ------------------------------------------------------- |
| `texts`      | -     | A sequence of unicode texts.                            |
| `batch_size` | int   | The number of texts to accumulate in an internal buffer. |
| **YIELDS**   | `Doc` | A sequence of `Doc` objects, in order.                  |

## Tokenizer.find_infix

Find internal split points of the string.

| Name        | Type    | Description                                                                                                                                  |
| ----------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `string`    | unicode | The string to split.                                                                                                                          |
| **RETURNS** | list    | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
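
This entry ships without an example, so here is a hedged sketch. It assumes the tokenizer of a default `English` pipeline, whose infix rules treat an intra-word hyphen as a separator; the matches you get back depend on the language's infix patterns:

```python
from spacy.lang.en import English

nlp = English()
# The default English infix rules should match the hyphen here
for match in nlp.tokenizer.find_infix(u"well-known"):
    print(match.start(), match.end(), match.group())
```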

## Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or `None` if no prefix rules match.

| Name        | Type         | Description                                            |
| ----------- | ------------ | ------------------------------------------------------ |
| `string`    | unicode      | The string to segment.                                 |
| **RETURNS** | int / `None` | The length of the prefix if present, otherwise `None`. |
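
A hedged sketch, assuming the default `English` tokenizer, whose prefix rules include an opening parenthesis:

```python
from spacy.lang.en import English

nlp = English()
# "(" should be recognized as a one-character prefix under the
# default English rules; the result depends on the language data
print(nlp.tokenizer.find_prefix(u"(hello"))  # expected: 1
```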

## Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or `None` if no suffix rules match.

| Name        | Type         | Description                                            |
| ----------- | ------------ | ------------------------------------------------------ |
| `string`    | unicode      | The string to segment.                                 |
| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. |
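
The same kind of hedged sketch for suffixes, again assuming the default `English` rules:

```python
from spacy.lang.en import English

nlp = English()
# "!" should be recognized as a one-character suffix under the
# default English rules; the result depends on the language data
print(nlp.tokenizer.find_suffix(u"hello!"))  # expected: 1
```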

## Tokenizer.add_special_case

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on adding languages for more details and examples.

Example:

```python
from spacy.attrs import ORTH, LEMMA
case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
tokenizer.add_special_case("don't", case)
```

| Name          | Type     | Description                                                                                                                               |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `string`      | unicode  | The string to specially tokenize.                                                                                                            |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
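
Once the special case above has been added, tokenizing the string produces the two tokens it defines; a short sketch assuming the rule was registered on `tokenizer`:

```python
tokens = tokenizer(u"don't")
# The special case overrides the default segmentation
assert [t.text for t in tokens] == ["do", "n't"]
```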

## Tokenizer.to_disk

Serialize the tokenizer to disk.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")
```

| Name   | Type           | Description                                                                                                       |
| ------ | -------------- | ------------------------------------------------------------------------------------------------------------------ |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

## Tokenizer.from_disk

Load the tokenizer from disk. Modifies the object in place and returns it.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")
```

| Name        | Type             | Description                                                                  |
| ----------- | ---------------- | ----------------------------------------------------------------------------- |
| `path`      | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects.    |
| **RETURNS** | `Tokenizer`      | The modified `Tokenizer` object.                                              |
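
A hedged end-to-end sketch of the disk roundtrip, written against a temporary directory so it can run anywhere:

```python
import tempfile
from pathlib import Path
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
tokenizer = Tokenizer(nlp.vocab)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"
    tokenizer.to_disk(path)    # serialize to disk
    tokenizer.from_disk(path)  # restore the object in place
```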

## Tokenizer.to_bytes

Serialize the tokenizer to a bytestring.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()
```

| Name        | Type  | Description                                       |
| ----------- | ----- | ------------------------------------------------- |
| `**exclude` | -     | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object.    |

## Tokenizer.from_bytes

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example:

```python
tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)
```

| Name         | Type        | Description                                    |
| ------------ | ----------- | ---------------------------------------------- |
| `bytes_data` | bytes       | The data to load from.                         |
| `**exclude`  | -           | Named attributes to prevent from being loaded. |
| **RETURNS**  | `Tokenizer` | The `Tokenizer` object.                        |

## Attributes

| Name             | Type    | Description                                                                                                         |
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab` | The vocab object of the parent `Doc`.                                                                                 |
| `prefix_search`  | -       | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`.       |
| `suffix_search`  | -       | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`.         |
| `infix_finditer` | -       | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
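
Because these attributes are writable, custom rules can be swapped in on an existing tokenizer; a hedged sketch with an illustrative infix pattern:

```python
import re
from spacy.lang.en import English

nlp = English()
# Replace the infix rules with a single illustrative pattern that
# splits on hyphens and tildes, overriding the language defaults
infix_re = re.compile(r"[-~]")
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp(u"well-known")])  # expected: ['well', '-', 'known']
```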