
---
title: Tokenizer
teaser: Segment text into words, punctuation marks etc.
tag: class
source: spacy/tokenizer.pyx
---

Segment text, and create `Doc` objects with the discovered segment boundaries.

## Tokenizer.\_\_init\_\_

Create a `Tokenizer` to produce `Doc` objects given unicode text.

Example:

```python
# Construction 1
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
tokenizer = English().Defaults.create_tokenizer(nlp)
```

| Name             | Type        | Description                                                                             |
| ---------------- | ----------- | --------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab`     | A storage container for lexical types.                                                   |
| `rules`          | dict        | Exceptions and special-cases for the tokenizer.                                          |
| `prefix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match prefixes.     |
| `suffix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match suffixes.     |
| `infix_finditer` | callable    | A function matching the signature of `re.compile(string).finditer` to find infixes.     |
| `token_match`    | callable    | A boolean function matching strings to be recognized as tokens.                          |
| **RETURNS**      | `Tokenizer` | The newly constructed object.                                                            |
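
To see how the rule arguments fit together, here is a hedged construction sketch. The regular expressions are illustrative assumptions, not spaCy's bundled language defaults:

```python
import re
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
# Illustrative patterns, not the shipped language defaults
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')

tokenizer = Tokenizer(nlp.vocab, rules={},
                      prefix_search=prefix_re.search,
                      suffix_search=suffix_re.search,
                      infix_finditer=infix_re.finditer)
doc = tokenizer(u'("Well-known")')
```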

## Tokenizer.\_\_call\_\_

Tokenize a string.

Example:

```python
tokens = tokenizer(u"This is a sentence")
assert len(tokens) == 4
```

| Name        | Type    | Description                              |
| ----------- | ------- | ---------------------------------------- |
| `string`    | unicode | The string to tokenize.                  |
| **RETURNS** | `Doc`   | A container for linguistic annotations.  |
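
Because the return value is an ordinary `Doc`, the individual tokens can be inspected directly; a short sketch:

```python
tokens = tokenizer(u"This is a sentence")
# Each item is a Token; its .text attribute holds the underlying string
assert [t.text for t in tokens] == ["This", "is", "a", "sentence"]
```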

## Tokenizer.pipe

Tokenize a stream of texts.

Example:

```python
texts = [u"One document.", u"...", u"Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
```

| Name         | Type  | Description                                             |
| ------------ | ----- | ------------------------------------------------------- |
| `texts`      | -     | A sequence of unicode texts.                            |
| `batch_size` | int   | The number of texts to accumulate in an internal buffer. |
| **YIELDS**   | `Doc` | A sequence of `Doc` objects, in order.                  |

## Tokenizer.find_infix

Find internal split points of the string.

| Name        | Type    | Description                                                                                                                                  |
| ----------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `string`    | unicode | The string to split.                                                                                                                          |
| **RETURNS** | list    | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
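
This entry ships without an example, so here is a hedged sketch. It assumes the tokenizer of a default `English` pipeline, whose infix rules treat an intra-word hyphen as a separator; the matches you get back depend on the language's infix patterns:

```python
from spacy.lang.en import English

nlp = English()
# The default English infix rules should match the hyphen here
for match in nlp.tokenizer.find_infix(u"well-known"):
    print(match.start(), match.end(), match.group())
```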

## Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or `None` if no prefix rules match.

| Name        | Type         | Description                                            |
| ----------- | ------------ | ------------------------------------------------------ |
| `string`    | unicode      | The string to segment.                                 |
| **RETURNS** | int / `None` | The length of the prefix if present, otherwise `None`. |
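
A hedged sketch, assuming the default `English` tokenizer, whose prefix rules include an opening parenthesis:

```python
from spacy.lang.en import English

nlp = English()
# "(" should be recognized as a one-character prefix under the
# default English rules; the result depends on the language data
print(nlp.tokenizer.find_prefix(u"(hello"))  # expected: 1
```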

## Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or `None` if no suffix rules match.

| Name        | Type         | Description                                            |
| ----------- | ------------ | ------------------------------------------------------ |
| `string`    | unicode      | The string to segment.                                 |
| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. |
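
The same kind of hedged sketch for suffixes, again assuming the default `English` rules:

```python
from spacy.lang.en import English

nlp = English()
# "!" should be recognized as a one-character suffix under the
# default English rules; the result depends on the language data
print(nlp.tokenizer.find_suffix(u"hello!"))  # expected: 1
```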

## Tokenizer.add_special_case

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on adding languages for more details and examples.

Example:

```python
from spacy.attrs import ORTH, LEMMA
case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
tokenizer.add_special_case("don't", case)
```

| Name          | Type     | Description                                                                                                                               |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `string`      | unicode  | The string to specially tokenize.                                                                                                            |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
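
Once the special case above has been added, tokenizing the string produces the two tokens it defines; a short sketch assuming the rule was registered on `tokenizer`:

```python
tokens = tokenizer(u"don't")
# The special case overrides the default segmentation
assert [t.text for t in tokens] == ["do", "n't"]
```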

## Tokenizer.to_disk

Serialize the tokenizer to disk.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")
```

| Name   | Type           | Description                                                                                                       |
| ------ | -------------- | ------------------------------------------------------------------------------------------------------------------ |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

## Tokenizer.from_disk

Load the tokenizer from disk. Modifies the object in place and returns it.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")
```

| Name        | Type             | Description                                                                  |
| ----------- | ---------------- | ----------------------------------------------------------------------------- |
| `path`      | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects.    |
| **RETURNS** | `Tokenizer`      | The modified `Tokenizer` object.                                              |
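
A hedged end-to-end sketch of the disk roundtrip, written against a temporary directory so it can run anywhere:

```python
import tempfile
from pathlib import Path
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
tokenizer = Tokenizer(nlp.vocab)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"
    tokenizer.to_disk(path)    # serialize to disk
    tokenizer.from_disk(path)  # restore the object in place
```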

## Tokenizer.to_bytes

Serialize the tokenizer to a bytestring.

Example:

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()
```

| Name        | Type  | Description                                       |
| ----------- | ----- | ------------------------------------------------- |
| `**exclude` | -     | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object.    |

## Tokenizer.from_bytes

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example:

```python
tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)
```

| Name         | Type        | Description                                    |
| ------------ | ----------- | ---------------------------------------------- |
| `bytes_data` | bytes       | The data to load from.                         |
| `**exclude`  | -           | Named attributes to prevent from being loaded. |
| **RETURNS**  | `Tokenizer` | The `Tokenizer` object.                        |

## Attributes

| Name             | Type    | Description                                                                                                         |
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab` | The vocab object of the parent `Doc`.                                                                                 |
| `prefix_search`  | -       | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`.       |
| `suffix_search`  | -       | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`.         |
| `infix_finditer` | -       | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
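
Because these attributes are writable, custom rules can be swapped in on an existing tokenizer; a hedged sketch with an illustrative infix pattern:

```python
import re
from spacy.lang.en import English

nlp = English()
# Replace the infix rules with a single illustrative pattern that
# splits on hyphens and tildes, overriding the language defaults
infix_re = re.compile(r"[-~]")
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp(u"well-known")])  # expected: ['well', '-', 'known']
```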