spaCy/website/docs/api/tokenizer.md

264 lines
14 KiB
Markdown
Raw Normal View History

---
title: Tokenizer
teaser: Segment text into words, punctuations marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
2020-08-06 20:30:43 +03:00
> #### Default config
>
> ```ini
> [nlp.tokenizer]
> @tokenizers = "spacy.Tokenizer.v1"
> ```
Segment text, and create `Doc` objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
[how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works).
The tokenizer is typically created automatically when a
2020-08-06 20:30:43 +03:00
[`Language`](/api/language) subclass is initialized and it reads its settings
like punctuation and special case rules from the
[`Language.Defaults`](/api/language#defaults) provided by the language subclass.
## Tokenizer.\_\_init\_\_ {#init tag="method"}
2020-10-02 14:24:33 +03:00
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
how to construct a custom tokenizer with different tokenization rules, see the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
> #### Example
>
> ```python
> # Construction 1
> from spacy.tokenizer import Tokenizer
> from spacy.lang.en import English
> nlp = English()
> # Create a blank Tokenizer with just the English vocab
> tokenizer = Tokenizer(nlp.vocab)
>
> # Construction 2
> from spacy.lang.en import English
> nlp = English()
> # Create a Tokenizer with the default settings for English
> # including punctuation rules and exceptions
2020-07-27 01:29:45 +03:00
> tokenizer = nlp.tokenizer
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
| `rules` | Exceptions and special-cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
## Tokenizer.\_\_call\_\_ {#call tag="method"}
Tokenize a string.
> #### Example
>
> ```python
> tokens = tokenizer("This is a sentence")
> assert len(tokens) == 4
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ----------------------------------------------- |
| `string` | The string to tokenize. ~~str~~ |
| **RETURNS** | A container for linguistic annotations. ~~Doc~~ |
## Tokenizer.pipe {#pipe tag="method"}
Tokenize a stream of texts.
> #### Example
>
> ```python
> texts = ["One document.", "...", "Lots of documents"]
> for doc in tokenizer.pipe(texts, batch_size=50):
> pass
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ------------ | ------------------------------------------------------------------------------------ |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
2020-10-02 14:24:33 +03:00
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
## Tokenizer.find_infix {#find_infix tag="method"}
Find internal split points of the string.
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `string` | The string to split. ~~str~~ |
| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. ~~List[Match]~~ |
## Tokenizer.find_prefix {#find_prefix tag="method"}
Find the length of a prefix that should be segmented from the string, or `None`
if no prefix rules match.
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ------------------------------------------------------------------------ |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the prefix if present, otherwise `None`. ~~Optional[int]~~ |
## Tokenizer.find_suffix {#find_suffix tag="method"}
Find the length of a suffix that should be segmented from the string, or `None`
if no suffix rules match.
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ------------------------------------------------------------------------ |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the suffix if present, otherwise `None`. ~~Optional[int]~~ |
## Tokenizer.add_special_case {#add_special_case tag="method"}
Add a special-case tokenization rule. This mechanism is also used to add custom
2020-10-02 14:24:33 +03:00
tokenizer exceptions to the language data. See the usage guide on the
[languages data](/usage/linguistic-features#language-data) and
[tokenizer special cases](/usage/linguistic-features#special-cases) for more
details and examples.
> #### Example
>
> ```python
> from spacy.attrs import ORTH, NORM
> case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
2019-02-21 16:22:06 +03:00
> tokenizer.add_special_case("don't", case)
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `string` | The string to specially tokenize. ~~str~~ |
| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. ~~Iterable[Dict[int, str]]~~ |
Add tokenizer explain() debugging method (#4596) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit f0a577f7a5c67f55807fdbda9e9a936464723931. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs
2019-11-20 15:07:25 +03:00
## Tokenizer.explain {#explain tag="method"}
Tokenize a string with a slow debugging tokenizer that provides information
about which tokenizer rule or pattern was matched for each token. The tokens
produced are identical to `Tokenizer.__call__` except for whitespace tokens.
> #### Example
>
> ```python
> tok_exp = nlp.tokenizer.explain("(don't)")
> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ---------------------------------------------------------------------------- |
| `string` | The string to tokenize with the debugging tokenizer. ~~str~~ |
| **RETURNS** | A list of `(pattern_string, token_string)` tuples. ~~List[Tuple[str, str]]~~ |
Add tokenizer explain() debugging method (#4596) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit f0a577f7a5c67f55807fdbda9e9a936464723931. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs
2019-11-20 15:07:25 +03:00
## Tokenizer.to_disk {#to_disk tag="method"}
Serialize the tokenizer to disk.
> #### Example
>
> ```python
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.to_disk("/path/to/tokenizer")
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
## Tokenizer.from_disk {#from_disk tag="method"}
Load the tokenizer from disk. Modifies the object in place and returns it.
> #### Example
>
> ```python
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.from_disk("/path/to/tokenizer")
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `Tokenizer` object. ~~Tokenizer~~ |
## Tokenizer.to_bytes {#to_bytes tag="method"}
> #### Example
>
> ```python
> tokenizer = tokenizer(nlp.vocab)
> tokenizer_bytes = tokenizer.to_bytes()
> ```
Serialize the tokenizer to a bytestring.
2020-08-17 17:45:24 +03:00
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `Tokenizer` object. ~~bytes~~ |
## Tokenizer.from_bytes {#from_bytes tag="method"}
Load the tokenizer from a bytestring. Modifies the object in place and returns
it.
> #### Example
>
> ```python
> tokenizer_bytes = tokenizer.to_bytes()
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.from_bytes(tokenizer_bytes)
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `Tokenizer` object. ~~Tokenizer~~ |
## Attributes {#attributes}
2020-08-17 17:45:24 +03:00
| Name | Description |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The vocab object of the parent `Doc`. ~~Vocab~~ |
| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
> tokenizer.from_disk("./data", exclude=["token_match"])
> ```
| Name | Description |
| ---------------- | --------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |