Update usage docs for lemmatization and morphology

Adriane Boyd 2020-08-29 15:56:50 +02:00
parent e1e1760fd6
commit f9ed31a757
7 changed files with 267 additions and 72 deletions

View File

@ -25,9 +25,10 @@ added to your pipeline, and not a hidden part of the vocab that runs behind the
scenes. This makes it easier to customize how lemmas should be assigned in your
pipeline.
If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS
(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger),
[`Morphologizer`](/api/morphologizer) or another component assigning POS is
available in the pipeline and runs _before_ the lemmatizer.
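As a rough sketch (assuming the required rule tables are available via
`spacy-lookups-data` and that the morphologizer is trained before use), such a
pipeline could be assembled like this:
```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("en")
# a component that assigns Token.pos (note: it still needs to be trained)
nlp.add_pipe("morphologizer")
# the rule-based lemmatizer runs after it and uses the assigned POS
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
print(nlp.pipe_names)  # ['morphologizer', 'lemmatizer']
```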
</Infobox>

View File

@ -22,15 +22,15 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
> nlp_de = German() # Includes German data
> ```
| Name | Description |
| ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] [`spacy-lookups-data`][spacy-lookups-data] | Custom lemmatizer implementation and lemmatization tables. |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@ -44,4 +44,6 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[lemmatizer.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/lemmatizer.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data

View File

@ -1,9 +1,9 @@
When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc`
object. The `Doc` is then processed in several different steps – this is also
referred to as the **processing pipeline**. The pipeline used by the
[default models](/models) typically include a tagger, a lemmatizer, a parser and
an entity recognizer. Each pipeline component returns the processed `Doc`, which
is then passed on to the next component.
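As a quick sketch (assuming a pretrained package such as `en_core_web_sm` is
installed), you can inspect which components a loaded pipeline contains and the
order they run in:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
# the components that will process the Doc, in the order they are applied
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```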
![The processing pipeline](../../images/pipeline.svg)
@ -12,15 +12,19 @@ passed on to the next component.
> - **Creates:** Objects, attributes and properties modified and set by the
> component.
| Name | Component | Creates | Description |
| -------------- | ------------------------------------------- | --------------------------------------------------------- | -------------------------------- |
| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
| **custom**     | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                   | Assign custom attributes, methods or properties. |
The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer

View File

@ -52,9 +52,9 @@ $ pip install -U spacy
To install additional data tables for lemmatization you can run
`pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately. The lookups package is needed to provide normalization and
lemmatization data for new models and to lemmatize in languages that don't yet
come with pretrained models and aren't powered by third-party libraries.
</Infobox>

View File

@ -3,6 +3,8 @@ title: Linguistic Features
next: /usage/rule-based-matching
menu:
- ['POS Tagging', 'pos-tagging']
- ['Morphology', 'morphology']
- ['Lemmatization', 'lemmatization']
- ['Dependency Parse', 'dependency-parse']
- ['Named Entities', 'named-entities']
- ['Entity Linking', 'entity-linking']
@ -10,7 +12,8 @@ menu:
- ['Merging & Splitting', 'retokenization']
- ['Sentence Segmentation', 'sbd']
- ['Vectors & Similarity', 'vectors-similarity']
- ['Mappings & Exceptions', 'mappings-exceptions']
- ['Language Data', 'language-data']
---
Processing raw text intelligently is difficult: most words are rare, and it's
@ -37,7 +40,7 @@ in the [models directory](/models).
</Infobox>
## Morphology {#morphology}
Inflectional morphology is the process by which a root form of a word is
modified by adding prefixes or suffixes that specify its grammatical function
@ -45,33 +48,147 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:
| Context | Surface | Lemma | POS |  Morphological Features |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features. The attribute `Token.morph_` provides the morphological analysis in
the Universal Dependencies FEATS format.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper.")
token = doc[0] # "I"
assert token.morph_ == "Case=Nom|Number=Sing|Person=1|PronType=Prs"
assert token.morph.get("PronType") == ["Prs"]
```
### Statistical morphology {#morphologizer new="3" model="morphologizer"}
spaCy v3 includes a statistical morphologizer component that assigns the
morphological features and POS as `Token.morph` and `Token.pos`.
```python
### {executable="true"}
import spacy
nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # 'Where are you?'
assert doc[2].morph_ == "Case=Nom|Number=Sing|Person=2|PronType=Prs"
assert doc[2].pos_ == "PRON"
```
### Rule-based morphology {#rule-based-morphology}
For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
token text and fine-grained part-of-speech tags to produce coarse-grained
part-of-speech tags and morphological features.
1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
tag**. In the API, these tags are known as `Token.tag`. They express the
part-of-speech (e.g. verb) and some amount of morphological information, e.g.
that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn
Treebank).
2. For words whose coarse-grained POS is not set by a prior process, a
[mapping table](#mappings-exceptions) maps the fine-grained tags to
coarse-grained POS tags and morphological features.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
assert doc[2].morph_ == "Case=Nom|Person=2|PronType=Prs"
assert doc[2].pos_ == "PRON"
```
## Lemmatization {#lemmatization model="lemmatizer" new="3"}
The [`Lemmatizer`](/api/lemmatizer) is a configurable pipeline component that
provides lookup and rule-based lemmatization methods. An individual language can
extend the `Lemmatizer` as part of its [language data](#language-data).
```python
### {executable="true"}
import spacy
# English models include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
assert lemmatizer.mode == "rule"
doc = nlp("I was reading the paper.")
assert doc[1].lemma_ == "be"
assert doc[2].lemma_ == "read"
```
<Infobox title="Important note" variant="warning">
Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a
tagger is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to
include a `lemmatizer` component. A `lemmatizer` is configured to use a single
mode such as `"lookup"` or `"rule"` on initialization. The `"rule"` mode
requires `Token.pos` to be set by a previous component.
</Infobox>
The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided models already include all the required tables, but if you are
creating new models, you'll probably want to install `spacy-lookups-data` to
provide the data when the lemmatizer is initialized.
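As a small sketch (assuming a pretrained English package is installed), you can
check which lookup tables a lemmatizer has loaded:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
# the rule-based English lemmatizer loads its data into lookup tables
print(lemmatizer.lookups.tables)
# e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']
```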
### Lookup lemmatizer {#lemmatizer-lookup}
For models without a tagger or morphologizer, a lookup lemmatizer can be added
to the pipeline as long as a lookup table is provided, typically through
`spacy-lookups-data`. The lookup lemmatizer looks up the token surface form in
the lookup table without reference to the token's part-of-speech or context.
```python
# pip install spacy-lookups-data
import spacy
nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
### Rule-based lemmatizer {#lemmatizer-rule}
When training models that include a component that assigns POS (a morphologizer
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based
lemmatizer can be added using rule tables from `spacy-lookups-data`:
```python
# pip install spacy-lookups-data
import spacy
nlp = spacy.blank("de")
# morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```
The rule-based deterministic lemmatizer maps the surface form to a lemma in
light of the previously assigned coarse-grained part-of-speech and morphological
information, without consulting the context of the token. The rule-based
lemmatizer also accepts list-based exception files. For English, these are
acquired from [WordNet](https://wordnet.princeton.edu/).
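To give a sense of what these tables contain, here is a simplified, made-up
excerpt (not the real `spacy-lookups-data` tables): the rules map a
coarse-grained POS to suffix rewrites, and the exceptions list irregular forms
directly.
```python
# simplified, illustrative entries keyed by coarse-grained POS
lemma_rules = {
    "verb": [
        ["ing", ""],  # "reading" -> "read"
        ["ed", "e"],  # "baked"   -> "bake"
    ],
}
lemma_exc = {
    "verb": {
        "was": ["be"],  # irregular forms are listed explicitly
    },
}
```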
## Dependency Parsing {#dependency-parse model="parser"}
@ -420,7 +537,7 @@ on a token, it will return an empty string.
>
> #### BILUO Scheme
>
> - `B` Token is the **beginning** of a multi-token entity.
> - `I` Token is **inside** a multi-token entity.
> - `L` Token is the **last** token of a multi-token entity.
> - `U` Token is a single-token **unit** entity.
@ -1574,6 +1691,75 @@ doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```
## Mappings & Exceptions {#mappings-exceptions new="3"}
The [`AttributeRuler`](/api/attributeruler) manages rule-based mappings and
exceptions for all token-level attributes. As the number of pipeline components
has grown from spaCy v2 to v3, handling rules and exceptions in each component
individually has become impractical, so the `AttributeRuler` provides a single
component with a unified pattern format for all token attribute mappings and
exceptions.
The `AttributeRuler` uses [`Matcher`
patterns](/usage/rule-based-matching#adding-patterns) to identify tokens and
then assigns them the provided attributes. If needed, the `Matcher` patterns
can include context around the target token. For example, the `AttributeRuler`
can:
- provide exceptions for any token attributes
- map fine-grained tags to coarse-grained tags for languages without statistical
morphologizers (replacing the v2 tag map in the language data)
- map token surface form + fine-grained tags to morphological features
(replacing the v2 morph rules in the language data)
- specify the tags for space tokens (replacing hard-coded behavior in the
tagger)
The following example shows how the tag and POS `NNP`/`PROPN` can be specified
for the phrase `"The Who"`, overriding the tags provided by the statistical
tagger and the POS tag map.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
assert doc1[2].tag_ == "DT"
assert doc1[2].pos_ == "DET"
assert doc1[3].tag_ == "WP"
assert doc1[3].pos_ == "PRON"
# add a new exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# the attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# add rule for "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=0)
# add rule for "Who" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)
doc2 = nlp(text)
assert doc2[2].tag_ == "NNP"
assert doc2[3].tag_ == "NNP"
assert doc2[2].pos_ == "PROPN"
assert doc2[3].pos_ == "PROPN"
# the second "Who" remains unmodified
assert doc2[5].tag_ == "WP"
assert doc2[5].pos_ == "PRON"
```
For easy migration from spaCy v2 to v3, the `AttributeRuler` can import v2
`TAG_MAP` and `MORPH_RULES` data with the methods
[`AttributeRuler.load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`AttributeRuler.load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).
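A minimal sketch of that migration (with `my_project.v2_language_data` standing
in for wherever your v2-style tables currently live):
```python
import spacy
# hypothetical module holding your old v2-style tables
from my_project.v2_language_data import TAG_MAP, MORPH_RULES

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# convert the v2 tables into AttributeRuler patterns
ruler.load_from_tag_map(TAG_MAP)
ruler.load_from_morph_rules(MORPH_RULES)
```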
## Word vectors and semantic similarity {#vectors-similarity}
import Vectors101 from 'usage/101/\_vectors-similarity.md'
@ -1703,7 +1889,7 @@ for word, vector in vector_data.items():
vocab.set_vector(word, vector)
```
## Language Data {#language-data}
import LanguageData101 from 'usage/101/\_language-data.md'

View File

@ -220,20 +220,21 @@ available pipeline components and component functions.
> ruler = nlp.add_pipe("entity_ruler")
> ```
| String name | Component | Description |
| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger`          | [`Tagger`](/api/tagger)                         | Assign part-of-speech tags.                                                                |
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
### Disabling and modifying pipeline components {#disabling}

View File

@ -142,6 +142,7 @@ add to your pipeline and customize for your use case:
> #### Example
>
> ```python
> # pip install spacy-lookups-data
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```
@ -260,7 +261,7 @@ The following methods, attributes and commands are new in spaCy v3.0.
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
@ -396,7 +397,7 @@ on them.
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
## Migrating from v2.x {#migrating}