Merge pull request #5998 from adrianeboyd/docs/morph-usage-v3

Commit d73f7229c0 by Ines Montani, 2020-08-29 17:05:44 +02:00, committed via GitHub
7 changed files with 267 additions and 72 deletions


@ -25,9 +25,10 @@ added to your pipeline, and not a hidden part of the vocab that runs behind the
scenes. This makes it easier to customize how lemmas should be assigned in your
pipeline.

If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS
(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger),
[`Morphologizer`](/api/morphologizer) or another component assigning POS is
available in the pipeline and runs _before_ the lemmatizer.
</Infobox>
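As a rough sketch of that ordering requirement (the components below are added by name but left untrained and uninitialized, so this only illustrates the pipeline layout, not a working model):

```python
import spacy

# A component that assigns POS must run before the rule-based lemmatizer,
# so that Token.pos is set by the time the lemmatizer sees the Doc.
nlp = spacy.blank("en")
nlp.add_pipe("tagger")                               # predicts fine-grained tags
nlp.add_pipe("attribute_ruler")                      # can map tags to coarse-grained POS
nlp.add_pipe("lemmatizer", config={"mode": "rule"})  # consumes Token.pos
print(nlp.pipe_names)  # ['tagger', 'attribute_ruler', 'lemmatizer']
```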


@ -22,15 +22,15 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
> nlp_de = German() # Includes German data
> ```

| Name | Description |
| --- | --- |
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] [`spacy-lookups-data`][spacy-lookups-data] | Custom lemmatizer implementation and lemmatization tables. |

[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@ -44,4 +44,6 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[lemmatizer.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/lemmatizer.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data


@ -1,9 +1,9 @@
When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc`
object. The `Doc` is then processed in several different steps; this is also
referred to as the **processing pipeline**. The pipeline used by the
[default models](/models) typically includes a tagger, a lemmatizer, a parser
and an entity recognizer. Each pipeline component returns the processed `Doc`,
which is then passed on to the next component.
![The processing pipeline](../../images/pipeline.svg)
@ -12,15 +12,19 @@ passed on to the next component.
> - **Creates:** Objects, attributes and properties modified and set by the
>   component.
| Name | Component | Creates | Description |
| --- | --- | --- | --- |
| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |
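To check which of the components above a given pretrained pipeline actually ships with, you can inspect `nlp.pipe_names` after loading it. A quick sketch (the exact component names and order depend on the model and spaCy version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# component names in processing order, e.g. something like
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
print(nlp.pipe_names)
```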
The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer


@ -52,9 +52,9 @@ $ pip install -U spacy
To install additional data tables for lemmatization you can run
`pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately. The lookups package is needed to provide normalization and
lemmatization data for new models and to lemmatize in languages that don't yet
come with pretrained models and aren't powered by third-party libraries.
</Infobox>


@ -3,6 +3,8 @@ title: Linguistic Features
next: /usage/rule-based-matching
menu:
- ['POS Tagging', 'pos-tagging']
- ['Morphology', 'morphology']
- ['Lemmatization', 'lemmatization']
- ['Dependency Parse', 'dependency-parse']
- ['Named Entities', 'named-entities']
- ['Entity Linking', 'entity-linking']
@ -10,7 +12,8 @@ menu:
- ['Merging & Splitting', 'retokenization']
- ['Sentence Segmentation', 'sbd']
- ['Vectors & Similarity', 'vectors-similarity']
- ['Mappings & Exceptions', 'mappings-exceptions']
- ['Language Data', 'language-data']
---

Processing raw text intelligently is difficult: most words are rare, and it's
@ -37,7 +40,7 @@ in the [models directory](/models).
</Infobox>
## Morphology {#morphology}
Inflectional morphology is the process by which a root form of a word is
modified by adding prefixes or suffixes that specify its grammatical function
@ -45,33 +48,147 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:
| Context | Surface | Lemma | POS | Morphological Features |
| --- | --- | --- | --- | --- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features. The attribute `Token.morph_` provides the morphological analysis in
the Universal Dependencies FEATS format.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper.")

token = doc[0]  # "I"
assert token.morph_ == "Case=Nom|Number=Sing|Person=1|PronType=Prs"
assert token.morph.get("PronType") == ["Prs"]
```

### Statistical morphology {#morphologizer new="3" model="morphologizer"}

spaCy v3 includes a statistical morphologizer component that assigns the
morphological features and POS as `Token.morph` and `Token.pos`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # 'Where are you?'
assert doc[2].morph_ == "Case=Nom|Number=Sing|Person=2|PronType=Prs"
assert doc[2].pos_ == "PRON"
```
### Rule-based morphology {#rule-based-morphology}
For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
token text and fine-grained part-of-speech tags to produce coarse-grained
part-of-speech tags and morphological features.
1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
   part-of-speech (e.g. verb) and some amount of morphological information, e.g.
   that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn
   Treebank).
2. For words whose coarse-grained POS is not set by a prior process, a
   [mapping table](#mappings-exceptions) maps the fine-grained tags to a
   coarse-grained POS tag and morphological features.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
assert doc[2].morph_ == "Case=Nom|Person=2|PronType=Prs"
assert doc[2].pos_ == "PRON"
```
## Lemmatization {#lemmatization model="lemmatizer" new="3"}
The [`Lemmatizer`](/api/lemmatizer) is a configurable pipeline component that
provides lookup and rule-based lemmatization methods. An individual language can
extend the `Lemmatizer` as part of its [language data](#language-data).
```python
### {executable="true"}
import spacy
# English models include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
assert lemmatizer.mode == "rule"
doc = nlp("I was reading the paper.")
assert doc[1].lemma_ == "be"
assert doc[2].lemma_ == "read"
```
<Infobox title="Important note" variant="warning">
Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a
tagger is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to
include a `lemmatizer` component. A `lemmatizer` is configured to use a single
mode such as `"lookup"` or `"rule"` on initialization. The `"rule"` mode
requires `Token.pos` to be set by a previous component.
</Infobox>
The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided models already include all the required tables, but if you are
creating new models, you'll probably want to install `spacy-lookups-data` to
provide the data when the lemmatizer is initialized.
### Lookup lemmatizer {#lemmatizer-lookup}
For models without a tagger or morphologizer, a lookup lemmatizer can be added
to the pipeline as long as a lookup table is provided, typically through
`spacy-lookups-data`. The lookup lemmatizer looks up the token surface form in
the lookup table without reference to the token's part-of-speech or context.
```python
# pip install spacy-lookups-data
import spacy
nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
### Rule-based lemmatizer {#lemmatizer-rule}
When training models that include a component that assigns POS (a morphologizer
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based
lemmatizer can be added using rule tables from `spacy-lookups-data`:
```python
# pip install spacy-lookups-data
import spacy
nlp = spacy.blank("de")
# morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```
The rule-based deterministic lemmatizer maps the surface form to a lemma in
light of the previously assigned coarse-grained part-of-speech and morphological
information, without consulting the context of the token. The rule-based
lemmatizer also accepts list-based exception files. For English, these are
acquired from [WordNet](https://wordnet.princeton.edu/).
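Because the rules are keyed by the coarse-grained POS, the same surface form can receive different lemmas depending on how it is tagged. A small sketch (the tags and lemmas in the comments are what a typical English model is expected to predict, so the output may vary):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was meeting her at the meeting.")
for token in doc:
    if token.text == "meeting":
        # expected: "meeting" as VERB -> lemma "meet",
        #           "meeting" as NOUN -> lemma "meeting"
        print(token.text, token.pos_, token.lemma_)
```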
## Dependency Parsing {#dependency-parse model="parser"}
@ -420,7 +537,7 @@ on a token, it will return an empty string.
>
> #### BILUO Scheme
>
> - `B`: Token is the **beginning** of a multi-token entity.
> - `I`: Token is **inside** a multi-token entity.
> - `L`: Token is the **last** token of a multi-token entity.
> - `U`: Token is a single-token **unit** entity.
@ -1615,6 +1732,75 @@ doc = nlp(text)
print("After:", [sent.text for sent in doc.sents]) print("After:", [sent.text for sent in doc.sents])
``` ```
## Mappings & Exceptions {#mappings-exceptions new="3"}
The [`AttributeRuler`](/api/attributeruler) manages rule-based mappings and
exceptions for all token-level attributes. As the number of pipeline components
has grown from spaCy v2 to v3, handling rules and exceptions in each component
individually has become impractical, so the `AttributeRuler` provides a single
component with a unified pattern format for all token attribute mappings and
exceptions.
The `AttributeRuler` uses [`Matcher`
patterns](/usage/rule-based-matching#adding-patterns) to identify tokens and
then assigns them the provided attributes. If needed, the `Matcher` patterns
can include context around the target token. For example, the `AttributeRuler`
can:
- provide exceptions for any token attributes
- map fine-grained tags to coarse-grained tags for languages without statistical
morphologizers (replacing the v2 tag map in the language data)
- map token surface form + fine-grained tags to morphological features
(replacing the v2 morph rules in the language data)
- specify the tags for space tokens (replacing hard-coded behavior in the
tagger)
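For instance, the second use case above (mapping a fine-grained tag to a coarse-grained POS and morphological features) might look like the sketch below. The single `NNS` mapping is purely illustrative, and a component that assigns `Token.tag` would need to run earlier in the pipeline for the pattern to match anything:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# illustrative mapping: tokens tagged NNS become NOUN with Number=Plur
patterns = [[{"TAG": "NNS"}]]
attrs = {"POS": "NOUN", "MORPH": "Number=Plur"}
ruler.add(patterns=patterns, attrs=attrs)
```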
The following example shows how the tag and POS `NNP`/`PROPN` can be specified
for the phrase `"The Who"`, overriding the tags provided by the statistical
tagger and the POS tag map.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
assert doc1[2].tag_ == "DT"
assert doc1[2].pos_ == "DET"
assert doc1[3].tag_ == "WP"
assert doc1[3].pos_ == "PRON"
# add a new exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# the attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# add rule for "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=0)
# add rule for "Who" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)
doc2 = nlp(text)
assert doc2[2].tag_ == "NNP"
assert doc2[3].tag_ == "NNP"
assert doc2[2].pos_ == "PROPN"
assert doc2[3].pos_ == "PROPN"
# the second "Who" remains unmodified
assert doc2[5].tag_ == "WP"
assert doc2[5].pos_ == "PRON"
```
For easy migration from spaCy v2 to v3, the `AttributeRuler` can import v2
`TAG_MAP` and `MORPH_RULES` data with the methods
[`AttributeRuler.load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`AttributeRuler.load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).
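A minimal sketch of the tag map import, using a toy two-entry `tag_map` dict purely for illustration:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# toy v2-style tag map: fine-grained tag -> coarse-grained POS
tag_map = {"NNS": {"POS": "NOUN"}, "VBD": {"POS": "VERB"}}
ruler.load_from_tag_map(tag_map)
```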
## Word vectors and semantic similarity {#vectors-similarity}

import Vectors101 from 'usage/101/\_vectors-similarity.md'
@ -1744,7 +1930,7 @@ for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```
## Language Data {#language-data}
import LanguageData101 from 'usage/101/\_language-data.md'


@ -220,20 +220,21 @@ available pipeline components and component functions.
> ruler = nlp.add_pipe("entity_ruler")
> ```
| String name | Component | Description |
| --- | --- | --- |
| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech tags. |
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
### Disabling, excluding and modifying components {#disabling}


@ -142,6 +142,7 @@ add to your pipeline and customize for your use case:
> #### Example
>
> ```python
> # pip install spacy-lookups-data
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```
@ -263,7 +264,7 @@ The following methods, attributes and commands are new in spaCy v3.0.
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes) [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
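A couple of these can be tried directly; a quick sketch (output depends on your environment):

```python
import spacy
from spacy import util

# names of all spaCy model packages installed in the current environment
print(util.get_installed_models())

# every nlp object carries the config it was created from
nlp = spacy.blank("en")
print(nlp.config["nlp"]["lang"])  # "en"
```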
@ -399,7 +400,7 @@ on them.
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
## Migrating from v2.x {#migrating}