Update usage docs for lemmatization and morphology

2025-08-08 22:24:55 +03:00 · 2020-08-29 15:56:50 +02:00 · 2020-08-29 15:56:50 +02:00 · f9ed31a757
commit f9ed31a757
parent e1e1760fd6
7 changed files with 267 additions and 72 deletions
--- a/website/docs/api/lemmatizer.md
+++ b/website/docs/api/lemmatizer.md
@@ -25,9 +25,10 @@ added to your pipeline, and not a hidden part of the vocab that runs behind the
 scenes. This makes it easier to customize how lemmas should be assigned in your
 pipeline.

-If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to
-be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning
-tags is available in the pipeline and runs _before_ the lemmatizer.
+If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS
+(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger),
+[`Morphologizer`](/api/morphologizer) or another component assigning POS is
+available in the pipeline and runs _before_ the lemmatizer.

 </Infobox>
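
Review note: the ordering requirement in this infobox can be checked mechanically. Below is a minimal sketch in plain Python, not spaCy code. The helper `lemmatizer_has_pos_source` and the set `POS_ASSIGNERS` are hypothetical names, and the set's contents are an assumption taken from the components named in the paragraph above.

```python
# Hypothetical helper, not part of spaCy: checks that a POS-assigning
# component appears before "lemmatizer" in a list of component names.
POS_ASSIGNERS = {"tagger", "morphologizer"}  # assumed from the text above


def lemmatizer_has_pos_source(pipe_names):
    """Return True if some POS-assigning component precedes the lemmatizer."""
    if "lemmatizer" not in pipe_names:
        return True  # no lemmatizer, so there is nothing to check
    before = pipe_names[: pipe_names.index("lemmatizer")]
    return any(name in POS_ASSIGNERS for name in before)


ok_order = lemmatizer_has_pos_source(["tok2vec", "tagger", "lemmatizer"])
bad_order = lemmatizer_has_pos_source(["lemmatizer", "tagger"])
```

With a loaded pipeline, the list of names would typically come from `nlp.pipe_names`.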

--- a/website/docs/usage/101/_language-data.md
+++ b/website/docs/usage/101/_language-data.md
@@ -22,15 +22,15 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
 > nlp_de = German()  # Includes German data
 > ```

-| Name                                                                               | Description                                                                                                                                              |
-| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Stop words**<br />[`stop_words.py`][stop_words.py]                               | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
-| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".                            |
-| **Punctuation rules**<br />[`punctuation.py`][punctuation.py]                      | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.       |
-| **Character classes**<br />[`char_classes.py`][char_classes.py]                    | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons.                                            |
-| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py]                         | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred".              |
-| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py]             | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).  |
-| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data]                     | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".                                              |
+| Name                                                                                            | Description                                                                                                                                              |
+| ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Stop words**<br />[`stop_words.py`][stop_words.py]                                            | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
+| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py]              | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".                            |
+| **Punctuation rules**<br />[`punctuation.py`][punctuation.py]                                   | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.       |
+| **Character classes**<br />[`char_classes.py`][char_classes.py]                                 | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.                                            |
+| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py]                                      | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred".              |
+| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py]                          | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).  |
+| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] [`spacy-lookups-data`][spacy-lookups-data] | Custom lemmatizer implementation and lemmatization tables.                                                                                               |

 [stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -44,4 +44,6 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
 [syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
+[lemmatizer.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/lemmatizer.py
 [spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -1,9 +1,9 @@
 When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc`
 object. The `Doc` is then processed in several different steps – this is also
 referred to as the **processing pipeline**. The pipeline used by the
-[default models](/models) consists of a tagger, a parser and an entity
-recognizer. Each pipeline component returns the processed `Doc`, which is then
-passed on to the next component.
+[default models](/models) typically includes a tagger, a lemmatizer, a parser and
+an entity recognizer. Each pipeline component returns the processed `Doc`, which
+is then passed on to the next component.

 ![The processing pipeline](../../images/pipeline.svg)
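
Review note: the idea that each component receives the `Doc` and returns it for the next component can be illustrated without spaCy. The sketch below uses invented stand-ins (`ToyDoc`, `toy_tagger`, `toy_lemmatizer`, `run_pipeline`). It only mirrors the control flow described above and makes no claim about how spaCy implements its components.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Stand-in for a Doc: a token list plus per-token annotation layers."""
    tokens: list
    annotations: dict = field(default_factory=dict)


def toy_tokenizer(text):
    # Whitespace splitting only; real tokenization is far more involved.
    return ToyDoc(tokens=text.split())


def toy_tagger(doc):
    # Placeholder "tags": mark capitalized tokens, purely for illustration.
    doc.annotations["tag"] = ["CAP" if t[:1].isupper() else "LOW" for t in doc.tokens]
    return doc


def toy_lemmatizer(doc):
    # Placeholder "lemmas": lowercase the token text.
    doc.annotations["lemma"] = [t.lower() for t in doc.tokens]
    return doc


def run_pipeline(text, components):
    """Tokenize first, then pass the doc through each component in order."""
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("Hello world", [toy_tagger, toy_lemmatizer])
```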

@@ -12,15 +12,19 @@ passed on to the next component.
 > - **Creates:** Objects, attributes and properties modified and set by the
 >   component.

-| Name           | Component                                                          | Creates                                                   | Description                                      |
-| -------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ |
-| **tokenizer**  | [`Tokenizer`](/api/tokenizer)                                      | `Doc`                                                     | Segment text into tokens.                        |
-| **tagger**     | [`Tagger`](/api/tagger)                                            | `Token.tag`                                               | Assign part-of-speech tags.                      |
-| **parser**     | [`DependencyParser`](/api/dependencyparser)                        | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.                        |
-| **ner**        | [`EntityRecognizer`](/api/entityrecognizer)                        | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             | Detect and label named entities.                 |
-| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer)                                    | `Token.lemma`                                             | Assign base forms.                               |
-| **textcat**    | [`TextCategorizer`](/api/textcategorizer)                          | `Doc.cats`                                                | Assign document labels.                          |
-| **custom**     | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                  | Assign custom attributes, methods or properties. |
+| Name           | Component                                   | Creates                                                   | Description                      |
+| -------------- | ------------------------------------------- | --------------------------------------------------------- | -------------------------------- |
+| **tokenizer**  | [`Tokenizer`](/api/tokenizer)               | `Doc`                                                     | Segment text into tokens.        |
+| **tagger**     | [`Tagger`](/api/tagger)                     | `Token.tag`                                               | Assign part-of-speech tags.      |
+| **parser**     | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.        |
+| **ner**        | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             | Detect and label named entities. |
+| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer)             | `Token.lemma`                                             | Assign base forms.               |
+| **textcat**    | [`TextCategorizer`](/api/textcategorizer)   | `Doc.cats`                                                | Assign document labels.          |
+| **custom**     | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |

 The processing pipeline always **depends on the statistical model** and its
 capabilities. For example, a pipeline can only include an entity recognizer
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -52,9 +52,9 @@ $ pip install -U spacy
 To install additional data tables for lemmatization you can run
 `pip install spacy[lookups]` or install
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
-separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+separately. The lookups package is needed to provide normalization and
+lemmatization data for new models and to lemmatize in languages that don't yet
+come with pretrained models and aren't powered by third-party libraries.

 </Infobox>

--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -3,6 +3,8 @@ title: Linguistic Features
 next: /usage/rule-based-matching
 menu:
  - ['POS Tagging', 'pos-tagging']
+  - ['Morphology', 'morphology']
+  - ['Lemmatization', 'lemmatization']
  - ['Dependency Parse', 'dependency-parse']
  - ['Named Entities', 'named-entities']
  - ['Entity Linking', 'entity-linking']
@@ -10,7 +12,8 @@ menu:
  - ['Merging & Splitting', 'retokenization']
  - ['Sentence Segmentation', 'sbd']
  - ['Vectors & Similarity', 'vectors-similarity']
-  - ['Language data', 'language-data']
+  - ['Mappings & Exceptions', 'mappings-exceptions']
+  - ['Language Data', 'language-data']
 ---

 Processing raw text intelligently is difficult: most words are rare, and it's
@@ -37,7 +40,7 @@ in the [models directory](/models).

 </Infobox>

-### Rule-based morphology {#rule-based-morphology}
+## Morphology {#morphology}

 Inflectional morphology is the process by which a root form of a word is
 modified by adding prefixes or suffixes that specify its grammatical function
@@ -45,33 +48,147 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
 **inflected** (modified/combined) with one or more **morphological features** to
 create a surface form. Here are some examples:

-| Context                                  | Surface | Lemma | POS  |  Morphological Features                  |
-| ---------------------------------------- | ------- | ----- | ---- | ---------------------------------------- |
-| I was reading the paper                  | reading | read  | verb | `VerbForm=Ger`                           |
-| I don't watch the news, I read the paper | read    | read  | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
-| I read the paper yesterday               | read    | read  | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
+| Context                                  | Surface | Lemma | POS    |  Morphological Features                  |
+| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
+| I was reading the paper                  | reading | read  | `VERB` | `VerbForm=Ger`                           |
+| I don't watch the news, I read the paper | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
+| I read the paper yesterday               | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |

-English has a relatively simple morphological system, which spaCy handles using
-rules that can be keyed by the token, the part-of-speech tag, or the combination
-of the two. The system works as follows:
+Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
+under `Token.morph`, which allows you to access individual morphological
+features. The attribute `Token.morph_` provides the morphological analysis in
+the Universal Dependencies FEATS format.

-1. The tokenizer consults a
-   [mapping table](/usage/adding-languages#tokenizer-exceptions)
-   `TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to
-   multiple tokens. Each token may be assigned a part of speech and one or more
-   morphological features.
-2. The part-of-speech tagger then assigns each token an **extended POS tag**. In
-   the API, these tags are known as `Token.tag`. They express the part-of-speech
-   (e.g. `VERB`) and some amount of morphological information, e.g. that the
-   verb is past tense.
-3. For words whose POS is not set by a prior process, a
-   [mapping table](/usage/adding-languages#tag-map) `TAG_MAP` maps the tags to a
-   part-of-speech and a set of morphological features.
-4. Finally, a **rule-based deterministic lemmatizer** maps the surface form, to
-   a lemma in light of the previously assigned extended part-of-speech and
-   morphological information, without consulting the context of the token. The
-   lemmatizer also accepts list-based exception files, acquired from
-   [WordNet](https://wordnet.princeton.edu/).
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("I was reading the paper.")
+
+token = doc[0] # "I"
+assert token.morph_ == "Case=Nom|Number=Sing|Person=1|PronType=Prs"
+assert token.morph.get("PronType") == ["Prs"]
+```
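
Review note: for readers unfamiliar with the FEATS string format used by `Token.morph_`, here is a minimal stdlib sketch of how such a string breaks down. `parse_feats` is a hypothetical helper, not a spaCy function; in spaCy itself the `MorphAnalysis` object referenced above is the supported way to read features. The sketch assumes the CoNLL-U conventions: features separated by `|`, multiple values for one feature separated by `,`, and `_` for an empty analysis.

```python
def parse_feats(feats):
    """Parse a FEATS string such as "Case=Nom|Number=Sing" into a dict of lists."""
    if feats in ("", "_"):
        return {}  # empty analysis
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = values.split(",")  # a feature may carry several values
    return parsed


feats = parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs")
```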
+
+### Statistical morphology {#morphologizer new="3" model="morphologizer"}
+
+spaCy v3 includes a statistical morphologizer component that assigns the
+morphological features and POS as `Token.morph` and `Token.pos`.
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("de_core_news_sm")
+doc = nlp("Wo bist du?") # 'Where are you?'
+assert doc[2].morph_ == "Case=Nom|Number=Sing|Person=2|PronType=Prs"
+assert doc[2].pos_ == "PRON"
+```
+
+### Rule-based morphology {#rule-based-morphology}
+
+For languages with relatively simple morphological systems like English, spaCy
+can assign morphological features through a rule-based approach, which uses the
+token text and fine-grained part-of-speech tags to produce coarse-grained
+part-of-speech tags and morphological features.
+
+1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
+   tag**. In the API, these tags are known as `Token.tag`. They express the
+   part-of-speech (e.g. verb) and some amount of morphological information,
+   such as tense. For example, the Penn Treebank tag `VBD` marks a verb in the
+   past tense.
+2. For words whose coarse-grained POS is not set by a prior process, a
+   [mapping table](#mappings-exceptions) maps the fine-grained tags to
+   coarse-grained POS tags and morphological features.
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("Where are you?")
+assert doc[2].morph_ == "Case=Nom|Person=2|PronType=Prs"
+assert doc[2].pos_ == "PRON"
+```
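
Review note: the two-step process described in this section (fine-grained tag first, then a mapping to coarse-grained POS and features for tokens that have no POS yet) can be sketched in plain Python. The table `TOY_TAG_MAP` and the function `fill_pos` are invented for illustration, and the two entries are assumptions rather than spaCy's actual mapping data.

```python
# Hypothetical, heavily abbreviated mapping from fine-grained tags to
# (coarse-grained POS, morphological features). Not spaCy's real table.
TOY_TAG_MAP = {
    "VBD": ("VERB", "Tense=Past|VerbForm=Fin"),
    "NNS": ("NOUN", "Number=Plur"),
}


def fill_pos(tokens):
    """Set "pos" and "morph" from the tag, but only where "pos" is not yet set."""
    for token in tokens:
        if token.get("pos"):
            continue  # POS was set by a prior process, so leave it alone
        pos, morph = TOY_TAG_MAP.get(token["tag"], ("X", ""))
        token["pos"] = pos
        token["morph"] = morph
    return tokens


tagged = fill_pos([
    {"text": "read", "tag": "VBD"},
    {"text": "papers", "tag": "NNS"},
    {"text": "Who", "tag": "NNP", "pos": "PROPN"},  # already has a POS
])
```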
+
+## Lemmatization {#lemmatization model="lemmatizer" new="3"}
+
+The [`Lemmatizer`](/api/lemmatizer) is a configurable pipeline component that
+provides lookup and rule-based lemmatization methods. An individual language
+can extend the `Lemmatizer` as part of its
+[language data](#language-data).
+
+```python
+### {executable="true"}
+import spacy
+
+# English models include a rule-based lemmatizer
+nlp = spacy.load("en_core_web_sm")
+lemmatizer = nlp.get_pipe("lemmatizer")
+assert lemmatizer.mode == "rule"
+
+doc = nlp("I was reading the paper.")
+assert doc[1].lemma_ == "be"
+assert doc[2].lemma_ == "read"
+```
+
+<Infobox title="Important note" variant="warning">
+
+Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch
+automatically between lookup and rule-based lemmas depending on whether a
+tagger is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to
+include a `lemmatizer` component. A `lemmatizer` is configured to use a single
+mode such as `"lookup"` or `"rule"` on initialization. The `"rule"` mode
+requires `Token.pos` to be set by a previous component.
+
+</Infobox>
+
+The data for spaCy's lemmatizers is distributed in the package
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+provided models already include all the required tables, but if you are
+creating new models, you'll probably want to install `spacy-lookups-data` to
+provide the data when the lemmatizer is initialized.
+
+### Lookup lemmatizer {#lemmatizer-lookup}
+
+For models without a tagger or morphologizer, a lookup lemmatizer can be added
+to the pipeline as long as a lookup table is provided, typically through
+`spacy-lookups-data`. The lookup lemmatizer looks up the token surface form in
+the lookup table without reference to the token's part-of-speech or context.
+
+```python
+# pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.blank("sv")
+nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
+```
+
+### Rule-based lemmatizer {#lemmatizer-rule}
+
+When training models that include a component that assigns POS (a morphologizer
+or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based
+lemmatizer can be added using rule tables from `spacy-lookups-data`:
+
+```python
+# pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.blank("de")
+
+# morphologizer (note: model is not yet trained!)
+nlp.add_pipe("morphologizer")
+
+# rule-based lemmatizer
+nlp.add_pipe("lemmatizer", config={"mode": "rule"})
+```
+
+The rule-based deterministic lemmatizer maps the surface form to a lemma in
+light of the previously assigned coarse-grained part-of-speech and morphological
+information, without consulting the context of the token. The rule-based
+lemmatizer also accepts list-based exception files. For English, these are
+acquired from [WordNet](https://wordnet.princeton.edu/).

 ## Dependency Parsing {#dependency-parse model="parser"}

@@ -420,7 +537,7 @@ on a token, it will return an empty string.
 >
 > #### BILUO Scheme
 >
-> - `B` – Token is the **beginning** of an entity.
+> - `B` – Token is the **beginning** of a multi-token entity.
 > - `I` – Token is **inside** a multi-token entity.
 > - `L` – Token is the **last** token of a multi-token entity.
 > - `U` – Token is a single-token **unit** entity.
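
Review note: to make the multi-token versus single-token distinction concrete, here is a small stdlib sketch that derives BILUO tags from token-level entity spans. `biluo_tags` is a hypothetical helper written for this note, with tokens outside any entity tagged `O`. It is not the conversion utility spaCy ships.

```python
def biluo_tags(n_tokens, entities):
    """Build BILUO tags from (start, end, label) token spans, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of a multi-token entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"L-{label}"  # last token
    return tags


words = ["I", "saw", "The", "Rolling", "Stones", "in", "London"]
tags = biluo_tags(len(words), [(2, 5, "ORG"), (6, 7, "GPE")])
```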
@@ -1574,6 +1691,75 @@ doc = nlp(text)
 print("After:", [sent.text for sent in doc.sents])
 ```

+## Mappings & Exceptions {#mappings-exceptions new="3"}
+
+The [`AttributeRuler`](/api/attributeruler) manages rule-based mappings and
+exceptions for all token-level attributes. As the number of pipeline components
+has grown from spaCy v2 to v3, handling rules and exceptions in each component
+individually has become impractical, so the `AttributeRuler` provides a single
+component with a unified pattern format for all token attribute mappings and
+exceptions.
+
+The `AttributeRuler` uses [`Matcher`
+patterns](/usage/rule-based-matching#adding-patterns) to identify tokens and
+then assigns them the provided attributes. If needed, the `Matcher` patterns
+can include context around the target token. For example, the `AttributeRuler`
+can:
+
+- provide exceptions for any token attributes
+- map fine-grained tags to coarse-grained tags for languages without statistical
+  morphologizers (replacing the v2 tag map in the language data)
+- map token surface form + fine-grained tags to morphological features
+  (replacing the v2 morph rules in the language data)
+- specify the tags for space tokens (replacing hard-coded behavior in the
+  tagger)
+
+The following example shows how the tag and POS `NNP`/`PROPN` can be specified
+for the phrase `"The Who"`, overriding the tags provided by the statistical
+tagger and the POS tag map.
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+text = "I saw The Who perform. Who did you see?"
+
+doc1 = nlp(text)
+assert doc1[2].tag_ == "DT"
+assert doc1[2].pos_ == "DET"
+assert doc1[3].tag_ == "WP"
+assert doc1[3].pos_ == "PRON"
+
+# add a new exception for "The Who" as NNP/PROPN NNP/PROPN
+ruler = nlp.get_pipe("attribute_ruler")
+
+# pattern to match "The Who"
+patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
+# the attributes to assign to the matched token
+attrs = {"TAG": "NNP", "POS": "PROPN"}
+
+# add rule for "The" in "The Who"
+ruler.add(patterns=patterns, attrs=attrs, index=0)
+# add rule for "Who" in "The Who"
+ruler.add(patterns=patterns, attrs=attrs, index=1)
+
+doc2 = nlp(text)
+assert doc2[2].tag_ == "NNP"
+assert doc2[3].tag_ == "NNP"
+assert doc2[2].pos_ == "PROPN"
+assert doc2[3].pos_ == "PROPN"
+
+# the second "Who" remains unmodified
+assert doc2[5].tag_ == "WP"
+assert doc2[5].pos_ == "PRON"
+```
+
+For easy migration from spaCy v2 to v3, the `AttributeRuler` can import v2
+`TAG_MAP` and `MORPH_RULES` data with the methods
+[`AttributeRuler.load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
+[`AttributeRuler.load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).
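
Review note: the role of `index` in the "The Who" example can be shown with a stripped-down matcher in plain Python. This sketch is not the `AttributeRuler` or `Matcher` implementation. The functions `match_at` and `apply_rule` are invented, and only the `LOWER` and `TEXT` pattern keys are handled.

```python
def match_at(tokens, pattern, start):
    """Return True if the token pattern matches the tokens beginning at start."""
    if start + len(pattern) > len(tokens):
        return False
    for token, spec in zip(tokens[start:], pattern):
        for key, value in spec.items():
            if key == "LOWER":
                actual = token["text"].lower()
            elif key == "TEXT":
                actual = token["text"]
            else:
                raise ValueError(f"unsupported pattern key in this sketch: {key}")
            if actual != value:
                return False
    return True


def apply_rule(tokens, pattern, attrs, index=0):
    """Assign attrs to the token at position index within every match."""
    for start in range(len(tokens)):
        if match_at(tokens, pattern, start):
            tokens[start + index].update(attrs)
    return tokens


toks = [{"text": t} for t in "I saw The Who perform . Who did you see ?".split()]
pattern = [{"LOWER": "the"}, {"TEXT": "Who"}]
attrs = {"TAG": "NNP", "POS": "PROPN"}
apply_rule(toks, pattern, attrs, index=0)  # "The" in "The Who"
apply_rule(toks, pattern, attrs, index=1)  # "Who" in "The Who"
```

As in the example above, the pattern supplies the context and `index` selects which matched token receives the attributes, so the standalone "Who" later in the sentence is left untouched.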
+
 ## Word vectors and semantic similarity {#vectors-similarity}

 import Vectors101 from 'usage/101/\_vectors-similarity.md'
@@ -1703,7 +1889,7 @@ for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
 ```

-## Language data {#language-data}
+## Language Data {#language-data}

 import LanguageData101 from 'usage/101/\_language-data.md'

--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -220,20 +220,21 @@ available pipeline components and component functions.
 > ruler = nlp.add_pipe("entity_ruler")
 > ```

-| String name     | Component                                       | Description                                                                               |
-| --------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
-| `tagger`        | [`Tagger`](/api/tagger)                         | Assign part-of-speech-tags.                                                               |
-| `parser`        | [`DependencyParser`](/api/dependencyparser)     | Assign dependency labels.                                                                 |
-| `ner`           | [`EntityRecognizer`](/api/entityrecognizer)     | Assign named entities.                                                                    |
-| `entity_linker` | [`EntityLinker`](/api/entitylinker)             | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
-| `entity_ruler`  | [`EntityRuler`](/api/entityruler)               | Assign named entities based on pattern rules and dictionaries.                            |
-| `textcat`       | [`TextCategorizer`](/api/textcategorizer)       | Assign text categories.                                                                   |
-| `lemmatizer`    | [`Lemmatizer`](/api/lemmatizer)                 | Assign base forms to words.                                                               |
-| `morphologizer` | [`Morphologizer`](/api/morphologizer)           | Assign morphological features and coarse-grained POS tags.                                |
-| `senter`        | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries.                                                               |
-| `sentencizer`   | [`Sentencizer`](/api/sentencizer)               | Add rule-based sentence segmentation without the dependency parse.                        |
-| `tok2vec`       | [`Tok2Vec`](/api/tok2vec)                       | Assign token-to-vector embeddings.                                                        |
-| `transformer`   | [`Transformer`](/api/transformer)               | Assign the tokens and outputs of a transformer model.                                     |
+| String name       | Component                                       | Description                                                                               |
+| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
+| `tagger`          | [`Tagger`](/api/tagger)                         | Assign part-of-speech tags.                                                               |
+| `parser`          | [`DependencyParser`](/api/dependencyparser)     | Assign dependency labels.                                                                 |
+| `ner`             | [`EntityRecognizer`](/api/entityrecognizer)     | Assign named entities.                                                                    |
+| `entity_linker`   | [`EntityLinker`](/api/entitylinker)             | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
+| `entity_ruler`    | [`EntityRuler`](/api/entityruler)               | Assign named entities based on pattern rules and dictionaries.                            |
+| `textcat`         | [`TextCategorizer`](/api/textcategorizer)       | Assign text categories.                                                                   |
+| `lemmatizer`      | [`Lemmatizer`](/api/lemmatizer)                 | Assign base forms to words.                                                               |
+| `morphologizer`   | [`Morphologizer`](/api/morphologizer)           | Assign morphological features and coarse-grained POS tags.                                |
+| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler)         | Assign token attribute mappings and rule-based exceptions.                                |
+| `senter`          | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries.                                                               |
+| `sentencizer`     | [`Sentencizer`](/api/sentencizer)               | Add rule-based sentence segmentation without the dependency parse.                        |
+| `tok2vec`         | [`Tok2Vec`](/api/tok2vec)                       | Assign token-to-vector embeddings.                                                        |
+| `transformer`     | [`Transformer`](/api/transformer)               | Assign the tokens and outputs of a transformer model.                                     |

 ### Disabling and modifying pipeline components {#disabling}

--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -142,6 +142,7 @@ add to your pipeline and customize for your use case:
 > #### Example
 >
 > ```python
+> # pip install spacy-lookups-data
 > nlp = spacy.blank("en")
 > nlp.add_pipe("lemmatizer")
 > ```
@@ -260,7 +261,7 @@ The following methods, attributes and commands are new in spaCy v3.0.
 | [`Language.has_factory`](/api/language#has_factory)                                                                           | Check whether a component factory is registered on a language class.                                                                                                                             |
 | [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_factory_meta)      | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name.                                                                                       |
 | [`Language.config`](/api/language#config)                                                                                     | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
-| [`Pipe.score`](/api/pipe#score)                                                                                               | Method on trainable pipeline components that returns a dictionary of evaluation scores.                                                                                                          |
+| [`Pipe.score`](/api/pipe#score)                                                                                               | Method on pipeline components that returns a dictionary of evaluation scores.                                                                                                                    |
 | [`registry`](/api/top-level#registry)                                                                                         | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config).                                                                                  |
 | [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config)                       | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config).                                                                        |
 | [`util.get_installed_models`](/api/top-level#util.get_installed_models)                                                       | Names of all models installed in the environment.                                                                                                                                                |
@@ -396,7 +397,7 @@ on them.
 | keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
 | `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
 | `verbose` argument on [`Language.evaluate`](/api/language#evaluate)                                                     | logging (`DEBUG`)                                                                                                                                          |
-| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
+| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

 ## Migrating from v2.x {#migrating}