From f9ed31a757f15e1ef48c9b4d8950f1fc799cb98e Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Sat, 29 Aug 2020 15:56:50 +0200 Subject: [PATCH] Update usage docs for lemmatization and morphology --- website/docs/api/lemmatizer.md | 7 +- website/docs/usage/101/_language-data.md | 20 +- website/docs/usage/101/_pipelines.md | 28 ++- website/docs/usage/index.md | 6 +- website/docs/usage/linguistic-features.md | 244 ++++++++++++++++++--- website/docs/usage/processing-pipelines.md | 29 +-- website/docs/usage/v3.md | 5 +- 7 files changed, 267 insertions(+), 72 deletions(-) diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md index 8417fd5e8..45a8736db 100644 --- a/website/docs/api/lemmatizer.md +++ b/website/docs/api/lemmatizer.md @@ -25,9 +25,10 @@ added to your pipeline, and not a hidden part of the vocab that runs behind the scenes. This makes it easier to customize how lemmas should be assigned in your pipeline. -If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to -be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning -tags is available in the pipeline and runs _before_ the lemmatizer. +If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS +(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger), +[`Morphologizer`](/api/morphologizer) or another component assigning POS is +available in the pipeline and runs _before_ the lemmatizer. diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md index 8c3cd48a3..f1fa1f3a2 100644 --- a/website/docs/usage/101/_language-data.md +++ b/website/docs/usage/101/_language-data.md @@ -22,15 +22,15 @@ values are defined in the [`Language.Defaults`](/api/language#defaults). > nlp_de = German() # Includes German data > ``` -| Name | Description | -| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Stop words**
[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | -| **Tokenizer exceptions**
[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". | -| **Punctuation rules**
[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. | -| **Character classes**
[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. | -| **Lexical attributes**
[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". | -| **Syntax iterators**
[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). | -| **Lemmatizer**
[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". | +| Name | Description | +| ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Stop words**
[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | +| **Tokenizer exceptions**
[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". | +| **Punctuation rules**
[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. | +| **Character classes**
[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. | +| **Lexical attributes**
[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". | +| **Syntax iterators**
[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). | +| **Lemmatizer**
[`lemmatizer.py`][lemmatizer.py] [`spacy-lookups-data`][spacy-lookups-data] | Custom lemmatizer implementation and lemmatization tables. |

[stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -44,4 +44,6 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
+[lemmatizer.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/lemmatizer.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md
index f85978d99..a0971076f 100644
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -1,9 +1,9 @@
When you call `nlp` on a text, spaCy first tokenizes the text to produce a
`Doc` object. The `Doc` is then processed in several different steps – this is
also referred to as the **processing pipeline**. The pipeline used by the
-[default models](/models) consists of a tagger, a parser and an entity
-recognizer. Each pipeline component returns the processed `Doc`, which is then
-passed on to the next component.
+[default models](/models) typically includes a tagger, a lemmatizer, a parser and
+an entity recognizer. Each pipeline component returns the processed `Doc`, which
+is then passed on to the next component.

![The processing pipeline](../../images/pipeline.svg)

@@ -12,15 +12,19 @@ passed on to the next component.
> - **Creates:** Objects, attributes and properties modified and set by the
>   component.

-| Name | Component | Creates | Description |
-| -------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ |
-| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
-| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
-| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
-| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
-| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
-| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
-| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |
+| Name | Component | Creates | Description |
+| -------------- | ------------------------------------------- | --------------------------------------------------------- | -------------------------------- |
+| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
+| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
+| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
+| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
+| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
+| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
+| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |

The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index ede4ab6f9..76858213c 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -52,9 +52,9 @@ $ pip install -U spacy
To install additional data tables for lemmatization you can run
`pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
-separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+separately. The lookups package is needed to provide normalization and
+lemmatization data for new models and to lemmatize in languages that don't yet
+come with pretrained models and aren't powered by third-party libraries.

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 5c5198308..cd0d8e1e4 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -3,6 +3,8 @@ title: Linguistic Features
next: /usage/rule-based-matching
menu:
  - ['POS Tagging', 'pos-tagging']
+  - ['Morphology', 'morphology']
+  - ['Lemmatization', 'lemmatization']
  - ['Dependency Parse', 'dependency-parse']
  - ['Named Entities', 'named-entities']
  - ['Entity Linking', 'entity-linking']
@@ -10,7 +12,8 @@ menu:
  - ['Merging & Splitting', 'retokenization']
  - ['Sentence Segmentation', 'sbd']
  - ['Vectors & Similarity', 'vectors-similarity']
-  - ['Language data', 'language-data']
+  - ['Mappings & Exceptions', 'mappings-exceptions']
+  - ['Language Data', 'language-data']
---

Processing raw text intelligently is difficult: most words are rare, and it's
@@ -37,7 +40,7 @@ in the [models directory](/models).

-### Rule-based morphology {#rule-based-morphology}
+## Morphology {#morphology}

Inflectional morphology is the process by which a root form of a word is
modified by adding prefixes or suffixes that specify its grammatical function
@@ -45,33 +48,147 @@ but do not changes its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features**
to create a surface form.
Here are some examples: -| Context | Surface | Lemma | POS |  Morphological Features | -| ---------------------------------------- | ------- | ----- | ---- | ---------------------------------------- | -| I was reading the paper | reading | read | verb | `VerbForm=Ger` | -| I don't watch the news, I read the paper | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` | -| I read the paper yesterday | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` | +| Context | Surface | Lemma | POS |  Morphological Features | +| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- | +| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` | +| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` | +| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` | -English has a relatively simple morphological system, which spaCy handles using -rules that can be keyed by the token, the part-of-speech tag, or the combination -of the two. The system works as follows: +Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis) +under `Token.morph`, which allows you to access individual morphological +features. The attribute `Token.morph_` provides the morphological analysis in +the Universal Dependencies FEATS format. -1. The tokenizer consults a - [mapping table](/usage/adding-languages#tokenizer-exceptions) - `TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to - multiple tokens. Each token may be assigned a part of speech and one or more - morphological features. -2. The part-of-speech tagger then assigns each token an **extended POS tag**. In - the API, these tags are known as `Token.tag`. They express the part-of-speech - (e.g. `VERB`) and some amount of morphological information, e.g. that the - verb is past tense. -3. For words whose POS is not set by a prior process, a - [mapping table](/usage/adding-languages#tag-map) `TAG_MAP` maps the tags to a - part-of-speech and a set of morphological features. -4. Finally, a **rule-based deterministic lemmatizer** maps the surface form, to - a lemma in light of the previously assigned extended part-of-speech and - morphological information, without consulting the context of the token. The - lemmatizer also accepts list-based exception files, acquired from - [WordNet](https://wordnet.princeton.edu/). +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +doc = nlp("I was reading the paper.") + +token = doc[0] # "I" +assert token.morph_ == "Case=Nom|Number=Sing|Person=1|PronType=Prs" +assert token.morph.get("PronType") == ["Prs"] +``` + +### Statistical morphology {#morphologizer new="3" model="morphologizer"} + +spaCy v3 includes a statistical morphologizer component that assigns the +morphological features and POS as `Token.morph` and `Token.pos`. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("de_core_news_sm") +doc = nlp("Wo bist du?") # 'Where are you?' 
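+# doc[2] is the pronoun "du"; the morphologizer predicts both Token.morph and Token.pos for it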
+assert doc[2].morph_ == "Case=Nom|Number=Sing|Person=2|PronType=Prs" +assert doc[2].pos_ == "PRON" +``` + +### Rule-based morphology {#rule-based-morphology} + +For languages with relatively simple morphological systems like English, spaCy +can assign morphological features through a rule-based approach, which uses the +token text and fine-grained part-of-speech tags to produce coarse-grained +part-of-speech tags and morphological features. + +1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech + tag**. In the API, these tags are known as `Token.tag`. They express the + part-of-speech (e.g. verb) and some amount of morphological information, e.g. + that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn + Treebank) . +2. For words whose coarse-grained POS is not set by a prior process, a + [mapping table](#mapping-exceptions) maps the fine-grained tags to a + coarse-grained POS tags and morphological features. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +doc = nlp("Where are you?") +assert doc[2].morph_ == "Case=Nom|Person=2|PronType=Prs" +assert doc[2].pos_ == "PRON" +``` + +## Lemmatization {#lemmatization model="lemmatizer" new="3"} + +The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup +and rule-based lemmatization methods in a configurable component. An individual +language can extend the `Lemmatizer` as part of its [language +data](#language-data). + +```python +### {executable="true"} +import spacy + +# English models include a rule-based lemmatizer +nlp = spacy.load("en_core_web_sm") +lemmatizer = nlp.get_pipe("lemmatizer") +assert lemmatizer.mode == "rule" + +doc = nlp("I was reading the paper.") +assert doc[1].lemma_ == "be" +assert doc[2].lemma_ == "read" +``` + + + +Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch +automatically between lookup and rule-based lemmas depending on whether a +tagger is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to +include a `lemmatizer` component. A `lemmatizer` is configured to use a single +mode such as `"lookup"` or `"rule"` on initialization. The `"rule"` mode +requires `Token.pos` to be set by a previous component. + + + +The data for spaCy's lemmatizers is distributed in the package +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The +provided models already include all the required tables, but if you are +creating new models, you'll probably want to install `spacy-lookups-data` to +provide the data when the lemmatizer is initialized. + +### Lookup lemmatizer {#lemmatizer-lookup} + +For models without a tagger or morphologizer, a lookup lemmatizer can be added +to the pipeline as long as a lookup table is provided, typically through +`spacy-lookups-data`. The lookup lemmatizer looks up the token surface form in +the lookup table without reference to the token's part-of-speech or context. + +```python +# pip install spacy-lookups-data +import spacy + +nlp = spacy.blank("sv") +nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) +``` + +### Rule-based lemmatizer {#lemmatizer-rule} + +When training models that include a component that assigns POS (a morphologizer +or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based +lemmatizer can be added using rule tables from `spacy-lookups-data`: + +```python +# pip install spacy-lookups-data +import spacy + +nlp = spacy.blank("de") + +# morphologizer (note: model is not yet trained!) 
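+# the morphologizer assigns the coarse-grained Token.pos that the "rule" lemmatizer mode requires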
+nlp.add_pipe("morphologizer") + +# rule-based lemmatizer +nlp.add_pipe("lemmatizer", config={"mode": "rule"}) +``` + +The rule-based deterministic lemmatizer maps the surface form to a lemma in +light of the previously assigned coarse-grained part-of-speech and morphological +information, without consulting the context of the token. The rule-based +lemmatizer also accepts list-based exception files. For English, these are +acquired from [WordNet](https://wordnet.princeton.edu/). ## Dependency Parsing {#dependency-parse model="parser"} @@ -420,7 +537,7 @@ on a token, it will return an empty string. > > #### BILUO Scheme > -> - `B` – Token is the **beginning** of an entity. +> - `B` – Token is the **beginning** of a multi-token entity. > - `I` – Token is **inside** a multi-token entity. > - `L` – Token is the **last** token of a multi-token entity. > - `U` – Token is a single-token **unit** entity. @@ -1574,6 +1691,75 @@ doc = nlp(text) print("After:", [sent.text for sent in doc.sents]) ``` +## Mappings & Exceptions {#mappings-exceptions new="3"} + +The [`AttributeRuler`](/api/attributeruler) manages rule-based mappings and +exceptions for all token-level attributes. As the number of pipeline components +has grown from spaCy v2 to v3, handling rules and exceptions in each component +individually has become impractical, so the `AttributeRuler` provides a single +component with a unified pattern format for all token attribute mappings and +exceptions. + +The `AttributeRuler` uses [`Matcher` +patterns](/usage/rule-based-matching#adding-patterns) to identify tokens and +then assigns them the provided attributes. If needed, the `Matcher` patterns +can include context around the target token. For example, the `AttributeRuler` +can: + +- provide exceptions for any token attributes +- map fine-grained tags to coarse-grained tags for languages without statistical + morphologizers (replacing the v2 tag map in the language data) +- map token surface form + fine-grained tags to morphological features + (replacing the v2 morph rules in the language data) +- specify the tags for space tokens (replacing hard-coded behavior in the + tagger) + +The following example shows how the tag and POS `NNP`/`PROPN` can be specified +for the phrase `"The Who"`, overriding the tags provided by the statistical +tagger and the POS tag map. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +text = "I saw The Who perform. Who did you see?" 
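+# first check the statistical tagger's predictions for "The Who", before any exception is added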
+
+doc1 = nlp(text)
+assert doc1[2].tag_ == "DT"
+assert doc1[2].pos_ == "DET"
+assert doc1[3].tag_ == "WP"
+assert doc1[3].pos_ == "PRON"
+
+# add a new exception for "The Who" as NNP/PROPN NNP/PROPN
+ruler = nlp.get_pipe("attribute_ruler")
+
+# pattern to match "The Who"
+patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
+# the attributes to assign to the matched token
+attrs = {"TAG": "NNP", "POS": "PROPN"}
+
+# add rule for "The" in "The Who"
+ruler.add(patterns=patterns, attrs=attrs, index=0)
+# add rule for "Who" in "The Who"
+ruler.add(patterns=patterns, attrs=attrs, index=1)
+
+doc2 = nlp(text)
+assert doc2[2].tag_ == "NNP"
+assert doc2[3].tag_ == "NNP"
+assert doc2[2].pos_ == "PROPN"
+assert doc2[3].pos_ == "PROPN"
+
+# the second "Who" remains unmodified
+assert doc2[5].tag_ == "WP"
+assert doc2[5].pos_ == "PRON"
+```
+
+For easy migration from spaCy v2 to v3, the `AttributeRuler` can import v2
+`TAG_MAP` and `MORPH_RULES` data with the methods
+[`AttributeRuler.load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
+[`AttributeRuler.load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).

## Word vectors and semantic similarity {#vectors-similarity}

import Vectors101 from 'usage/101/\_vectors-similarity.md'

@@ -1703,7 +1889,7 @@ for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```

-## Language data {#language-data}
+## Language Data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 614f113b3..2eaec8503 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -220,20 +220,21 @@ available pipeline components and component functions.
> ruler = nlp.add_pipe("entity_ruler")
> ```

-| String name | Component | Description |
-| --------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
-| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech tags. |
-| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
-| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
-| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
-| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
-| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
-| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
-| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
-| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
-| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
-| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
-| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
+| String name | Component | Description |
+| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
+| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech tags.
| +| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. | +| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | +| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | +| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. | +| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | +| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. | +| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. | +| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. | +| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. | +| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. | +| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. | +| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. | ### Disabling and modifying pipeline components {#disabling} diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index bf0c13b68..65f81d066 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -142,6 +142,7 @@ add to your pipeline and customize for your use case: > #### Example > > ```python +> # pip install spacy-lookups-data > nlp = spacy.blank("en") > nlp.add_pipe("lemmatizer") > ``` @@ -260,7 +261,7 @@ The following methods, attributes and commands are new in spaCy v3.0. | [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class.s | | [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. | | [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. | -| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. | +| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. | | [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). | | [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). | | [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. | @@ -396,7 +397,7 @@ on them. 
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` | | `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` | | `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) | -| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) | +| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) | ## Migrating from v2.x {#migrating}
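
As a quick illustration of two rows from the replacement table above, the following is a minimal
sketch of what migrated v3 calls could look like. It assumes a trained pipeline such as
`en_core_web_sm` is installed, and the output path is only an example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# v2 used boolean keyword arguments like vocab=False on to_disk/from_disk;
# v3 uses exclude=[...] instead
nlp.to_disk("/tmp/example_pipeline", exclude=["vocab"])

# v2 used n_threads on nlp.pipe; v3 replaces it with n_process
texts = ["I was reading the paper.", "I read the paper yesterday."]
docs = list(nlp.pipe(texts, n_process=1))
print([token.lemma_ for token in docs[0]])
```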