mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 04:08:09 +03:00
Update usage docs for lemmatization and morphology
parent
e1e1760fd6
commit
f9ed31a757
|
@ -25,9 +25,10 @@ added to your pipeline, and not a hidden part of the vocab that runs behind the
|
|||
scenes. This makes it easier to customize how lemmas should be assigned in your
|
||||
pipeline.
|
||||
|
||||
If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to
|
||||
be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning
|
||||
tags is available in the pipeline and runs _before_ the lemmatizer.
|
||||
If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS
|
||||
(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger),
|
||||
[`Morphologizer`](/api/morphologizer) or another component assigning POS is
|
||||
available in the pipeline and runs _before_ the lemmatizer.
|
||||
|
||||
</Infobox>
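
The ordering requirement above can be checked mechanically. The following is a minimal sketch in plain Python, assuming a list of component names such as the one exposed by `nlp.pipe_names`; the helper `pos_assigned_before_lemmatizer` and its default set of POS-assigning components are hypothetical and not part of spaCy.

```python
def pos_assigned_before_lemmatizer(pipe_names, pos_components=("tagger", "morphologizer")):
    """Return True if a POS-assigning component runs before the lemmatizer.

    `pos_components` is an illustrative default; extend it with the names of
    any custom components that assign POS in your pipeline.
    """
    if "lemmatizer" not in pipe_names:
        return True  # no lemmatizer, so there is nothing to check
    lemmatizer_at = pipe_names.index("lemmatizer")
    # Only components that run *before* the lemmatizer count
    return any(name in pos_components for name in pipe_names[:lemmatizer_at])


print(pos_assigned_before_lemmatizer(["tagger", "lemmatizer", "parser"]))
print(pos_assigned_before_lemmatizer(["lemmatizer", "tagger"]))
```

The first call reports a valid order, the second one does not, because the tagger only runs after the lemmatizer.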
|
||||
|
||||
|
|
|
@ -22,15 +22,15 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
|
|||
> nlp_de = German() # Includes German data
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
|
||||
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
|
||||
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
|
||||
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
|
||||
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
|
||||
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
|
||||
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
|
||||
| Name | Description |
|
||||
| ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
|
||||
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
|
||||
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
|
||||
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
|
||||
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
|
||||
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
|
||||
| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] [`spacy-lookups-data`][spacy-lookups-data] | Custom lemmatizer implementation and lemmatization tables. |
|
||||
|
||||
[stop_words.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||
|
@ -44,4 +44,6 @@ values are defined in the [`Language.Defaults`](/api/language#defaults).
|
|||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
|
||||
[syntax_iterators.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
||||
[lemmatizer.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/lemmatizer.py
|
||||
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
|
||||
|
|
|
@ -1,9 +1,9 @@
|
|||
When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc`
|
||||
object. The `Doc` is then processed in several different steps – this is also
|
||||
referred to as the **processing pipeline**. The pipeline used by the
|
||||
[default models](/models) consists of a tagger, a parser and an entity
|
||||
recognizer. Each pipeline component returns the processed `Doc`, which is then
|
||||
passed on to the next component.
|
||||
[default models](/models) typically include a tagger, a lemmatizer, a parser and
|
||||
an entity recognizer. Each pipeline component returns the processed `Doc`, which
|
||||
is then passed on to the next component.
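
The idea that each component receives the `Doc` and hands it on to the next one can be illustrated without spaCy. This is a toy sketch only: the "doc" is a plain dictionary, and `tokenizer`, `toy_tagger`, `toy_lemmatizer` and `run_pipeline` are hypothetical stand-ins rather than spaCy functions.

```python
def tokenizer(text):
    # The tokenizer is the only step that takes raw text and creates the "doc"
    return {"text": text, "tokens": text.split()}


def toy_tagger(doc):
    # Stand-in for a tagger: marks digit-only tokens as NUM, everything else as X
    doc["tags"] = ["NUM" if token.isdigit() else "X" for token in doc["tokens"]]
    return doc


def toy_lemmatizer(doc):
    # Stand-in for a lemmatizer: simply lowercases each token
    doc["lemmas"] = [token.lower() for token in doc["tokens"]]
    return doc


def run_pipeline(text, components):
    doc = tokenizer(text)
    for component in components:
        # Each component returns the processed doc, which is passed to the next
        doc = component(doc)
    return doc


doc = run_pipeline("I read 2 papers", [toy_tagger, toy_lemmatizer])
print(doc["tags"])
print(doc["lemmas"])
```

Because every component has the same "doc in, doc out" shape, components can be reordered, removed or replaced independently.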
|
||||
|
||||
![The processing pipeline](../../images/pipeline.svg)
|
||||
|
||||
|
@ -12,15 +12,19 @@ passed on to the next component.
|
|||
> - **Creates:** Objects, attributes and properties modified and set by the
|
||||
> component.
|
||||
|
||||
| Name | Component | Creates | Description |
|
||||
| -------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ |
|
||||
| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
|
||||
| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
|
||||
| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
|
||||
| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
|
||||
| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
|
||||
| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
|
||||
| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |
|
||||
| Name | Component | Creates | Description |
|
||||
| -------------- | ------------------------------------------- | --------------------------------------------------------- | -------------------------------- |
|
||||
| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. |
|
||||
| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. |
|
||||
| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
|
||||
| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. |
|
||||
| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. |
|
||||
| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. |
|
||||
| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. |
|
||||
|
||||
The processing pipeline always **depends on the statistical model** and its
|
||||
capabilities. For example, a pipeline can only include an entity recognizer
|
||||
|
|
|
@ -52,9 +52,9 @@ $ pip install -U spacy
|
|||
To install additional data tables for lemmatization you can run
|
||||
`pip install spacy[lookups]` or install
|
||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
|
||||
separately. The lookups package is needed to create blank models with
|
||||
lemmatization data, and to lemmatize in languages that don't yet come with
|
||||
pretrained models and aren't powered by third-party libraries.
|
||||
separately. The lookups package is needed to provide normalization and
|
||||
lemmatization data for new models and to lemmatize in languages that don't yet
|
||||
come with pretrained models and aren't powered by third-party libraries.
|
||||
|
||||
</Infobox>
|
||||
|
||||
|
|
|
@ -3,6 +3,8 @@ title: Linguistic Features
|
|||
next: /usage/rule-based-matching
|
||||
menu:
|
||||
- ['POS Tagging', 'pos-tagging']
|
||||
- ['Morphology', 'morphology']
|
||||
- ['Lemmatization', 'lemmatization']
|
||||
- ['Dependency Parse', 'dependency-parse']
|
||||
- ['Named Entities', 'named-entities']
|
||||
- ['Entity Linking', 'entity-linking']
|
||||
|
@ -10,7 +12,8 @@ menu:
|
|||
- ['Merging & Splitting', 'retokenization']
|
||||
- ['Sentence Segmentation', 'sbd']
|
||||
- ['Vectors & Similarity', 'vectors-similarity']
|
||||
- ['Language data', 'language-data']
|
||||
- ['Mappings & Exceptions', 'mappings-exceptions']
|
||||
- ['Language Data', 'language-data']
|
||||
---
|
||||
|
||||
Processing raw text intelligently is difficult: most words are rare, and it's
|
||||
|
@ -37,7 +40,7 @@ in the [models directory](/models).
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Rule-based morphology {#rule-based-morphology}
|
||||
## Morphology {#morphology}
|
||||
|
||||
Inflectional morphology is the process by which a root form of a word is
|
||||
modified by adding prefixes or suffixes that specify its grammatical function
|
||||
|
@ -45,33 +48,147 @@ but do not changes its part-of-speech. We say that a **lemma** (root form) is
|
|||
**inflected** (modified/combined) with one or more **morphological features** to
|
||||
create a surface form. Here are some examples:
|
||||
|
||||
| Context | Surface | Lemma | POS | Morphological Features |
|
||||
| ---------------------------------------- | ------- | ----- | ---- | ---------------------------------------- |
|
||||
| I was reading the paper | reading | read | verb | `VerbForm=Ger` |
|
||||
| I don't watch the news, I read the paper | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
|
||||
| I read the paper yesterday | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
|
||||
| Context | Surface | Lemma | POS | Morphological Features |
|
||||
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
|
||||
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
|
||||
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
|
||||
| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
|
||||
|
||||
English has a relatively simple morphological system, which spaCy handles using
|
||||
rules that can be keyed by the token, the part-of-speech tag, or the combination
|
||||
of the two. The system works as follows:
|
||||
Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
|
||||
under `Token.morph`, which allows you to access individual morphological
|
||||
features. The attribute `Token.morph_` provides the morphological analysis in
|
||||
the Universal Dependencies FEATS format.
|
||||
|
||||
1. The tokenizer consults a
|
||||
[mapping table](/usage/adding-languages#tokenizer-exceptions)
|
||||
`TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to
|
||||
multiple tokens. Each token may be assigned a part of speech and one or more
|
||||
morphological features.
|
||||
2. The part-of-speech tagger then assigns each token an **extended POS tag**. In
|
||||
the API, these tags are known as `Token.tag`. They express the part-of-speech
|
||||
(e.g. `VERB`) and some amount of morphological information, e.g. that the
|
||||
verb is past tense.
|
||||
3. For words whose POS is not set by a prior process, a
|
||||
[mapping table](/usage/adding-languages#tag-map) `TAG_MAP` maps the tags to a
|
||||
part-of-speech and a set of morphological features.
|
||||
4. Finally, a **rule-based deterministic lemmatizer** maps the surface form, to
|
||||
a lemma in light of the previously assigned extended part-of-speech and
|
||||
morphological information, without consulting the context of the token. The
|
||||
lemmatizer also accepts list-based exception files, acquired from
|
||||
[WordNet](https://wordnet.princeton.edu/).
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
doc = nlp("I was reading the paper.")
|
||||
|
||||
token = doc[0] # "I"
|
||||
assert token.morph_ == "Case=Nom|Number=Sing|Person=1|PronType=Prs"
|
||||
assert token.morph.get("PronType") == ["Prs"]
|
||||
```
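
To make the FEATS format more concrete, here is a small sketch that parses such a string into a dictionary by hand. It is plain Python, not spaCy's `MorphAnalysis` implementation; the helper name `parse_feats` is hypothetical, and it assumes the Universal Dependencies conventions that features are separated by `|`, multiple values by `,`, and that `_` denotes an empty analysis.

```python
def parse_feats(feats):
    """Parse a UD FEATS string like "Case=Nom|Number=Sing" into a dict of lists."""
    if not feats or feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A feature can carry several comma-separated values
        result[name] = values.split(",")
    return result


feats = parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs")
print(feats["PronType"])  # ['Prs']
```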
|
||||
|
||||
### Statistical morphology {#morphologizer new="3" model="morphologizer"}
|
||||
|
||||
spaCy v3 includes a statistical morphologizer component that assigns the
|
||||
morphological features and POS as `Token.morph` and `Token.pos`.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
nlp = spacy.load("de_core_news_sm")
|
||||
doc = nlp("Wo bist du?") # 'Where are you?'
|
||||
assert doc[2].morph_ == "Case=Nom|Number=Sing|Person=2|PronType=Prs"
|
||||
assert doc[2].pos_ == "PRON"
|
||||
```
|
||||
|
||||
### Rule-based morphology {#rule-based-morphology}
|
||||
|
||||
For languages with relatively simple morphological systems like English, spaCy
|
||||
can assign morphological features through a rule-based approach, which uses the
|
||||
token text and fine-grained part-of-speech tags to produce coarse-grained
|
||||
part-of-speech tags and morphological features.
|
||||
|
||||
1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
|
||||
tag**. In the API, these tags are known as `Token.tag`. They express the
part-of-speech (e.g. verb) and some amount of morphological information, such
as tense (e.g. `VBD` for a past-tense verb in the Penn Treebank).
|
||||
2. For words whose coarse-grained POS is not set by a prior process, a
|
||||
[mapping table](#mappings-exceptions) maps the fine-grained tags to
coarse-grained POS tags and morphological features.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
doc = nlp("Where are you?")
|
||||
assert doc[2].morph_ == "Case=Nom|Person=2|PronType=Prs"
|
||||
assert doc[2].pos_ == "PRON"
|
||||
```
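
The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not spaCy's implementation: tokens are plain dictionaries, and the heavily abridged `TAG_MAP` and the helper `apply_tag_map` are hypothetical.

```python
# Hypothetical, abridged mapping: fine-grained tag -> (coarse-grained POS, features)
TAG_MAP = {
    "VBD": ("VERB", "Tense=Past|VerbForm=Fin"),
    "VBZ": ("VERB", "Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"),
    "NNS": ("NOUN", "Number=Plur"),
}


def apply_tag_map(token):
    """Fill in "pos" and "morph" from the fine-grained tag, unless POS is already set."""
    if token.get("pos"):
        # POS was set by a prior process, so the mapping table is not consulted
        return token
    pos, morph = TAG_MAP.get(token["tag"], ("X", ""))
    return {**token, "pos": pos, "morph": morph}


print(apply_tag_map({"text": "read", "tag": "VBD"}))
print(apply_tag_map({"text": "Who", "tag": "WP", "pos": "PROPN"}))
```

The second token keeps its existing POS because the mapping only applies to tokens without one.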
|
||||
|
||||
## Lemmatization {#lemmatization model="lemmatizer" new="3"}
|
||||
|
||||
The [`Lemmatizer`](/api/lemmatizer) is a configurable pipeline component that
provides lookup and rule-based lemmatization methods. An individual language can
extend the `Lemmatizer` as part of its [language data](#language-data).
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
# English models include a rule-based lemmatizer
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
lemmatizer = nlp.get_pipe("lemmatizer")
|
||||
assert lemmatizer.mode == "rule"
|
||||
|
||||
doc = nlp("I was reading the paper.")
|
||||
assert doc[1].lemma_ == "be"
|
||||
assert doc[2].lemma_ == "read"
|
||||
```
|
||||
|
||||
<Infobox title="Important note" variant="warning">
|
||||
|
||||
Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch
|
||||
automatically between lookup and rule-based lemmas depending on whether a
|
||||
tagger is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to
|
||||
include a `lemmatizer` component. A `lemmatizer` is configured to use a single
|
||||
mode such as `"lookup"` or `"rule"` on initialization. The `"rule"` mode
|
||||
requires `Token.pos` to be set by a previous component.
|
||||
|
||||
</Infobox>
|
||||
|
||||
The data for spaCy's lemmatizers is distributed in the package
|
||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
|
||||
provided models already include all the required tables, but if you are
|
||||
creating new models, you'll probably want to install `spacy-lookups-data` to
|
||||
provide the data when the lemmatizer is initialized.
|
||||
|
||||
### Lookup lemmatizer {#lemmatizer-lookup}
|
||||
|
||||
For models without a tagger or morphologizer, a lookup lemmatizer can be added
|
||||
to the pipeline as long as a lookup table is provided, typically through
|
||||
`spacy-lookups-data`. The lookup lemmatizer looks up the token surface form in
|
||||
the lookup table without reference to the token's part-of-speech or context.
|
||||
|
||||
```python
|
||||
# pip install spacy-lookups-data
|
||||
import spacy
|
||||
|
||||
nlp = spacy.blank("sv")
|
||||
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
|
||||
```
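
Conceptually, a lookup lemmatizer is little more than a dictionary access. The sketch below is plain Python with a tiny made-up table; `LOOKUP_TABLE` and `lookup_lemma` are hypothetical names, and falling back to the surface form for unknown words is an assumption of this sketch.

```python
# Tiny illustrative table: surface form -> lemma
LOOKUP_TABLE = {"was": "be", "reading": "read", "papers": "paper"}


def lookup_lemma(text, table=LOOKUP_TABLE):
    # No part-of-speech or context is consulted, only the surface form.
    # Unknown words fall back to the surface form (assumption of this sketch).
    return table.get(text, text)


print([lookup_lemma(word) for word in ["I", "was", "reading", "papers"]])
```

Because only the surface form is used, "reading" maps to the same lemma whether it occurs as a verb or as a noun.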
|
||||
|
||||
### Rule-based lemmatizer {#lemmatizer-rule}
|
||||
|
||||
When training models that include a component that assigns POS (a morphologizer
|
||||
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based
|
||||
lemmatizer can be added using rule tables from `spacy-lookups-data`:
|
||||
|
||||
```python
|
||||
# pip install spacy-lookups-data
|
||||
import spacy
|
||||
|
||||
nlp = spacy.blank("de")
|
||||
|
||||
# morphologizer (note: model is not yet trained!)
|
||||
nlp.add_pipe("morphologizer")
|
||||
|
||||
# rule-based lemmatizer
|
||||
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
|
||||
```
|
||||
|
||||
The rule-based deterministic lemmatizer maps the surface form to a lemma in
|
||||
light of the previously assigned coarse-grained part-of-speech and morphological
|
||||
information, without consulting the context of the token. The rule-based
|
||||
lemmatizer also accepts list-based exception files. For English, these are
|
||||
acquired from [WordNet](https://wordnet.princeton.edu/).
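
As a rough illustration of this process, the sketch below combines an exception list, suffix rules and an index of known lemmas, all keyed by coarse-grained POS. It is a toy version in plain Python, not spaCy's lemmatizer: the tables are made up, and the names `EXCEPTIONS`, `RULES`, `INDEX` and `rule_lemma` are hypothetical.

```python
# Made-up tables for illustration, keyed by coarse-grained POS
EXCEPTIONS = {"VERB": {"was": "be", "went": "go"}}
RULES = {
    "VERB": [("ing", ""), ("ing", "e"), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}
INDEX = {"VERB": {"read", "make", "walk"}, "NOUN": {"paper", "city"}}


def rule_lemma(text, pos):
    """Return a lemma for `text` given its coarse-grained POS."""
    string = text.lower()
    exceptions = EXCEPTIONS.get(pos, {})
    if string in exceptions:
        # Irregular forms are resolved by the exception list first
        return exceptions[string]
    known = INDEX.get(pos, set())
    for old, new in RULES.get(pos, []):
        if string.endswith(old):
            candidate = string[: len(string) - len(old)] + new
            # Only accept candidates that are known lemmas for this POS
            if candidate in known:
                return candidate
    return string  # no rule applied: keep the (lowercased) surface form


print(rule_lemma("reading", "VERB"))  # read
print(rule_lemma("reading", "NOUN"))  # reading
```

The same surface form yields different lemmas depending on the POS, which is why the POS has to be assigned before the lemmatizer runs.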
|
||||
|
||||
## Dependency Parsing {#dependency-parse model="parser"}
|
||||
|
||||
|
@ -420,7 +537,7 @@ on a token, it will return an empty string.
|
|||
>
|
||||
> #### BILUO Scheme
|
||||
>
|
||||
> - `B` – Token is the **beginning** of an entity.
|
||||
> - `B` – Token is the **beginning** of a multi-token entity.
|
||||
> - `I` – Token is **inside** a multi-token entity.
|
||||
> - `L` – Token is the **last** token of a multi-token entity.
|
||||
> - `U` – Token is a single-token **unit** entity.
|
||||
|
@ -1574,6 +1691,75 @@ doc = nlp(text)
|
|||
print("After:", [sent.text for sent in doc.sents])
|
||||
```
|
||||
|
||||
## Mappings & Exceptions {#mappings-exceptions new="3"}
|
||||
|
||||
The [`AttributeRuler`](/api/attributeruler) manages rule-based mappings and
|
||||
exceptions for all token-level attributes. As the number of pipeline components
|
||||
has grown from spaCy v2 to v3, handling rules and exceptions in each component
|
||||
individually has become impractical, so the `AttributeRuler` provides a single
|
||||
component with a unified pattern format for all token attribute mappings and
|
||||
exceptions.
|
||||
|
||||
The `AttributeRuler` uses [`Matcher`
|
||||
patterns](/usage/rule-based-matching#adding-patterns) to identify tokens and
|
||||
then assigns them the provided attributes. If needed, the `Matcher` patterns
|
||||
can include context around the target token. For example, the `AttributeRuler`
|
||||
can:
|
||||
|
||||
- provide exceptions for any token attributes
|
||||
- map fine-grained tags to coarse-grained tags for languages without statistical
|
||||
morphologizers (replacing the v2 tag map in the language data)
|
||||
- map token surface form + fine-grained tags to morphological features
|
||||
(replacing the v2 morph rules in the language data)
|
||||
- specify the tags for space tokens (replacing hard-coded behavior in the
|
||||
tagger)
|
||||
|
||||
The following example shows how the tag and POS `NNP`/`PROPN` can be specified
|
||||
for the phrase `"The Who"`, overriding the tags provided by the statistical
|
||||
tagger and the POS tag map.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
text = "I saw The Who perform. Who did you see?"
|
||||
|
||||
doc1 = nlp(text)
|
||||
assert doc1[2].tag_ == "DT"
|
||||
assert doc1[2].pos_ == "DET"
|
||||
assert doc1[3].tag_ == "WP"
|
||||
assert doc1[3].pos_ == "PRON"
|
||||
|
||||
# add a new exception for "The Who" as NNP/PROPN NNP/PROPN
|
||||
ruler = nlp.get_pipe("attribute_ruler")
|
||||
|
||||
# pattern to match "The Who"
|
||||
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
|
||||
# the attributes to assign to the matched token
|
||||
attrs = {"TAG": "NNP", "POS": "PROPN"}
|
||||
|
||||
# add rule for "The" in "The Who"
|
||||
ruler.add(patterns=patterns, attrs=attrs, index=0)
|
||||
# add rule for "Who" in "The Who"
|
||||
ruler.add(patterns=patterns, attrs=attrs, index=1)
|
||||
|
||||
doc2 = nlp(text)
|
||||
assert doc2[2].tag_ == "NNP"
|
||||
assert doc2[3].tag_ == "NNP"
|
||||
assert doc2[2].pos_ == "PROPN"
|
||||
assert doc2[3].pos_ == "PROPN"
|
||||
|
||||
# the second "Who" remains unmodified
|
||||
assert doc2[6].tag_ == "WP"
assert doc2[6].pos_ == "PRON"
|
||||
```
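
The role of the `index` argument may be easier to see in a stripped-down model. The sketch below is plain Python and not the `AttributeRuler` or `Matcher` implementation: tokens are dictionaries, only the `TEXT` and `LOWER` pattern keys are supported, and the helpers `token_matches` and `apply_rule` are hypothetical.

```python
def token_matches(token, spec):
    """Check one token against one pattern dict (only TEXT and LOWER supported)."""
    for key, value in spec.items():
        if key == "TEXT":
            if token["text"] != value:
                return False
        elif key == "LOWER":
            if token["text"].lower() != value:
                return False
        else:
            raise ValueError(f"Unsupported pattern key in this sketch: {key}")
    return True


def apply_rule(tokens, pattern, attrs, index=0):
    """Assign `attrs` to the token at position `index` within every match."""
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start : start + size]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            tokens[start + index].update(attrs)


words = ["I", "saw", "The", "Who", "perform", ".", "Who", "did", "you", "see", "?"]
tokens = [{"text": word} for word in words]
pattern = [{"LOWER": "the"}, {"TEXT": "Who"}]
attrs = {"tag": "NNP", "pos": "PROPN"}

apply_rule(tokens, pattern, attrs, index=0)  # "The" in "The Who"
apply_rule(tokens, pattern, attrs, index=1)  # "Who" in "The Who"
print(tokens[2], tokens[3], tokens[6])
```

The pattern supplies the context ("The" followed by "Who"), while `index` picks the single token inside the match that receives the attributes. The second "Who" is not preceded by "the", so it is left untouched.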
|
||||
|
||||
For easy migration from spaCy v2 to v3, the `AttributeRuler` can import v2
|
||||
`TAG_MAP` and `MORPH_RULES` data with the methods
|
||||
[`AttributeRuler.load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
|
||||
[`AttributeRuler.load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).
|
||||
|
||||
## Word vectors and semantic similarity {#vectors-similarity}
|
||||
|
||||
import Vectors101 from 'usage/101/\_vectors-similarity.md'
|
||||
|
@ -1703,7 +1889,7 @@ for word, vector in vector_data.items():
|
|||
vocab.set_vector(word, vector)
|
||||
```
|
||||
|
||||
## Language data {#language-data}
|
||||
## Language Data {#language-data}
|
||||
|
||||
import LanguageData101 from 'usage/101/\_language-data.md'
|
||||
|
||||
|
|
|
@ -220,20 +220,21 @@ available pipeline components and component functions.
|
|||
> ruler = nlp.add_pipe("entity_ruler")
|
||||
> ```
|
||||
|
||||
| String name | Component | Description |
|
||||
| --------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. |
|
||||
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
|
||||
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
|
||||
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
|
||||
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
|
||||
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
|
||||
| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
|
||||
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
|
||||
| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
|
||||
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
|
||||
| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
|
||||
| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
|
||||
| String name | Component | Description |
|
||||
| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. |
|
||||
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
|
||||
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
|
||||
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
|
||||
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
|
||||
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
|
||||
| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
|
||||
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
|
||||
| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
|
||||
| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
|
||||
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
|
||||
| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
|
||||
| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
|
||||
|
||||
### Disabling and modifying pipeline components {#disabling}
|
||||
|
||||
|
|
|
@ -142,6 +142,7 @@ add to your pipeline and customize for your use case:
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> # pip install spacy-lookups-data
|
||||
> nlp = spacy.blank("en")
|
||||
> nlp.add_pipe("lemmatizer")
|
||||
> ```
|
||||
|
@ -260,7 +261,7 @@ The following methods, attributes and commands are new in spaCy v3.0.
|
|||
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
|
||||
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
|
||||
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
|
||||
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
|
||||
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
|
||||
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
|
||||
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
|
||||
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
|
||||
|
@ -396,7 +397,7 @@ on them.
|
|||
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
|
||||
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
|
||||
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
|
||||
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
|
||||
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
|
||||
|
||||
## Migrating from v2.x {#migrating}
|
||||
|
||||
|
|