mirror of https://github.com/explosion/spaCy.git
synced 2024-09-21 11:29:13 +03:00

Update docs [ci skip]

This commit is contained in: parent d73f7229c0, commit 9b86312bab
@@ -12,7 +12,8 @@ The attribute ruler lets you set token attributes for tokens identified by

[`Matcher` patterns](/usage/rule-based-matching#matcher). The attribute ruler is
typically used to handle exceptions for token attributes and to map values
between attributes such as mapping fine-grained POS tags to coarse-grained POS
tags. See the [usage guide](/usage/linguistic-features/#mappings-exceptions) for
examples.
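The tag-to-POS mapping can be sketched with a toy pipeline (the pattern, tag and example text below are illustrative, not taken from spaCy's models):

```python
import spacy

# Blank pipeline: only a tokenizer, plus an attribute ruler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# Exception: assign the matched token a fine-grained tag and a coarse POS
ruler.add(patterns=[[{"TEXT": "run"}]], attrs={"TAG": "VB", "POS": "VERB"})
doc = nlp("I run daily")
print(doc[1].tag_, doc[1].pos_)  # VB VERB
```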

## Config and implementation {#config}
@@ -12,19 +12,16 @@ is then passed on to the next component.

> - **Creates:** Objects, attributes and properties modified and set by the
>   component.

| Name                  | Component                                                          | Creates                                                   | Description                                      |
| --------------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ |
| **tokenizer**         | [`Tokenizer`](/api/tokenizer)                                      | `Doc`                                                     | Segment text into tokens.                        |
| _processing pipeline_ |                                                                    |                                                           |                                                  |
| **tagger**            | [`Tagger`](/api/tagger)                                            | `Token.tag`                                               | Assign part-of-speech tags.                      |
| **parser**            | [`DependencyParser`](/api/dependencyparser)                        | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.                        |
| **ner**               | [`EntityRecognizer`](/api/entityrecognizer)                        | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             | Detect and label named entities.                 |
| **lemmatizer**        | [`Lemmatizer`](/api/lemmatizer)                                    | `Token.lemma`                                             | Assign base forms.                               |
| **textcat**           | [`TextCategorizer`](/api/textcategorizer)                          | `Doc.cats`                                                | Assign document labels.                          |
| **custom**            | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                  | Assign custom attributes, methods or properties. |
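How components end up in the pipeline can be sketched with a blank pipeline and a rule-based component (the choice of `sentencizer` here is illustrative, since it needs no trained model):

```python
import spacy

nlp = spacy.blank("en")      # blank pipeline: only the tokenizer
print(nlp.pipe_names)        # []
nlp.add_pipe("sentencizer")  # rule-based, works without a trained model
doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names)        # ['sentencizer']
print(len(list(doc.sents)))  # 2
```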

The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer
@@ -57,41 +57,50 @@ create a surface form. Here are some examples:

Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features. The attribute `Token.morph_` provides the morphological analysis in
the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
format.

> #### 📝 Things to try
>
> 1. Change "I" to "She". You should see that the morphological features change
>    and express that it's a pronoun in the third person.
> 2. Inspect `token.morph_` for the other tokens.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph_)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
```

### Statistical morphology {#morphologizer new="3" model="morphologizer"}

spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
morphological features and coarse-grained part-of-speech tags as `Token.morph`
and `Token.pos`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # English: 'Where are you?'
print(doc[2].morph_)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```

### Rule-based morphology {#rule-based-morphology}

For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
**token text** and **fine-grained part-of-speech tags** to produce
coarse-grained part-of-speech tags and morphological features.

1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
@@ -108,16 +117,16 @@ import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph_)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```

## Lemmatization {#lemmatization model="lemmatizer" new="3"}

The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup
and rule-based lemmatization methods in a configurable component. An individual
language can extend the `Lemmatizer` as part of its
[language data](#language-data).

```python
### {executable="true"}
@@ -126,36 +135,38 @@ import spacy

# English models include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
```

<Infobox title="Changed in v3.0" variant="warning">

Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
configured to use a single mode such as `"lookup"` or `"rule"` on
initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
component.

</Infobox>

The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided models already include all the required tables, but if you are creating
new models, you'll probably want to install `spacy-lookups-data` to provide the
data when the lemmatizer is initialized.

### Lookup lemmatizer {#lemmatizer-lookup}

For models without a tagger or morphologizer, a lookup lemmatizer can be added
to the pipeline as long as a lookup table is provided, typically through
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token's part-of-speech or context.

```python
# pip install spacy-lookups-data
@@ -168,19 +179,18 @@ nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

### Rule-based lemmatizer {#lemmatizer-rule}

When training models that include a component that assigns POS (a morphologizer
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based lemmatizer
can be added using rule tables from
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```
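Since the mode is fixed when the component is created, it can be inspected before any training or initialization happens. A small sketch, assuming no lookup tables have been loaded yet:

```python
import spacy

nlp = spacy.blank("de")
# The mode is set via the config when the component is created
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
print(lemmatizer.mode)  # 'rule'
```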
@@ -1734,25 +1744,26 @@ print("After:", [sent.text for sent in doc.sents])

## Mappings & Exceptions {#mappings-exceptions new="3"}

The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and
exceptions** for all token-level attributes. As the number of
[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the `AttributeRuler` provides a single component with a unified
pattern format for all token attribute mappings and exceptions.

The `AttributeRuler` uses
[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify
tokens and then assigns them the provided attributes. If needed, the
[`Matcher`](/api/matcher) patterns can include context around the target token.
For example, the attribute ruler can:

- provide exceptions for any **token attributes**
- map **fine-grained tags** to **coarse-grained tags** for languages without
  statistical morphologizers (replacing the v2.x `tag_map` in the
  [language data](#language-data))
- map token **surface form + fine-grained tags** to **morphological features**
  (replacing the v2.x `morph_rules` in the [language data](#language-data))
- specify the **tags for space tokens** (replacing hard-coded behavior in the
  tagger)

The following example shows how the tag and POS `NNP`/`PROPN` can be specified
@@ -1765,41 +1776,42 @@ import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"

doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" remains unmodified
print(doc2[6].tag_, doc2[6].pos_)  # WP PRON
```

<Infobox variant="warning" title="Migrating from spaCy v2.x">

For easy migration from spaCy v2 to v3, the
[`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules**
in the v2 format with the methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).

```diff
nlp = spacy.blank("en")
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```

</Infobox>
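The migration path can be sketched with a toy two-entry tag map (the tag map contents and the manual `ruler(doc)` call are illustrative; in a full pipeline a tagger would set `Token.tag` before the attribute ruler runs):

```python
import spacy

# Toy v2-style tag map: fine-grained tag -> attributes to assign
TAG_MAP = {"NN": {"POS": "NOUN"}, "VB": {"POS": "VERB"}}

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
ruler.load_from_tag_map(TAG_MAP)

doc = nlp("dogs bark")
doc[0].tag_ = "NN"  # normally set by a tagger earlier in the pipeline
doc = ruler(doc)    # apply the ruler's patterns to the doc
print(doc[0].pos_)  # NOUN
```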

## Word vectors and semantic similarity {#vectors-similarity}
@@ -250,26 +250,26 @@ in your config and see validation errors if the argument values don't match.

The following methods, attributes and commands are new in spaCy v3.0.

| Name | Description |
| --- | --- |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes), [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) | Disable or enable a loaded pipeline component (but don't remove it). |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |

### New and updated documentation {#new-docs}
@@ -304,7 +304,10 @@ format for documenting argument and return types.

  [Layers & Architectures](/usage/layers-architectures),
  [Projects](/usage/projects),
  [Custom pipeline components](/usage/processing-pipelines#custom-components),
  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
  [Morphology](/usage/linguistic-features#morphology),
  [Lemmatization](/usage/linguistic-features#lemmatization),
  [Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions)
- **API Reference:** [Library architecture](/api),
  [Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
@@ -371,19 +374,25 @@ Note that spaCy v3.0 now requires **Python 3.6+**.

  arguments). The `on_match` callback becomes an optional keyword argument.
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
  been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the
  more flexible [`AttributeRuler`](/api/attributeruler).
- The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and
  doesn't provide lemmas by default or switch automatically between lookup and
  rule-based lemmas. You can now add it to your pipeline explicitly and set its
  mode on initialization.

### Removed or renamed API {#incompat-removed}

| Removed | Replacement |
| --- | --- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy init-model` | [`spacy init model`](/api/cli#init-model) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
@@ -557,6 +566,24 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
 + matcher.add("HEALTH", patterns, on_match=on_match)
 ```
 
+### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions}
+
+Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
+v3.0 now manages mappings and exceptions with a separate and more flexible
+pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
+[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The
+`AttributeRuler` provides two handy helper methods
+[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
+[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
+you load in your existing tag map or morph rules:
+
+```diff
+nlp = spacy.blank("en")
+- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
++ ruler = nlp.add_pipe("attribute_ruler")
++ ruler.load_from_tag_map(YOUR_TAG_MAP)
+```
+
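Conceptually, a tag map is just a lookup from fine-grained tags to coarse-grained POS values. A minimal, spaCy-free sketch of that mapping (the tag names and the `"X"` fallback here are illustrative only; a real spaCy tag map also carries morphological features per tag):

```python
# Hypothetical, abbreviated tag map: fine-grained tag -> coarse-grained POS.
TAG_MAP = {
    "NN": "NOUN",   # singular noun
    "NNS": "NOUN",  # plural noun
    "VB": "VERB",   # base-form verb
    "VBD": "VERB",  # past-tense verb
    "JJ": "ADJ",    # adjective
}

def coarse_pos(fine_tag):
    # Fall back to "X" (other) for tags the map does not cover.
    return TAG_MAP.get(fine_tag, "X")

print([coarse_pos(t) for t in ["NN", "VBD", "UH"]])  # -> ['NOUN', 'VERB', 'X']
```

In v3 this kind of table is handed to the `attribute_ruler` component rather than baked into the language data.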
 ### Training models {#migrating-training}
 
 To train your models, you should now pretty much always use the
@@ -602,8 +629,8 @@ If you've exported a starter config from our
 values. You can then use the auto-generated `config.cfg` for training:
 
 ```diff
-### {wrap="true"}
-- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+- python -m spacy train en ./output ./train.json ./dev.json
+--pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
 + python -m spacy train ./config.cfg --output ./output
 ```
@@ -169,7 +169,13 @@ function formatCode(html, lang, prompt) {
     }
     const result = html
         .split('\n')
-        .map((line, i) => (prompt ? replacePrompt(line, prompt, i === 0) : line))
+        .map((line, i) => {
+            let newLine = prompt ? replacePrompt(line, prompt, i === 0) : line
+            if (lang === 'diff' && !line.startsWith('<')) {
+                newLine = highlightCode('python', line)
+            }
+            return newLine
+        })
         .join('\n')
     return htmlToReact(result)
 }
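The new per-line mapping in `formatCode` above can be sketched in Python, with a stub standing in for the site's `highlightCode` helper (the stub's markup is hypothetical): in `diff`-language blocks, lines that do not already start with an HTML tag get re-highlighted as Python.

```python
def highlight_stub(line):
    # Stand-in for the site's highlightCode('python', line) helper.
    return f"<span class='hl'>{line}</span>"

def format_code(html, lang):
    out = []
    for line in html.split("\n"):
        # Only diff blocks get per-line re-highlighting; lines that already
        # start with an HTML tag are passed through untouched.
        if lang == "diff" and not line.startswith("<"):
            line = highlight_stub(line)
        out.append(line)
    return "\n".join(out)

print(format_code("+ added\n<b>kept</b>", "diff"))
```

The `startswith("<")` guard is what keeps already-rendered HTML (e.g. the prompt markup) from being double-highlighted.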
@@ -28,7 +28,6 @@ export default class Juniper extends React.Component {
             mode: this.props.lang,
             theme: this.props.theme,
         })
 
         const runCode = () => this.execute(outputArea, cm.getValue())
         cm.setOption('extraKeys', { 'Shift-Enter': runCode })
         Widget.attach(outputArea, this.outputRef)
@@ -65,12 +65,12 @@
     --color-subtle-dark: hsl(162, 5%, 60%)
 
     --color-green-medium: hsl(108, 66%, 63%)
-    --color-green-transparent: hsla(108, 66%, 63%, 0.11)
+    --color-green-transparent: hsla(108, 66%, 63%, 0.12)
     --color-red-light: hsl(355, 100%, 96%)
     --color-red-medium: hsl(346, 84%, 61%)
     --color-red-dark: hsl(332, 64%, 34%)
     --color-red-opaque: hsl(346, 96%, 89%)
-    --color-red-transparent: hsla(346, 84%, 61%, 0.11)
+    --color-red-transparent: hsla(346, 84%, 61%, 0.12)
     --color-yellow-light: hsl(46, 100%, 95%)
     --color-yellow-medium: hsl(45, 90%, 55%)
     --color-yellow-dark: hsl(44, 94%, 27%)
@@ -79,11 +79,11 @@
     // Syntax Highlighting
     --syntax-comment: hsl(162, 5%, 60%)
     --syntax-tag: hsl(266, 72%, 72%)
-    --syntax-number: hsl(266, 72%, 72%)
+    --syntax-number: var(--syntax-tag)
     --syntax-selector: hsl(31, 100%, 71%)
-    --syntax-operator: hsl(342, 100%, 59%)
     --syntax-function: hsl(195, 70%, 54%)
-    --syntax-keyword: hsl(342, 100%, 59%)
+    --syntax-keyword: hsl(343, 100%, 68%)
+    --syntax-operator: var(--syntax-keyword)
     --syntax-regex: hsl(45, 90%, 55%)
 
     // Other
@@ -354,6 +354,7 @@ body [id]:target
   &.inserted, &.deleted
     padding: 2px 0
     border-radius: 2px
+    opacity: 0.9
 
   &.inserted
     color: var(--color-green-medium)
@@ -388,7 +389,6 @@ body [id]:target
   .token
     color: var(--color-subtle)
 
-
 .gatsby-highlight-code-line
   background-color: var(--color-dark-secondary)
   border-left: 0.35em solid var(--color-theme)
@@ -409,6 +409,7 @@ body [id]:target
     color: var(--color-subtle)
 
   .CodeMirror-line
+    color: var(--syntax-comment)
     padding: 0
 
   .CodeMirror-selected
@@ -418,26 +419,25 @@ body [id]:target
   .CodeMirror-cursor
     border-left-color: currentColor
 
-  .cm-variable-2
-    color: inherit
-    font-style: italic
+  .cm-property, .cm-variable, .cm-variable-2, .cm-meta // decorators
+    color: var(--color-subtle)
 
   .cm-comment
     color: var(--syntax-comment)
 
-  .cm-keyword
+  .cm-keyword, .cm-builtin
     color: var(--syntax-keyword)
 
   .cm-operator
     color: var(--syntax-operator)
 
-  .cm-string, .cm-builtin
+  .cm-string
    color: var(--syntax-selector)
 
   .cm-number
     color: var(--syntax-number)
 
-  .cm-def, .cm-meta
+  .cm-def
     color: var(--syntax-function)
 
   // Jupyter