Update docs [ci skip]

Ines Montani 2020-08-29 18:43:19 +02:00
parent d73f7229c0
commit 9b86312bab
7 changed files with 183 additions and 141 deletions


@ -12,7 +12,8 @@ The attribute ruler lets you set token attributes for tokens identified by
[`Matcher` patterns](/usage/rule-based-matching#matcher). The attribute ruler is
typically used to handle exceptions for token attributes and to map values
between attributes such as mapping fine-grained POS tags to coarse-grained POS
tags. See the [usage guide](/usage/linguistic-features/#mappings-exceptions) for
examples.
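
For example, a minimal sketch of the pattern-plus-attributes workflow, assuming
a loaded pipeline that includes the attribute ruler (the pattern, attributes and
example text here are illustrative, not from a shipped model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.get_pipe("attribute_ruler")
# Illustrative exception: tag the token "lol" as an interjection
patterns = [[{"LOWER": "lol"}]]
ruler.add(patterns=patterns, attrs={"TAG": "UH", "POS": "INTJ"})
doc = nlp("That was funny lol")
print(doc[3].tag_, doc[3].pos_)  # UH INTJ
```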

## Config and implementation {#config}


@ -12,19 +12,16 @@ is then passed on to the next component.
> - **Creates:** Objects, attributes and properties modified and set by the
>   component.
| Name                  | Component                                                           | Creates                                                   | Description                                      |
| --------------------- | ------------------------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------ |
| **tokenizer**         | [`Tokenizer`](/api/tokenizer)                                       | `Doc`                                                     | Segment text into tokens.                        |
| _processing pipeline_ |                                                                     |                                                           |                                                  |
| **tagger**            | [`Tagger`](/api/tagger)                                             | `Token.tag`                                               | Assign part-of-speech tags.                      |
| **parser**            | [`DependencyParser`](/api/dependencyparser)                         | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.                        |
| **ner**               | [`EntityRecognizer`](/api/entityrecognizer)                         | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             | Detect and label named entities.                 |
| **lemmatizer**        | [`Lemmatizer`](/api/lemmatizer)                                     | `Token.lemma`                                             | Assign base forms.                               |
| **textcat**           | [`TextCategorizer`](/api/textcategorizer)                           | `Doc.cats`                                                | Assign document labels.                          |
| **custom**            | [custom components](/usage/processing-pipelines#custom-components)  | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                  | Assign custom attributes, methods or properties. |
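
As a quick sketch of how the attributes these components create can be
inspected after running a pipeline (assuming the small English model is
installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc[:3]:
    # tagger -> token.tag_, parser -> token.dep_, lemmatizer -> token.lemma_
    print(token.text, token.tag_, token.dep_, token.lemma_)
print(doc.ents)  # ner -> doc.ents
```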
The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer


@ -57,41 +57,50 @@ create a surface form. Here are some examples:
Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features. The attribute `Token.morph_` provides the morphological analysis in
the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
format.

> #### 📝 Things to try
>
> 1. Change "I" to "She". You should see that the morphological features change
>    and express that it's a pronoun in the third person.
> 2. Inspect `token.morph_` for the other tokens.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph_)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
```
### Statistical morphology {#morphologizer new="3" model="morphologizer"}

spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
morphological features and coarse-grained part-of-speech tags as `Token.morph`
and `Token.pos`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # English: 'Where are you?'
print(doc[2].morph_)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```
### Rule-based morphology {#rule-based-morphology}

For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
**token text** and **fine-grained part-of-speech tags** to produce
coarse-grained part-of-speech tags and morphological features.

1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
@ -108,16 +117,16 @@ import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph_)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```
## Lemmatization {#lemmatization model="lemmatizer" new="3"}

The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup
and rule-based lemmatization methods in a configurable component. An individual
language can extend the `Lemmatizer` as part of its
[language data](#language-data).

```python
### {executable="true"}
@ -126,36 +135,38 @@ import spacy
# English models include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
```
<Infobox title="Changed in v3.0" variant="warning">

Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
configured to use a single mode such as `"lookup"` or `"rule"` on
initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
component.

</Infobox>

The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided models already include all the required tables, but if you are creating
new models, you'll probably want to install `spacy-lookups-data` to provide the
data when the lemmatizer is initialized.
### Lookup lemmatizer {#lemmatizer-lookup}

For models without a tagger or morphologizer, a lookup lemmatizer can be added
to the pipeline as long as a lookup table is provided, typically through
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token's part-of-speech or context.
```python
# pip install spacy-lookups-data
@ -168,19 +179,18 @@ nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
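
A self-contained sketch of this setup on a blank pipeline (the language choice
here is illustrative; the tables themselves come from `spacy-lookups-data` and
are loaded when the lemmatizer is initialized):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("en")
# Look up lemmas by surface form only, with no POS or context
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
print(nlp.get_pipe("lemmatizer").mode)  # 'lookup'
```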
### Rule-based lemmatizer {#lemmatizer-rule}

When training models that include a component that assigns POS (a morphologizer
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based lemmatizer
can be added using rule tables from
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):
```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```
@ -1734,25 +1744,26 @@ print("After:", [sent.text for sent in doc.sents])
## Mappings & Exceptions {#mappings-exceptions new="3"}

The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and
exceptions** for all token-level attributes. As the number of
[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the `AttributeRuler` provides a single component with a unified
pattern format for all token attribute mappings and exceptions.

The `AttributeRuler` uses
[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify
tokens and then assigns them the provided attributes. If needed, the
[`Matcher`](/api/matcher) patterns can include context around the target token.
For example, the attribute ruler can:

- provide exceptions for any **token attributes**
- map **fine-grained tags** to **coarse-grained tags** for languages without
  statistical morphologizers (replacing the v2.x `tag_map` in the
  [language data](#language-data))
- map token **surface form + fine-grained tags** to **morphological features**
  (replacing the v2.x `morph_rules` in the [language data](#language-data))
- specify the **tags for space tokens** (replacing hard-coded behavior in the
  tagger)

The following example shows how the tag and POS `NNP`/`PROPN` can be specified
@ -1765,41 +1776,42 @@ import spacy
nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" remains unmodified
print(doc2[5].tag_, doc2[5].pos_)  # WP PRON
```
<Infobox variant="warning" title="Migrating from spaCy v2.x">

For easy migration from spaCy v2 to v3, the
[`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules**
in the v2 format with the methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).

```diff
nlp = spacy.blank("en")
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```

</Infobox>
## Word vectors and semantic similarity {#vectors-similarity}


@ -250,26 +250,26 @@ in your config and see validation errors if the argument values don't match.
The following methods, attributes and commands are new in spaCy v3.0.

| Name | Description |
| --- | --- |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes), [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) | Disable or enable a loaded pipeline component (but don't remove it). |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
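
A short sketch of two of these additions in use (assuming the small English
model is installed; the component names depend on the model's pipeline):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Temporarily run only a subset of the pipeline within the block
with nlp.select_pipes(enable=["tok2vec", "tagger"]):
    doc = nlp("Only the enabled components run here.")
# Summarize the components and the annotations they set
nlp.analyze_pipes(pretty=True)
```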

### New and updated documentation {#new-docs}
@ -304,7 +304,10 @@ format for documenting argument and return types.
[Layers & Architectures](/usage/layers-architectures),
[Projects](/usage/projects),
[Custom pipeline components](/usage/processing-pipelines#custom-components),
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
[Morphology](/usage/linguistic-features#morphology),
[Lemmatization](/usage/linguistic-features#lemmatization),
[Mappings & Exceptions](/usage/linguistic-features#mappings-exceptions)
- **API Reference:** [Library architecture](/api),
  [Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
@ -371,19 +374,25 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
  arguments). The `on_match` callback becomes an optional keyword argument.
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
  been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the
  more flexible [`AttributeRuler`](/api/attributeruler).
- The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and
  doesn't provide lemmas by default or switch automatically between lookup and
  rule-based lemmas. You can now add it to your pipeline explicitly and set its
  mode on initialization.
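
For example, the `Matcher.add` change maps the old positional form onto the new
keyword argument like this (a sketch of the equivalent calls):

```diff
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```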
### Removed or renamed API {#incompat-removed}
| Removed | Replacement |
| --- | --- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy init-model` | [`spacy init model`](/api/cli#init-model) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
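
The `Language.disable_pipes` replacement in practice (a sketch of the
equivalent calls; component names are illustrative):

```diff
- with nlp.disable_pipes("parser", "ner"):
+ with nlp.select_pipes(disable=["parser", "ner"]):
      doc = nlp("This text is processed without the disabled components.")
```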
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
@ -557,6 +566,24 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions}

Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
v3.0 now manages mappings and exceptions with a separate and more flexible
pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The
`AttributeRuler` provides two handy helper methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
you load in your existing tag map or morph rules:

```diff
nlp = spacy.blank("en")
- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```
### Training models {#migrating-training}

To train your models, you should now pretty much always use the
@ -602,8 +629,8 @@ If you've exported a starter config from our
values. You can then use the auto-generated `config.cfg` for training:
```diff
- python -m spacy train en ./output ./train.json ./dev.json
--pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```


@ -169,7 +169,13 @@ function formatCode(html, lang, prompt) {
    }
    const result = html
        .split('\n')
        .map((line, i) => {
            let newLine = prompt ? replacePrompt(line, prompt, i === 0) : line
            if (lang === 'diff' && !line.startsWith('<')) {
                newLine = highlightCode('python', line)
            }
            return newLine
        })
        .join('\n')
    return htmlToReact(result)
}


@ -28,7 +28,6 @@ export default class Juniper extends React.Component {
    mode: this.props.lang,
    theme: this.props.theme,
})
const runCode = () => this.execute(outputArea, cm.getValue())
cm.setOption('extraKeys', { 'Shift-Enter': runCode })
Widget.attach(outputArea, this.outputRef)


@ -65,12 +65,12 @@
--color-subtle-dark: hsl(162, 5%, 60%)
--color-green-medium: hsl(108, 66%, 63%)
--color-green-transparent: hsla(108, 66%, 63%, 0.12)
--color-red-light: hsl(355, 100%, 96%)
--color-red-medium: hsl(346, 84%, 61%)
--color-red-dark: hsl(332, 64%, 34%)
--color-red-opaque: hsl(346, 96%, 89%)
--color-red-transparent: hsla(346, 84%, 61%, 0.12)
--color-yellow-light: hsl(46, 100%, 95%)
--color-yellow-medium: hsl(45, 90%, 55%)
--color-yellow-dark: hsl(44, 94%, 27%)
@ -79,11 +79,11 @@
// Syntax Highlighting
--syntax-comment: hsl(162, 5%, 60%)
--syntax-tag: hsl(266, 72%, 72%)
--syntax-number: var(--syntax-tag)
--syntax-selector: hsl(31, 100%, 71%)
--syntax-function: hsl(195, 70%, 54%)
--syntax-keyword: hsl(343, 100%, 68%)
--syntax-operator: var(--syntax-keyword)
--syntax-regex: hsl(45, 90%, 55%)

// Other
@ -354,6 +354,7 @@ body [id]:target
&.inserted, &.deleted
    padding: 2px 0
    border-radius: 2px
    opacity: 0.9

&.inserted
    color: var(--color-green-medium)
@ -388,7 +389,6 @@ body [id]:target
.token
    color: var(--color-subtle)

.gatsby-highlight-code-line
    background-color: var(--color-dark-secondary)
    border-left: 0.35em solid var(--color-theme)
@ -409,6 +409,7 @@ body [id]:target
    color: var(--color-subtle)

.CodeMirror-line
    color: var(--syntax-comment)
    padding: 0

.CodeMirror-selected
@ -418,26 +419,25 @@ body [id]:target
.CodeMirror-cursor
    border-left-color: currentColor

.cm-property, .cm-variable, .cm-variable-2, .cm-meta // decorators
    color: var(--color-subtle)
    font-style: italic

.cm-comment
    color: var(--syntax-comment)

.cm-keyword, .cm-builtin
    color: var(--syntax-keyword)

.cm-operator
    color: var(--syntax-operator)

.cm-string
    color: var(--syntax-selector)

.cm-number
    color: var(--syntax-number)

.cm-def
    color: var(--syntax-function)

// Jupyter