mirror of https://github.com/explosion/spaCy.git
synced 2025-10-30 23:47:31 +03:00

Update docs [ci skip]

This commit is contained in:
parent d73f7229c0
commit 9b86312bab
@@ -12,7 +12,8 @@ The attribute ruler lets you set token attributes for tokens identified by

[`Matcher` patterns](/usage/rule-based-matching#matcher). The attribute ruler is
typically used to handle exceptions for token attributes and to map values
between attributes such as mapping fine-grained POS tags to coarse-grained POS
tags. See the [usage guide](/usage/linguistic-features/#mappings-exceptions) for
examples.

## Config and implementation {#config}
@@ -12,19 +12,16 @@ is then passed on to the next component.

> - **Creates:** Objects, attributes and properties modified and set by the
>   component.

| Name                  | Component                                                          | Creates                                                   | Description                                      |
| --------------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ |
| **tokenizer**         | [`Tokenizer`](/api/tokenizer)                                      | `Doc`                                                     | Segment text into tokens.                        |
| _processing pipeline_ |                                                                    |                                                           |                                                  |
| **tagger**            | [`Tagger`](/api/tagger)                                            | `Token.tag`                                               | Assign part-of-speech tags.                      |
| **parser**            | [`DependencyParser`](/api/dependencyparser)                        | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.                        |
| **ner**               | [`EntityRecognizer`](/api/entityrecognizer)                        | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             | Detect and label named entities.                 |
| **lemmatizer**        | [`Lemmatizer`](/api/lemmatizer)                                    | `Token.lemma`                                             | Assign base forms.                               |
| **textcat**           | [`TextCategorizer`](/api/textcategorizer)                          | `Doc.cats`                                                | Assign document labels.                          |
| **custom**            | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                  | Assign custom attributes, methods or properties. |

The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer
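The split between tokenizer and pipeline components in the table above can be sketched in a few lines, a minimal illustration assuming spaCy v3 is installed (no trained model needed): the tokenizer always runs first and is not listed in `nlp.pipe_names`, while added components are.

```python
import spacy

# Blank English pipeline: only the tokenizer, no pipeline components yet
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Add a rule-based component; it now appears in the pipeline
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

# The sentencizer sets sentence boundaries, so Doc.sents becomes available
doc = nlp("This is a sentence. This is another one.")
print(len(list(doc.sents)))  # 2
```

`spacy.blank` is handy for this kind of experiment because it needs no downloaded model package.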
@@ -57,41 +57,50 @@ create a surface form. Here are some examples:

Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features. The attribute `Token.morph_` provides the morphological analysis in
the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
format.

> #### 📝 Things to try
>
> 1. Change "I" to "She". You should see that the morphological features change
>    and express that it's a pronoun in the third person.
> 2. Inspect `token.morph_` for the other tokens.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph_)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
```
### Statistical morphology {#morphologizer new="3" model="morphologizer"}

spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
morphological features and coarse-grained part-of-speech tags as `Token.morph`
and `Token.pos`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # English: 'Where are you?'
print(doc[2].morph_)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```

### Rule-based morphology {#rule-based-morphology}

For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
**token text** and **fine-grained part-of-speech tags** to produce
coarse-grained part-of-speech tags and morphological features.

1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
@@ -108,16 +117,16 @@

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph_)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```
| 
 | 
 | ||||||
| ## Lemmatization {#lemmatization model="lemmatizer" new="3"} | ## Lemmatization {#lemmatization model="lemmatizer" new="3"} | ||||||
| 
 | 
 | ||||||
| The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup | The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup | ||||||
| and rule-based lemmatization methods in a configurable component. An individual | and rule-based lemmatization methods in a configurable component. An individual | ||||||
| language can extend the `Lemmatizer` as part of its [language | language can extend the `Lemmatizer` as part of its | ||||||
| data](#language-data). | [language data](#language-data). | ||||||
| 
 | 
 | ||||||
| ```python | ```python | ||||||
| ### {executable="true"} | ### {executable="true"} | ||||||
|  | @ -126,36 +135,38 @@ import spacy | ||||||
| # English models include a rule-based lemmatizer | # English models include a rule-based lemmatizer | ||||||
| nlp = spacy.load("en_core_web_sm") | nlp = spacy.load("en_core_web_sm") | ||||||
| lemmatizer = nlp.get_pipe("lemmatizer") | lemmatizer = nlp.get_pipe("lemmatizer") | ||||||
| assert lemmatizer.mode == "rule" | print(lemmatizer.mode)  # 'rule' | ||||||
| 
 | 
 | ||||||
| doc = nlp("I was reading the paper.") | doc = nlp("I was reading the paper.") | ||||||
| assert doc[1].lemma_ == "be" | print([token.lemma_ for token in doc]) | ||||||
| assert doc[2].lemma_ == "read" | # ['I', 'be', 'read', 'the', 'paper', '.'] | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
<Infobox title="Changed in v3.0" variant="warning">

Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
configured to use a single mode such as `"lookup"` or `"rule"` on
initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
component.

</Infobox>

The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided models already include all the required tables, but if you are creating
new models, you'll probably want to install `spacy-lookups-data` to provide the
data when the lemmatizer is initialized.
### Lookup lemmatizer {#lemmatizer-lookup}

For models without a tagger or morphologizer, a lookup lemmatizer can be added
to the pipeline as long as a lookup table is provided, typically through
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token's part-of-speech or context.

@@ -168,19 +179,18 @@

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("en")  # any blank pipeline; the language here is an example
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
### Rule-based lemmatizer {#lemmatizer-rule}

When training models that include a component that assigns POS (a morphologizer
or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based lemmatizer
can be added using rule tables from
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```
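The mode chosen at `add_pipe` time is stored on the component itself; a small sketch, assuming spaCy v3 (the rule tables themselves are only loaded later, during initialization, so no lookups data is needed just to check the mode):

```python
import spacy

nlp = spacy.blank("de")
# add_pipe returns the component instance, so the mode can be checked directly
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
print(lemmatizer.mode)  # 'rule'
```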
@@ -1734,25 +1744,26 @@ print("After:", [sent.text for sent in doc.sents])

## Mappings & Exceptions {#mappings-exceptions new="3"}

The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and
exceptions** for all token-level attributes. As the number of
[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the `AttributeRuler` provides a single component with a unified
pattern format for all token attribute mappings and exceptions.

The `AttributeRuler` uses
[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify
tokens and then assigns them the provided attributes. If needed, the
[`Matcher`](/api/matcher) patterns can include context around the target token.
For example, the attribute ruler can:

- provide exceptions for any **token attributes**
- map **fine-grained tags** to **coarse-grained tags** for languages without
  statistical morphologizers (replacing the v2.x `tag_map` in the
  [language data](#language-data))
- map token **surface form + fine-grained tags** to **morphological features**
  (replacing the v2.x `morph_rules` in the [language data](#language-data))
- specify the **tags for space tokens** (replacing hard-coded behavior in the
  tagger)
The following example shows how the tag and POS `NNP`/`PROPN` can be specified

@@ -1765,41 +1776,42 @@

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" remains unmodified
print(doc2[5].tag_, doc2[5].pos_)  # WP PRON
```
<Infobox variant="warning" title="Migrating from spaCy v2.x">

For easy migration from spaCy v2 to v3, the
[`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules**
in the v2 format with the methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).

```diff
nlp = spacy.blank("en")
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```

</Infobox>

## Word vectors and semantic similarity {#vectors-similarity}
@@ -250,26 +250,26 @@ in your config and see validation errors if the argument values don't match.

The following methods, attributes and commands are new in spaCy v3.0.

| Name | Description |
| --- | --- |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes), [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) | Disable or enable a loaded pipeline component (but don't remove it). |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| | [`project`](/api/cli#project)                                                                                                 | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects).                                                                                                       | | | [`project`](/api/cli#project)                                                                                                   | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects).                                                                                                       | | ||||||
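The `registry` entry in the table above maps functions to string names so they can be referenced from configs. A minimal, hedged sketch of that pattern (the registered name `whitespace_splitter.v1` is made up for illustration):

```python
import spacy

# Register a factory under a string name in the "misc" registry; a config
# could then reference it as @misc = "whitespace_splitter.v1"
@spacy.registry.misc("whitespace_splitter.v1")
def create_whitespace_splitter():
    def split(text):
        return text.split()
    return split

# Resolve the factory back from its registered name and call it
splitter = spacy.registry.misc.get("whitespace_splitter.v1")()
tokens = splitter("hello world")
```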
| 
 | 
 | ||||||
| ### New and updated documentation {#new-docs} | ### New and updated documentation {#new-docs} | ||||||
| 
 | 
 | ||||||
|  | @ -304,7 +304,10 @@ format for documenting argument and return types. | ||||||
|   [Layers & Architectures](/usage/layers-architectures), |   [Layers & Architectures](/usage/layers-architectures), | ||||||
|   [Projects](/usage/projects), |   [Projects](/usage/projects), | ||||||
|   [Custom pipeline components](/usage/processing-pipelines#custom-components), |   [Custom pipeline components](/usage/processing-pipelines#custom-components), | ||||||
|   [Custom tokenizers](/usage/linguistic-features#custom-tokenizer) |   [Custom tokenizers](/usage/linguistic-features#custom-tokenizer), | ||||||
|  |   [Morphology](/usage/linguistic-features#morphology), | ||||||
|  |   [Lemmatization](/usage/linguistic-features#lemmatization), | ||||||
|  |   [Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions) | ||||||

| - **API Reference:** [Library architecture](/api), | - **API Reference:** [Library architecture](/api), |
|   [Model architectures](/api/architectures), [Data formats](/api/data-formats) |   [Model architectures](/api/architectures), [Data formats](/api/data-formats) | ||||||
| - **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec), | - **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec), |
|  | @ -371,19 +374,25 @@ Note that spaCy v3.0 now requires **Python 3.6+**. | ||||||
|   arguments). The `on_match` callback becomes an optional keyword argument. |   arguments). The `on_match` callback becomes an optional keyword argument. | ||||||
| - The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have | - The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have |
|   been removed. |   been removed. | ||||||
|  | - The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the | ||||||
|  |   more flexible [`AttributeRuler`](/api/attributeruler). | ||||||
|  | - The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and | ||||||
|  |   doesn't provide lemmas by default or switch automatically between lookup and | ||||||
|  |   rule-based lemmas. You can now add it to your pipeline explicitly and set its | ||||||
|  |   mode on initialization. | ||||||
| 
 | 
 | ||||||
| ### Removed or renamed API {#incompat-removed} | ### Removed or renamed API {#incompat-removed} | ||||||
| 
 | 
 | ||||||
| | Removed                                                  | Replacement                                                                                | | | Removed                                                  | Replacement                                                                                                  | | ||||||
| | -------------------------------------------------------- | ------------------------------------------------------------------------------------------ | | | -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | | ||||||
| | `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes)                                      | | | `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe) | | ||||||
| | `GoldParse`                                              | [`Example`](/api/example)                                                                  | | | `GoldParse`                                              | [`Example`](/api/example)                                                                                    | | ||||||
| | `GoldCorpus`                                             | [`Corpus`](/api/corpus)                                                                    | | | `GoldCorpus`                                             | [`Corpus`](/api/corpus)                                                                                      | | ||||||
| | `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`          | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) | | | `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`          | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk)                   | | ||||||
| | `spacy init-model`                                       | [`spacy init model`](/api/cli#init-model)                                                  | | | `spacy init-model`                                       | [`spacy init model`](/api/cli#init-model)                                                                    | | ||||||
| | `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)                                                  | | | `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)                                                                    | | ||||||
| | `spacy profile`                                          | [`spacy debug profile`](/api/cli#debug-profile)                                            | | | `spacy profile`                                          | [`spacy debug profile`](/api/cli#debug-profile)                                                              | | ||||||
| | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated                                                  | | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated                                                                    | | ||||||
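The `Language.select_pipes` replacement in the table above can be sketched as follows. This is an illustrative example only, using the lightweight `sentencizer` so no trained model is required:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# v3: select_pipes replaces disable_pipes; as a context manager, the
# disabled components are restored automatically on exit
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)   # disabled components are excluded
outside = list(nlp.pipe_names)      # restored after the block
```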
| 
 | 
 | ||||||
| The following deprecated methods, attributes and arguments were removed in v3.0. | The following deprecated methods, attributes and arguments were removed in v3.0. | ||||||
| Most of them have been **deprecated for a while** and many would previously | Most of them have been **deprecated for a while** and many would previously | ||||||
|  | @ -557,6 +566,24 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")] | ||||||
| + matcher.add("HEALTH", patterns, on_match=on_match) | + matcher.add("HEALTH", patterns, on_match=on_match) | ||||||
| ``` | ``` | ||||||
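Putting the new `Matcher.add` signature from the diff above into a self-contained example (the `"HEALTH"` key and toy patterns are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# v3: all patterns for a key are passed as one list (the second argument);
# on_match is now an optional keyword argument
patterns = [
    [{"LOWER": "health"}, {"LOWER": "care"}],
    [{"LOWER": "healthcare"}],
]
matcher.add("HEALTH", patterns)

doc = nlp("healthcare reform and health care reform")
matches = matcher(doc)  # one match per pattern occurrence
```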
| 
 | 
 | ||||||
|  | ### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions} | ||||||
|  | 
 | ||||||
|  | Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy | ||||||
|  | v3.0 now manages mappings and exceptions with a separate and more flexible | ||||||
|  | pipeline component, the [`AttributeRuler`](/api/attributeruler). See the | ||||||
|  | [usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The | ||||||
|  | `AttributeRuler` provides two handy helper methods | ||||||
|  | [`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and | ||||||
|  | [`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let | ||||||
|  | you load in your existing tag map or morph rules: | ||||||
|  | 
 | ||||||
|  | ```diff | ||||||
|  | nlp = spacy.blank("en") | ||||||
|  | - nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP) | ||||||
|  | + ruler = nlp.add_pipe("attribute_ruler") | ||||||
|  | + ruler.load_from_tag_map(YOUR_TAG_MAP) | ||||||
|  | ``` | ||||||
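Besides the tag-map helper shown above, the `AttributeRuler` also accepts `Matcher` patterns directly via its `add` method. A small runnable sketch, where the rule itself is a toy exception:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Toy exception: the English tokenizer splits "can't" into "ca" + "n't";
# map the token "ca" to the lemma "can"
patterns = [[{"ORTH": "ca"}]]
ruler.add(patterns=patterns, attrs={"LEMMA": "can"})

doc = nlp("can't stop")
lemma = doc[0].lemma_
```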
|  | 
 | ||||||
| ### Training models {#migrating-training} | ### Training models {#migrating-training} | ||||||
| 
 | 
 | ||||||
| To train your models, you should now pretty much always use the | To train your models, you should now pretty much always use the | ||||||
|  | @ -602,8 +629,8 @@ If you've exported a starter config from our | ||||||
| values. You can then use the auto-generated `config.cfg` for training: | values. You can then use the auto-generated `config.cfg` for training: | ||||||
| 
 | 
 | ||||||
| ```diff | ```diff | ||||||
| ### {wrap="true"} | - python -m spacy train en ./output ./train.json ./dev.json | ||||||
| - python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0 | --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0 | ||||||
| + python -m spacy train ./config.cfg --output ./output | + python -m spacy train ./config.cfg --output ./output | ||||||
| ``` | ``` | ||||||
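The single `config.cfg` consolidates the settings that used to be spread across CLI flags. An abbreviated, illustrative fragment, where the paths, pipeline, and epoch count are placeholders:

```ini
[paths]
train = "./train.spacy"
dev = "./dev.spacy"

[nlp]
lang = "en"
pipeline = ["tagger","parser"]

[training]
max_epochs = 10
```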
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -169,7 +169,13 @@ function formatCode(html, lang, prompt) { | ||||||
|     } |     } | ||||||
|     const result = html |     const result = html | ||||||
|         .split('\n') |         .split('\n') | ||||||
|         .map((line, i) => (prompt ? replacePrompt(line, prompt, i === 0) : line)) |         .map((line, i) => { | ||||||
|  |             let newLine = prompt ? replacePrompt(line, prompt, i === 0) : line | ||||||
|  |             if (lang === 'diff' && !line.startsWith('<')) { | ||||||
|  |                 newLine = highlightCode('python', line) | ||||||
|  |             } | ||||||
|  |             return newLine | ||||||
|  |         }) | ||||||
|         .join('\n') |         .join('\n') | ||||||
|     return htmlToReact(result) |     return htmlToReact(result) | ||||||
| } | } | ||||||
|  |  | ||||||
|  | @ -28,7 +28,6 @@ export default class Juniper extends React.Component { | ||||||
|             mode: this.props.lang, |             mode: this.props.lang, | ||||||
|             theme: this.props.theme, |             theme: this.props.theme, | ||||||
|         }) |         }) | ||||||
| 
 |  | ||||||
|         const runCode = () => this.execute(outputArea, cm.getValue()) |         const runCode = () => this.execute(outputArea, cm.getValue()) | ||||||
|         cm.setOption('extraKeys', { 'Shift-Enter': runCode }) |         cm.setOption('extraKeys', { 'Shift-Enter': runCode }) | ||||||
|         Widget.attach(outputArea, this.outputRef) |         Widget.attach(outputArea, this.outputRef) | ||||||
|  |  | ||||||
|  | @ -65,12 +65,12 @@ | ||||||
|     --color-subtle-dark: hsl(162, 5%, 60%) |     --color-subtle-dark: hsl(162, 5%, 60%) | ||||||
| 
 | 
 | ||||||
|     --color-green-medium: hsl(108, 66%, 63%) |     --color-green-medium: hsl(108, 66%, 63%) | ||||||
|     --color-green-transparent: hsla(108, 66%, 63%, 0.11) |     --color-green-transparent: hsla(108, 66%, 63%, 0.12) | ||||||
|     --color-red-light: hsl(355, 100%, 96%) |     --color-red-light: hsl(355, 100%, 96%) | ||||||
|     --color-red-medium: hsl(346, 84%, 61%) |     --color-red-medium: hsl(346, 84%, 61%) | ||||||
|     --color-red-dark: hsl(332, 64%, 34%) |     --color-red-dark: hsl(332, 64%, 34%) | ||||||
|     --color-red-opaque: hsl(346, 96%, 89%) |     --color-red-opaque: hsl(346, 96%, 89%) | ||||||
|     --color-red-transparent: hsla(346, 84%, 61%, 0.11) |     --color-red-transparent: hsla(346, 84%, 61%, 0.12) | ||||||
|     --color-yellow-light: hsl(46, 100%, 95%) |     --color-yellow-light: hsl(46, 100%, 95%) | ||||||
|     --color-yellow-medium: hsl(45, 90%, 55%) |     --color-yellow-medium: hsl(45, 90%, 55%) | ||||||
|     --color-yellow-dark: hsl(44, 94%, 27%) |     --color-yellow-dark: hsl(44, 94%, 27%) | ||||||
|  | @ -79,11 +79,11 @@ | ||||||
|     // Syntax Highlighting |     // Syntax Highlighting | ||||||
|     --syntax-comment: hsl(162, 5%, 60%) |     --syntax-comment: hsl(162, 5%, 60%) | ||||||
|     --syntax-tag: hsl(266, 72%, 72%) |     --syntax-tag: hsl(266, 72%, 72%) | ||||||
|     --syntax-number: hsl(266, 72%, 72%) |     --syntax-number: var(--syntax-tag) | ||||||
|     --syntax-selector: hsl(31, 100%, 71%) |     --syntax-selector: hsl(31, 100%, 71%) | ||||||
|     --syntax-operator: hsl(342, 100%, 59%) |  | ||||||
|     --syntax-function: hsl(195, 70%, 54%) |     --syntax-function: hsl(195, 70%, 54%) | ||||||
|     --syntax-keyword: hsl(342, 100%, 59%) |     --syntax-keyword: hsl(343, 100%, 68%) | ||||||
|  |     --syntax-operator: var(--syntax-keyword) | ||||||
|     --syntax-regex: hsl(45, 90%, 55%) |     --syntax-regex: hsl(45, 90%, 55%) | ||||||
| 
 | 
 | ||||||
|     // Other |     // Other | ||||||
|  | @ -354,6 +354,7 @@ body [id]:target | ||||||
|     &.inserted, &.deleted |     &.inserted, &.deleted | ||||||
|         padding: 2px 0 |         padding: 2px 0 | ||||||
|         border-radius: 2px |         border-radius: 2px | ||||||
|  |         opacity: 0.9 | ||||||
| 
 | 
 | ||||||
|     &.inserted |     &.inserted | ||||||
|         color: var(--color-green-medium) |         color: var(--color-green-medium) | ||||||
|  | @ -388,7 +389,6 @@ body [id]:target | ||||||
|     .token |     .token | ||||||
|         color: var(--color-subtle) |         color: var(--color-subtle) | ||||||
| 
 | 
 | ||||||
| 
 |  | ||||||
| .gatsby-highlight-code-line | .gatsby-highlight-code-line | ||||||
|     background-color: var(--color-dark-secondary) |     background-color: var(--color-dark-secondary) | ||||||
|     border-left: 0.35em solid var(--color-theme) |     border-left: 0.35em solid var(--color-theme) | ||||||
|  | @ -409,6 +409,7 @@ body [id]:target | ||||||
|     color: var(--color-subtle) |     color: var(--color-subtle) | ||||||
| 
 | 
 | ||||||
|     .CodeMirror-line |     .CodeMirror-line | ||||||
|  |         color: var(--syntax-comment) | ||||||
|         padding: 0 |         padding: 0 | ||||||
| 
 | 
 | ||||||
|     .CodeMirror-selected |     .CodeMirror-selected | ||||||
|  | @ -418,26 +419,25 @@ body [id]:target | ||||||
|     .CodeMirror-cursor |     .CodeMirror-cursor | ||||||
|         border-left-color: currentColor |         border-left-color: currentColor | ||||||
| 
 | 
 | ||||||
|     .cm-variable-2 |     .cm-property, .cm-variable, .cm-variable-2, .cm-meta // decorators | ||||||
|         color: inherit |         color: var(--color-subtle) | ||||||
|         font-style: italic |  | ||||||
| 
 | 
 | ||||||
|     .cm-comment |     .cm-comment | ||||||
|         color: var(--syntax-comment) |         color: var(--syntax-comment) | ||||||
| 
 | 
 | ||||||
|     .cm-keyword |     .cm-keyword, .cm-builtin | ||||||
|         color: var(--syntax-keyword) |         color: var(--syntax-keyword) | ||||||
| 
 | 
 | ||||||
|     .cm-operator |     .cm-operator | ||||||
|         color: var(--syntax-operator) |         color: var(--syntax-operator) | ||||||
| 
 | 
 | ||||||
|     .cm-string, .cm-builtin |     .cm-string | ||||||
|         color: var(--syntax-selector) |         color: var(--syntax-selector) | ||||||
| 
 | 
 | ||||||
|     .cm-number |     .cm-number | ||||||
|         color: var(--syntax-number) |         color: var(--syntax-number) | ||||||
| 
 | 
 | ||||||
|     .cm-def, .cm-meta |     .cm-def | ||||||
|         color: var(--syntax-function) |         color: var(--syntax-function) | ||||||
| 
 | 
 | ||||||
| // Jupyter | // Jupyter | ||||||
|  |  | ||||||