Add details about pretrained pipeline design

2025-11-22 02:36:03 +03:00 · 2021-03-17 11:29:57 +01:00 · 2021-03-17 11:29:57 +01:00 · a5ffe8dfed
commit a5ffe8dfed
parent 61472e7cb3
1 changed files with 144 additions and 0 deletions
--- a/website/docs/models/index.md
+++ b/website/docs/models/index.md
@ -4,6 +4,7 @@ teaser: Downloadable trained pipelines and weights for spaCy
 menu:
  - ['Quickstart', 'quickstart']
  - ['Conventions', 'conventions']
  - ['Pipeline Design', 'design']
 ---
 <!-- TODO: include interactive demo -->
@ -53,3 +54,146 @@ For a detailed compatibility overview, see the
 [`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json).
 This is also the source of spaCy's internal compatibility check, performed when
 you run the [`download`](/api/cli#download) command.
 ## Pretrained pipeline design {#design}
 The spaCy v3 pretrained pipelines are designed to be efficient and configurable.
 For example, multiple components can share a common "token-to-vector" model and
 it's easy to swap out or disable the lemmatizer. The pipelines are designed to
 be efficient in terms of speed and size and work well when the pipeline is run
 in full.
 When modifying a pretrained v3 pipeline, it's important to understand how the
 components **depend on** each other. Unlike spaCy v2, where the `tagger`,
 `parser` and `ner` components were all independent, some v3 components depend on
 earlier components in the pipeline. As a result, disabling or reordering
 components can affect the annotation quality or lead to warnings and errors.
 Main changes from spaCy v2 models:
 - The [`Tok2Vec`](/api/tok2vec) component may be a separate, shared component. A
  component like a tagger or parser can
  [listen](/api/architectures#Tok2VecListener) to an earlier `tok2vec` or
  `transformer` rather than having its own separate tok2vec layer.
 - Rule-based exceptions move from individual components to the
  `attribute_ruler`. Lemma and POS exceptions move from the tokenizer exceptions
  to the attribute ruler and the tag map and morph rules move from the tagger to
  the attribute ruler.
 - The lemmatizer tables and processing move from the vocab and tagger to a
  separate `lemmatizer` component.
 ### CNN/CPU pipeline design
 In the `sm`/`md`/`lg` models:
 - The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
  component.
 - The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
  `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
  tagged consistently and copies `token.pos` to `token.tag` if there is no
  tagger. For English, the attribute ruler can improve its mapping from
  `token.tag` to `token.pos` if dependency parses from a `parser` are present,
  but the parser is not required.
 - The rule-based `lemmatizer` (Dutch, English, French, Greek, Macedonian,
  Norwegian and Spanish) requires `token.pos` annotation from either
  `tagger`+`attribute_ruler` or `morphologizer`.
 - The `ner` component is independent with its own internal tok2vec layer.
 <!-- TODO: pretty diagram -->
 ### Transformer pipeline design
 In the tranformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
 all listen to the `transformer` component. The `attribute_ruler` and
 `lemmatizer` have the same configuration as in the CNN models.
 <!-- TODO: pretty diagram -->
 ### Modifying the default pipeline
 For faster processing, you may only want to run a subset of the components in a
 pretrained pipeline. The `disable` and `exclude` arguments to
 [`spacy.load`](/api/top-level#spacy.load) let you control which components are
 loaded and run. Disabled components are loaded in the background so it's
 possible to reenable them in the same pipeline in the future with
 [`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
 completely, use `exclude` instead of `disable`.
 #### Disable part-of-speech tagging and lemmatization
 To disable part-of-speech tagging and lemmatization, disable the `tagger`,
 `morphologizer`, `attribute_ruler` and `lemmatizer` components.
 ```python
 # Note: English doesn't include a morphologizer
 nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
 nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"])
 ```
 <Infobox variant="warning" title="Rule-based lemmatizers require Token.pos">
 The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
 Dutch, English, French, Greek, Macedonian, Norwegian and Spanish. If you disable
 any of these components, you'll see lemmatizer warnings unless the lemmatizer is
 also disabled.
 </Infobox>
 #### Use senter rather than parser for fast sentence segmentation
 If you need fast sentence segmentation without dependency parses, disable the
 `parser` use the `senter` component instead:
 ```python
 nlp = spacy.load("en_core_web_sm")
 nlp.disable_pipe("parser")
 nlp.enable_pipe("senter")
 ```
 The `senter` component is ~10&times; faster than the parser and more accurate
 than the rule-based `sentencizer`.
 #### Switch from rule-based to lookup lemmatization
 For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
 pipelines, you can switch from the default rule-based lemmatizer to a lookup
 lemmatizer:
 ```python
 # Requirements: pip install spacy-lookups-data
 nlp = spacy.load("en_core_web_sm")
 nlp.remove_pipe("lemmatizer")
 nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
 ```
 #### Disable everything except NER
 For the non-transformer models, the `ner` component is independent, so you can
 disable everything else:
 ```python
 nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
 ```
 In the transformer models, `ner` listens to the `transformer` layer, so you can
 disable all components related tagging, parsing, and lemmatization.
 ```python
 nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
 ```
 #### Move NER to the end of the pipeline
 For access to `POS` and `LEMMA` features in an `entity_ruler`, move `ner` to the
 end of the pipeline after `attribute_ruler` and `lemmatizer`:
 ```python
 # load without NER
 nlp = spacy.load("en_core_web_sm", exclude=["ner"])
 # source NER from the same pipeline package as the last component
 nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))
 # insert the entity ruler
 nlp.add_pipe("entity_ruler", before="ner")
 ```