Update docs

2025-08-01 10:59:55 +03:00 · 2020-08-05 15:00:54 +02:00 · cdec46493f
commit cdec46493f
parent ab5ef37abb
4 changed files with 180 additions and 28 deletions
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -363,7 +363,7 @@ that take a `Doc` object, modify it and return it. Only one of `before`,
 <Infobox title="Changed in v3.0" variant="warning">

 As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't
-take callables anymore and instead expects the name of a component factory
+take callables anymore and instead expects the **name of a component factory**
 registered using [`@Language.component`](/api/language#component) or
 [`@Language.factory`](/api/language#factory). It now takes care of creating the
 component, adds it to the pipeline and returns it.
@@ -379,20 +379,25 @@ component, adds it to the pipeline and returns it.
 >
 > nlp.add_pipe("component", before="ner")
 > component = nlp.add_pipe("component", name="custom_name", last=True)
+>
+> # Add component from source model
+> source_nlp = spacy.load("en_core_web_sm")
+> nlp.add_pipe("ner", source=source_nlp)
 > ```

-| Name                                   | Type             | Description                                                                                                                                               |
-| -------------------------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `factory_name`                         | str              | Name of the registered component factory.                                                                                                                 |
-| `name`                                 | str              | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
-| _keyword-only_                         |                  |                                                                                                                                                           |
-| `before`                               | str / int        | Component name or index to insert component directly before.                                                                                              |
-| `after`                                | str / int        | Component name or index to insert component directly after:                                                                                               |
-| `first`                                | bool             | Insert component first / not first in the pipeline.                                                                                                       |
-| `last`                                 | bool             | Insert component last / not last in the pipeline.                                                                                                         |
-| `config` <Tag variant="new">3</Tag>    | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory.                        |
-| `validate` <Tag variant="new">3</Tag>  | bool             | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`.                                     |
-| **RETURNS** <Tag variant="new">3</Tag> | callable         | The pipeline component.                                                                                                                                   |
+| Name                                   | Type             | Description                                                                                                                                                                                                                                              |
+| -------------------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `factory_name`                         | str              | Name of the registered component factory.                                                                                                                                                                                                                |
+| `name`                                 | str              | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline.                                                                                                |
+| _keyword-only_                         |                  |                                                                                                                                                                                                                                                          |
+| `before`                               | str / int        | Component name or index to insert component directly before.                                                                                                                                                                                             |
+| `after`                                | str / int        | Component name or index to insert component directly after:                                                                                                                                                                                              |
+| `first`                                | bool             | Insert component first / not first in the pipeline.                                                                                                                                                                                                      |
+| `last`                                 | bool             | Insert component last / not last in the pipeline.                                                                                                                                                                                                        |
+| `config` <Tag variant="new">3</Tag>    | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory.                                                                                                                       |
+| `source` <Tag variant="new">3</Tag>    | `Language`       | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. |
+| `validate` <Tag variant="new">3</Tag>  | bool             | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`.                                                                                                                                    |
+| **RETURNS** <Tag variant="new">3</Tag> | callable         | The pipeline component.                                                                                                                                                                                                                                  |

 ## Language.has_factory {#has_factory tag="classmethod" new="3"}

--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -4,6 +4,7 @@ menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
+  - ['Loaders & Batchers', 'loaders-batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
 ---
@@ -34,6 +35,7 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).
 | Name                                       | Type              | Description                                                                       |
 | ------------------------------------------ | ----------------- | --------------------------------------------------------------------------------- |
 | `name`                                     | str / `Path`      | Model to load, i.e. package name or path.                                         |
+| _keyword-only_                             |                   |                                                                                   |
 | `disable`                                  | `List[str]`       | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
 | `component_cfg` <Tag variant="new">3</Tag> | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names.      |
 | **RETURNS**                                | `Language`        | A `Language` object with the loaded model.                                        |
@@ -83,11 +85,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
 > markdown = spacy.info(markdown=True, silent=True)
 > ```

-| Name       | Type | Description                                      |
-| ---------- | ---- | ------------------------------------------------ |
-| `model`    | str  | A model, i.e. a package name or path (optional). |
-| `markdown` | bool | Print information as Markdown.                   |
-| `silent`   | bool | Don't print anything, just return.               |
+| Name           | Type | Description                                      |
+| -------------- | ---- | ------------------------------------------------ |
+| `model`        | str  | A model, i.e. a package name or path (optional). |
+| _keyword-only_ |      |                                                  |
+| `markdown`     | bool | Print information as Markdown.                   |
+| `silent`       | bool | Don't print anything, just return.               |

 ### spacy.explain {#spacy.explain tag="function"}

@@ -331,6 +334,10 @@ See the [`Transformer`](/api/transformer) API reference and
 | [`span_getters`](/api/transformer#span_getters)              | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
 | [`annotation_setters`](/api/transformers#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

+## Training data loaders and batchers {#loaders-batchers new="3"}
+
+<!-- TODO: -->
+
 ## Training data and alignment {#gold source="spacy/gold"}

 ### gold.docs_to_json {#docs_to_json tag="function"}
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -311,6 +311,62 @@ nlp.rename_pipe("ner", "entityrecognizer")
 nlp.replace_pipe("tagger", my_custom_tagger)
 ```

+### Sourcing pipeline components from existing models {#sourced-components new="3"}
+
+Pipeline components that are independent can also be reused across models.
+Instead of adding a new blank component to a pipeline, you can copy an
+existing component from a pretrained model by setting the `source` argument on
+[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
+interpreted as the name of the component in the source pipeline – for instance,
+`"ner"`. This is especially useful for
+[training a model](/usage/training#config-components) because it lets you mix
+and match components and create fully custom model packages with updated
+pretrained components and new components trained on your data.
+
+<Infobox variant="warning" title="Important note for pretrained components">
+
+When reusing components across models, keep in mind that the **vocabulary**,
+**vectors** and model settings **must match**. If a pretrained model includes
+[word vectors](/usage/vectors-embeddings) and the component uses them as
+features, the model you copy it to needs to have the _same_ vectors available –
+otherwise, it won't be able to make the same predictions.
+
+</Infobox>
+
+> #### In training config
+>
+> Instead of providing a `factory`, component blocks in the training
+> [config](/usage/training#config) can also define a `source`. The string needs
+> to be a loadable spaCy model package or path.
+>
+> ```ini
+> [components.ner]
+> source = "en_core_web_sm"
+> component = "ner"
+> ```
+>
+> By default, sourced components will be updated with your data during training.
+> If you want to preserve the component as-is, you can "freeze" it:
+>
+> ```ini
+> [training]
+> frozen_components = ["ner"]
+> ```
+
+```python
+### {executable="true"}
+import spacy
+
+# The source model with different components
+source_nlp = spacy.load("en_core_web_sm")
+print(source_nlp.pipe_names)
+
+# Add only the entity recognizer to the new blank model
+nlp = spacy.blank("en")
+nlp.add_pipe("ner", source=source_nlp)
+print(nlp.pipe_names)
+```
+
 ### Analyzing pipeline components {#analysis new="3"}

 The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -149,12 +149,14 @@ not just define static settings, but also construct objects like architectures,
 schedules, optimizers or any other custom components. The main top-level
 sections of a config file are:

-| Section       | Description                                                                                                           |
-| ------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `training`    | Settings and controls for the training and evaluation process.                                                        |
-| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining).                                    |
-| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
-| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                               |
+| Section       | Description                                                                                                                                            |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names.                                  |
+| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                                                                |
+| `paths`       | Paths to data and other assets. Can be re-used across the config as variables, e.g. `${paths:train}`, and [overwritten](#config-overrides) on the CLI. |
+| `system`      | Settings related to system and hardware.                                                                                                               |
+| `training`    | Settings and controls for the training and evaluation process.                                                                                         |
+| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining).                                                                     |

 <Infobox title="Config format and settings" emoji="📖">

@@ -168,7 +170,7 @@ available for the different architectures are documented with the

 </Infobox>

-#### Overwriting config settings on the command line {#config-overrides}
+### Overwriting config settings on the command line {#config-overrides}

 The config system means that you can define all settings **in one place** and in
 a consistent format. There are no command-line arguments that need to be set,
@@ -192,7 +194,87 @@ of the training, the final filled `config.cfg` is exported with your model, so
 you'll always have a record of the settings that were used, including your
 overrides.

-#### Using registered functions {#config-functions}
+### Defining pipeline components {#config-components}
+
+When you train a model, you typically train a
+[pipeline](/usage/processing-pipelines) of **one or more components**. The
+`[components]` block in the config defines the available pipeline components and
+how they should be created – either by a built-in or custom
+[factory](/usage/processing-pipelines#built-in), or
+[sourced](/usage/processing-pipelines#sourced-components) from an existing
+pretrained model. For example, `[components.parser]` defines the component named
+`"parser"` in the pipeline. There are different ways you might want to treat
+your components during training, and the most common scenarios are:
+
+1. Train a **new component** from scratch on your data.
+2. Update an existing **pretrained component** with more examples.
+3. Include an existing pretrained component without updating it.
+4. Include a non-trainable component, like a rule-based
+   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
+   fully [custom component](/usage/processing-pipelines#custom-components).
+
+If a component block defines a `factory`, spaCy will look it up in the
+[built-in](/usage/processing-pipelines#built-in) or
+[custom](/usage/processing-pipelines#custom-components) components and create a
+new component from scratch. All settings defined in the config block will be
+passed to the component factory as arguments. This lets you configure the model
+settings and hyperparameters. If a component block defines a `source`, the
+component will be copied over from an existing pretrained model, with its
+existing weights. This lets you include an already trained component in your
+model pipeline, or update a pretrained component with more data specific to
+your use case.
+
+```ini
+### config.cfg (excerpt)
+[components]
+
+# "parser" and "ner" are sourced from pretrained model
+[components.parser]
+source = "en_core_web_sm"
+
+[components.ner]
+source = "en_core_web_sm"
+
+# "textcat" and "custom" are created blank from built-in / custom factory
+[components.textcat]
+factory = "textcat"
+
+[components.custom]
+factory = "your_custom_factory"
+your_custom_setting = true
+```
+
+The `pipeline` setting in the `[nlp]` block defines the pipeline components
+added to the pipeline, in order. For example, `"parser"` here references
+`[components.parser]`. By default, spaCy will **update all components that can
+be updated**. Trainable components that are created from scratch are initialized
+with random weights. For sourced components, spaCy will keep the existing
+weights and [resume training](/api/language#resume_training).
+
+If you don't want a component to be updated, you can **freeze** it by adding it
+to the `frozen_components` list in the `[training]` block. Frozen components are
+**not updated** during training and are included in the final trained model
+as-is.
+
+> #### Note on frozen components
+>
+> Even though frozen components are not **updated** during training, they will
+> still **run** during training and evaluation. This is very important, because
+> they may still impact your model's performance – for instance, a sentence
+> boundary detector can impact what the parser or entity recognizer considers a
+> valid parse. So the evaluation results should always reflect what your model
+> will produce at runtime.
+
+```ini
+[nlp]
+lang = "en"
+pipeline = ["parser", "ner", "textcat", "custom"]
+
+[training]
+frozen_components = ["parser", "custom"]
+```
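
To make the interplay between `pipeline`, the `[components]` blocks and `frozen_components` concrete, here is a minimal sketch that reads the two config excerpts above with Python's standard `configparser` and `json` modules. For each component it reports whether it is sourced or created from a factory, and whether it would be updated during training. This is an illustration only and is not how spaCy loads its config; the helper name `summarize_components` is hypothetical.

```python
import json
from configparser import ConfigParser

# The two excerpts from above, combined into one config string.
CONFIG = """
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[components]

[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true

[training]
frozen_components = ["parser", "custom"]
"""


def summarize_components(config_text):
    """Return (name, origin, is_updated) for each component, in pipeline order.

    Hypothetical helper for illustration, not part of spaCy.
    """
    parser = ConfigParser()
    parser.read_string(config_text)
    # Values in the config are JSON-formatted, so lists can be parsed with json.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = set(json.loads(parser["training"]["frozen_components"]))
    summary = []
    for name in pipeline:
        block = parser[f"components.{name}"]
        origin = "sourced" if "source" in block else "factory"
        summary.append((name, origin, name not in frozen))
    return summary


for name, origin, is_updated in summarize_components(CONFIG):
    print(name, origin, "updated" if is_updated else "frozen")
# parser sourced frozen
# ner sourced updated
# textcat factory updated
# custom factory frozen
```

Note that the frozen components still appear in the pipeline: they run during training and evaluation, they are just not updated.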
+
+### Using registered functions {#config-functions}

 The training configuration defined in the config file doesn't have to only
 consist of static values. Some settings can also be **functions**. For instance,
@@ -373,7 +455,9 @@ In your config, you can now reference the schedule in the
 starting with an `@`, it's interpreted as a reference to a function. All other
 settings in the block will be passed to the function as keyword arguments. Keep
 in mind that the config shouldn't have any hidden defaults and all arguments on
-the functions need to be represented in the config.
+the functions need to be represented in the config. If your function defines
+**default argument values**, spaCy is able to auto-fill your config when you run
+[`init config`](/api/cli#init-config).
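
The idea of filling in missing settings from default argument values can be sketched with the standard library's `inspect` module. The schedule function, its defaults and the helper `fill_block` below are all hypothetical and only illustrate the principle; this is not spaCy's implementation of `init config`.

```python
import inspect


def my_custom_schedule(start: int = 1, factor: float = 1.005):
    """Hypothetical registered function with default argument values."""
    while True:
        yield start
        start = start * factor


def fill_block(func, block):
    """Return a copy of `block` with missing arguments filled from `func`'s defaults."""
    filled = dict(block)
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue  # explicitly set in the config, keep as-is
        if param.default is inspect.Parameter.empty:
            raise ValueError(f"Missing required setting: {name}")
        filled[name] = param.default
    return filled


# Only "start" is set, so "factor" is filled in from the function's default.
print(fill_block(my_custom_schedule, {"start": 2}))
# {'start': 2, 'factor': 1.005}
```

After filling, every argument of the function is represented in the block, which is what keeps the config free of hidden defaults.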

 <!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->

@@ -405,7 +489,7 @@ using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
 function provides type hints, the values that are passed in will be checked
 against the expected types. For example, `start: int` in the example above will
 ensure that the value received as the argument `start` is an integer. If the
-value can't be cast to an integer, spaCy will raise an error.
+value can't be coerced into an integer, spaCy will raise an error.
 `start: pydantic.StrictInt` will force the value to be an integer and raise an
 error if it's not – for instance, if your config defines a float.
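
The difference between a plain `int` hint and `pydantic.StrictInt` can be tried out directly with `pydantic`, outside of spaCy. The following is a minimal sketch, assuming `pydantic` is installed; the model classes and the `accepts` helper are hypothetical names used only for illustration.

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LooseArgs(BaseModel):
    start: int  # values are coerced into an integer where possible


class StrictArgs(BaseModel):
    start: StrictInt  # only actual integers are accepted


def accepts(model, value):
    """Return True if `value` validates as the model's `start` argument."""
    try:
        model(start=value)
    except ValidationError:
        return False
    return True


print(accepts(LooseArgs, "10"))   # True: the string can be coerced into 10
print(accepts(LooseArgs, "ten"))  # False: no coercion is possible
print(accepts(StrictArgs, 10))    # True: already an integer
print(accepts(StrictArgs, 1.5))   # False: a float is rejected
```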