Update docs

Ines Montani 2020-08-05 15:00:54 +02:00
parent ab5ef37abb
commit cdec46493f
4 changed files with 180 additions and 28 deletions


@ -363,7 +363,7 @@ that take a `Doc` object, modify it and return it. Only one of `before`,
<Infobox title="Changed in v3.0" variant="warning">
As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't
take callables anymore and instead expects the name of a component factory
take callables anymore and instead expects the **name of a component factory**
registered using [`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory). It now takes care of creating the
component, adds it to the pipeline and returns it.
@ -379,20 +379,25 @@ component, adds it to the pipeline and returns it.
>
> nlp.add_pipe("component", before="ner")
> component = nlp.add_pipe("component", name="custom_name", last=True)
>
> # Add component from source model
> source_nlp = spacy.load("en_core_web_sm")
> nlp.add_pipe("ner", source=source_nlp)
> ```
| Name | Type | Description |
| -------------------------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `factory_name` | str | Name of the registered component factory. |
| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
| _keyword-only_ | | |
| `before` | str / int | Component name or index to insert component directly before. |
| `after` | str / int | Component name or index to insert component directly after. |
| `first` | bool | Insert component first / not first in the pipeline. |
| `last` | bool | Insert component last / not last in the pipeline. |
| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
| **RETURNS** <Tag variant="new">3</Tag> | callable | The pipeline component. |
| Name | Type | Description |
| -------------------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `factory_name` | str | Name of the registered component factory. |
| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
| _keyword-only_ | | |
| `before` | str / int | Component name or index to insert component directly before. |
| `after` | str / int | Component name or index to insert component directly after. |
| `first` | bool | Insert component first / not first in the pipeline. |
| `last` | bool | Insert component last / not last in the pipeline. |
| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
| `source` <Tag variant="new">3</Tag> | `Language` | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. |
| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
| **RETURNS** <Tag variant="new">3</Tag> | callable | The pipeline component. |
## Language.has_factory {#has_factory tag="classmethod" new="3"}


@ -4,6 +4,7 @@ menu:
- ['spacy', 'spacy']
- ['displacy', 'displacy']
- ['registry', 'registry']
- ['Loaders & Batchers', 'loaders-batchers']
- ['Data & Alignment', 'gold']
- ['Utility Functions', 'util']
---
@ -34,6 +35,7 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).
| Name | Type | Description |
| ------------------------------------------ | ----------------- | --------------------------------------------------------------------------------- |
| `name` | str / `Path` | Model to load, i.e. package name or path. |
| _keyword-only_ | | |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">3</Tag> | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names. |
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
@ -83,11 +85,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
> markdown = spacy.info(markdown=True, silent=True)
> ```
| Name | Type | Description |
| ---------- | ---- | ------------------------------------------------ |
| `model` | str | A model, i.e. a package name or path (optional). |
| `markdown` | bool | Print information as Markdown. |
| `silent` | bool | Don't print anything, just return. |
| Name | Type | Description |
| -------------- | ---- | ------------------------------------------------ |
| `model` | str | A model, i.e. a package name or path (optional). |
| _keyword-only_ | | |
| `markdown` | bool | Print information as Markdown. |
| `silent` | bool | Don't print anything, just return. |
### spacy.explain {#spacy.explain tag="function"}
@ -331,6 +334,10 @@ See the [`Transformer`](/api/transformer) API reference and
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
## Training data loaders and batchers {#loaders-batchers new="3"}
<!-- TODO: -->
## Training data and alignment {#gold source="spacy/gold"}
### gold.docs_to_json {#docs_to_json tag="function"}


@ -311,6 +311,62 @@ nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)
```
### Sourcing pipeline components from existing models {#sourced-components new="3"}
Pipeline components that are independent can also be reused across models.
Instead of adding a new blank component to a pipeline, you can also copy an
existing component from a pretrained model by setting the `source` argument on
[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
interpreted as the name of the component in the source pipeline, for instance
`"ner"`. This is especially useful for
[training a model](/usage/training#config-components) because it lets you mix
and match components and create fully custom model packages with updated
pretrained components and new components trained on your data.
<Infobox variant="warning" title="Important note for pretrained components">
When reusing components across models, keep in mind that the **vocabulary**,
**vectors** and model settings **must match**. If a pretrained model includes
[word vectors](/usage/vectors-embeddings) and the component uses them as
features, the model you copy it to needs to have the _same_ vectors available.
Otherwise, it won't be able to make the same predictions.
</Infobox>
> #### In training config
>
> Instead of providing a `factory`, component blocks in the training
> [config](/usage/training#config) can also define a `source`. The string needs
> to be a loadable spaCy model package or path.
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```
>
> By default, sourced components will be updated with your data during training.
> If you want to preserve the component as-is, you can "freeze" it:
>
> ```ini
> [training]
> frozen_components = ["ner"]
> ```
```python
### {executable="true"}
import spacy
# The source model with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)
# Add only the entity recognizer to the new blank model
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)
```
### Analyzing pipeline components {#analysis new="3"}
The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the


@ -149,12 +149,14 @@ not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:
| Section | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------- |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| Section | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| `paths` | Paths to data and other assets. Can be re-used across the config as variables, e.g. `${paths:train}`, and [overwritten](#config-overrides) on the CLI. |
| `system` | Settings related to system and hardware. |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
<Infobox title="Config format and settings" emoji="📖">
@ -168,7 +170,7 @@ available for the different architectures are documented with the
</Infobox>
#### Overwriting config settings on the command line {#config-overrides}
### Overwriting config settings on the command line {#config-overrides}
The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
@ -192,7 +194,87 @@ of the training, the final filled `config.cfg` is exported with your model, so
you'll always have a record of the settings that were used, including your
overrides.
#### Using registered functions {#config-functions}
### Defining pipeline components {#config-components}
When you train a model, you typically train a
[pipeline](/usage/processing-pipelines) of **one or more components**. The
`[components]` block in the config defines the available pipeline components and
how they should be created, either by a built-in or custom
[factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
pretrained model. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline. There are different ways you might want to treat
your components during training, and the most common scenarios are:
1. Train a **new component** from scratch on your data.
2. Update an existing **pretrained component** with more examples.
3. Include an existing pretrained component without updating it.
4. Include a non-trainable component, like a rule-based
[`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
fully [custom component](/usage/processing-pipelines#custom-components).
If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing pretrained model, with its
existing weights. This lets you include an already trained component in your
model pipeline, or update a pretrained component with more data specific to
your use case.
```ini
### config.cfg (excerpt)
[components]
# "parser" and "ner" are sourced from a pretrained model
[components.parser]
source = "en_core_web_sm"
[components.ner]
source = "en_core_web_sm"
# "textcat" and "custom" are created blank from built-in / custom factory
[components.textcat]
factory = "textcat"
[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```
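As a rough illustration of the two kinds of component blocks, the excerpt above can be inspected with Python's standard library. This is only a sketch under stated assumptions: it treats the config as plain INI with JSON-encoded values, it uses `configparser` rather than spaCy's own config loader, and the helper name `describe_components` is hypothetical.

```python
import configparser
import json

# The config excerpt from above, inlined so the example is self-contained.
CONFIG = """
[components]

[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
"""


def describe_components(config_text):
    """Map each component name to (kind, value, remaining settings).

    Hypothetical helper for illustration only, not part of spaCy.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    result = {}
    for section in parser.sections():
        # Only sub-blocks like [components.ner] describe a component.
        if not section.startswith("components."):
            continue
        name = section.split(".", 1)[1]
        # Assumption: every value is valid JSON ("text", true, 1, ...).
        settings = {key: json.loads(value) for key, value in parser[section].items()}
        kind = "source" if "source" in settings else "factory"
        value = settings.pop(kind)
        result[name] = (kind, value, settings)
    return result


components = describe_components(CONFIG)
for name, (kind, value, settings) in components.items():
    print(name, kind, value, settings)
```

Whatever is left in a block after `factory` is removed corresponds to the settings that would be passed on to the component factory as arguments.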
The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights. For sourced components, spaCy will keep the existing
weights and [resume training](/api/language#resume_training).
If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained model
as-is.
> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during training and evaluation. This is very important, because
> they may still impact your model's performance. For instance, a sentence
> boundary detector can impact what the parser or entity recognizer considers a
> valid parse. So the evaluation results should always reflect what your model
> will produce at runtime.
```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]
[training]
frozen_components = ["parser", "custom"]
```
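To make the relationship between `pipeline` and `frozen_components` concrete, here is a minimal plain-Python sketch. It does not call spaCy and does not reflect how spaCy implements freezing. The function name `split_components` is hypothetical, and the sketch cannot tell whether a component is actually trainable, so the first list only contains candidates for updating.

```python
def split_components(pipeline, frozen_components):
    """Return (update candidates, frozen) names, preserving pipeline order.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    unknown = [name for name in frozen_components if name not in pipeline]
    if unknown:
        raise ValueError(f"Frozen components not in pipeline: {unknown}")
    frozen = [name for name in pipeline if name in frozen_components]
    candidates = [name for name in pipeline if name not in frozen_components]
    return candidates, frozen


# Values taken from the config excerpt above.
pipeline = ["parser", "ner", "textcat", "custom"]
frozen_components = ["parser", "custom"]

candidates, frozen = split_components(pipeline, frozen_components)
print("candidates for updating:", candidates)
print("run but not updated:", frozen)
```

Every name in `pipeline` still runs during training and evaluation. The split only determines which components are eligible to have their weights changed.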
### Using registered functions {#config-functions}
The training configuration defined in the config file doesn't have to only
consist of static values. Some settings can also be **functions**. For instance,
@ -373,7 +455,9 @@ In your config, you can now reference the schedule in the
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config.
the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init config`](/api/cli#init-config).
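The idea of filling a config block from default argument values can be sketched with the standard library's `inspect` module. This is an assumption-laden illustration, not spaCy's implementation: `fill_defaults` and `example_schedule` are hypothetical names, and the real auto-fill logic may differ.

```python
import inspect


def fill_defaults(func, block):
    """Return a copy of `block` with missing settings taken from func's defaults.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    filled = dict(block)
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue
        if param.default is inspect.Parameter.empty:
            # No default available, so the config has to provide a value.
            raise ValueError(f"Missing required setting: {name}")
        filled[name] = param.default
    return filled


def example_schedule(start: int, factor: float = 1.5):
    """Hypothetical schedule: yields start, start * factor, and so on."""
    value = start
    while True:
        yield value
        value *= factor


# The block only sets `start`; `factor` is filled in from the default.
filled_block = fill_defaults(example_schedule, {"start": 2})
print(filled_block)
```

After filling, every argument of the function is represented explicitly in the block, which matches the rule that the config shouldn't have hidden defaults.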
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
@ -405,7 +489,7 @@ using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `start: int` in the example above will
ensure that the value received as the argument `start` is an integer. If the
value can't be cast to an integer, spaCy will raise an error.
value can't be coerced into an integer, spaCy will raise an error.
`start: pydantic.StrictInt` will force the value to be an integer and raise an
error if it's not, for instance if your config defines a float.