mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 18:26:30 +03:00)
Update docs
parent ab5ef37abb
commit cdec46493f
@@ -363,7 +363,7 @@ that take a `Doc` object, modify it and return it. Only one of `before`,
|
|||
<Infobox title="Changed in v3.0" variant="warning">
|
||||
|
||||
As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't
|
||||
take callables anymore and instead expects the name of a component factory
|
||||
take callables anymore and instead expects the **name of a component factory**
|
||||
registered using [`@Language.component`](/api/language#component) or
|
||||
[`@Language.factory`](/api/language#factory). It now takes care of creating the
|
||||
component, adds it to the pipeline and returns it.
|
||||
|
@@ -379,20 +379,25 @@ component, adds it to the pipeline and returns it.
|
|||
>
|
||||
> nlp.add_pipe("component", before="ner")
|
||||
> component = nlp.add_pipe("component", name="custom_name", last=True)
|
||||
>
|
||||
> # Add component from source model
|
||||
> source_nlp = spacy.load("en_core_web_sm")
|
||||
> nlp.add_pipe("ner", source=source_nlp)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `factory_name` | str | Name of the registered component factory. |
|
||||
| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
|
||||
| _keyword-only_ | | |
|
||||
| `before` | str / int | Component name or index to insert component directly before. |
|
||||
| `after`                                | str / int        | Component name or index to insert component directly after.                                                                                                  |
|
||||
| `first` | bool | Insert component first / not first in the pipeline. |
|
||||
| `last` | bool | Insert component last / not last in the pipeline. |
|
||||
| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
|
||||
| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
|
||||
| **RETURNS** <Tag variant="new">3</Tag> | callable | The pipeline component. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `factory_name` | str | Name of the registered component factory. |
|
||||
| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
|
||||
| _keyword-only_ | | |
|
||||
| `before` | str / int | Component name or index to insert component directly before. |
|
||||
| `after` | str / int | Component name or index to insert component directly after: |
|
||||
| `first` | bool | Insert component first / not first in the pipeline. |
|
||||
| `last` | bool | Insert component last / not last in the pipeline. |
|
||||
| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
|
||||
| `source` <Tag variant="new">3</Tag> | `Language` | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. |
|
||||
| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
|
||||
| **RETURNS** <Tag variant="new">3</Tag> | callable | The pipeline component. |
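The arguments in the table above can be combined as in the following minimal sketch. It assumes spaCy v3 is installed; the component name `"debug_component"` and the instance name `"custom_name"` are made up for illustration, and the built-in `"sentencizer"` factory only serves as a reference point for `before`.

```python
import spacy
from spacy.language import Language


# Register a stateless function component under a (hypothetical) factory name
@Language.component("debug_component")
def debug_component(doc):
    # A component receives a Doc, optionally modifies it and returns it
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Refer to the component by its registered name, not by the callable itself
component = nlp.add_pipe("debug_component", name="custom_name", before="sentencizer")
print(nlp.pipe_names)

# Instance names must be unique within a pipeline
try:
    nlp.add_pipe("debug_component", name="custom_name")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print("Duplicate name rejected:", duplicate_rejected)
```

Because `add_pipe` returns the component it created, the return value can be kept around for further configuration instead of looking the component up again with `nlp.get_pipe`.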
|
||||
|
||||
## Language.has_factory {#has_factory tag="classmethod" new="3"}
|
||||
|
||||
|
|
|
@@ -4,6 +4,7 @@ menu:
|
|||
- ['spacy', 'spacy']
|
||||
- ['displacy', 'displacy']
|
||||
- ['registry', 'registry']
|
||||
- ['Loaders & Batchers', 'loaders-batchers']
|
||||
- ['Data & Alignment', 'gold']
|
||||
- ['Utility Functions', 'util']
|
||||
---
|
||||
|
@@ -34,6 +35,7 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).
|
|||
| Name | Type | Description |
|
||||
| ------------------------------------------ | ----------------- | --------------------------------------------------------------------------------- |
|
||||
| `name` | str / `Path` | Model to load, i.e. package name or path. |
|
||||
| _keyword-only_ | | |
|
||||
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| `component_cfg` <Tag variant="new">3</Tag> | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names. |
|
||||
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
|
||||
|
@@ -83,11 +85,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
|
|||
> markdown = spacy.info(markdown=True, silent=True)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---- | ------------------------------------------------ |
|
||||
| `model` | str | A model, i.e. a package name or path (optional). |
|
||||
| `markdown` | bool | Print information as Markdown. |
|
||||
| `silent` | bool | Don't print anything, just return. |
|
||||
| Name | Type | Description |
|
||||
| -------------- | ---- | ------------------------------------------------ |
|
||||
| `model` | str | A model, i.e. a package name or path (optional). |
|
||||
| _keyword-only_ | | |
|
||||
| `markdown` | bool | Print information as Markdown. |
|
||||
| `silent` | bool | Don't print anything, just return. |
|
||||
|
||||
### spacy.explain {#spacy.explain tag="function"}
|
||||
|
||||
|
@@ -331,6 +334,10 @@ See the [`Transformer`](/api/transformer) API reference and
|
|||
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
||||
| [`annotation_setters`](/api/transformer#annotation_setters)  | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
||||
|
||||
## Training data loaders and batchers {#loaders-batchers new="3"}
|
||||
|
||||
<!-- TODO: -->
|
||||
|
||||
## Training data and alignment {#gold source="spacy/gold"}
|
||||
|
||||
### gold.docs_to_json {#docs_to_json tag="function"}
|
||||
|
|
|
@@ -311,6 +311,62 @@ nlp.rename_pipe("ner", "entityrecognizer")
|
|||
nlp.replace_pipe("tagger", my_custom_tagger)
|
||||
```
|
||||
|
||||
### Sourcing pipeline components from existing models {#sourced-components new="3"}
|
||||
|
||||
Pipeline components that are independent can be reused across models.
|
||||
Instead of adding a new blank component to a pipeline, you can also copy an
|
||||
existing component from a pretrained model by setting the `source` argument on
|
||||
[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
|
||||
interpreted as the name of the component in the source pipeline – for instance,
|
||||
`"ner"`. This is especially useful for
|
||||
[training a model](/usage/training#config-components) because it lets you mix
|
||||
and match components and create fully custom model packages with updated
|
||||
pretrained components and new components trained on your data.
|
||||
|
||||
<Infobox variant="warning" title="Important note for pretrained components">
|
||||
|
||||
When reusing components across models, keep in mind that the **vocabulary**,
|
||||
**vectors** and model settings **must match**. If a pretrained model includes
|
||||
[word vectors](/usage/vectors-embeddings) and the component uses them as
|
||||
features, the model you copy it to needs to have the _same_ vectors available –
|
||||
otherwise, it won't be able to make the same predictions.
|
||||
|
||||
</Infobox>
|
||||
|
||||
> #### In training config
|
||||
>
|
||||
> Instead of providing a `factory`, component blocks in the training
|
||||
> [config](/usage/training#config) can also define a `source`. The string needs
|
||||
> to be a loadable spaCy model package or path.
|
||||
>
|
||||
> ```ini
|
||||
> [components.ner]
|
||||
> source = "en_core_web_sm"
|
||||
> component = "ner"
|
||||
> ```
|
||||
>
|
||||
> By default, sourced components will be updated with your data during training.
|
||||
> If you want to preserve the component as-is, you can "freeze" it:
|
||||
>
|
||||
> ```ini
|
||||
> [training]
|
||||
> frozen_components = ["ner"]
|
||||
> ```
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
import spacy
|
||||
|
||||
# The source model with different components
|
||||
source_nlp = spacy.load("en_core_web_sm")
|
||||
print(source_nlp.pipe_names)
|
||||
|
||||
# Add only the entity recognizer to the new blank model
|
||||
nlp = spacy.blank("en")
|
||||
nlp.add_pipe("ner", source=source_nlp)
|
||||
print(nlp.pipe_names)
|
||||
```
|
||||
|
||||
### Analyzing pipeline components {#analysis new="3"}
|
||||
|
||||
The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
|
||||
|
|
|
@@ -149,12 +149,14 @@ not just define static settings, but also construct objects like architectures,
|
|||
schedules, optimizers or any other custom components. The main top-level
|
||||
sections of a config file are:
|
||||
|
||||
| Section | Description |
|
||||
| ------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `training` | Settings and controls for the training and evaluation process. |
|
||||
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
|
||||
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
|
||||
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
|
||||
| Section | Description |
|
||||
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
|
||||
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
|
||||
| `paths` | Paths to data and other assets. Can be re-used across the config as variables, e.g. `${paths:train}`, and [overwritten](#config-overrides) on the CLI. |
|
||||
| `system` | Settings related to system and hardware. |
|
||||
| `training` | Settings and controls for the training and evaluation process. |
|
||||
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
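The `${paths:train}` variable syntax mentioned for the `paths` section is the same `${section:key}` form that Python's standard-library `configparser` resolves with `ExtendedInterpolation`. The sketch below only illustrates how such a reference resolves and how overwriting the path changes every setting that uses it. It is not spaCy's own config loader, and the `train_corpus` setting and the file paths are hypothetical.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Hypothetical config excerpt: one path defined once, referenced elsewhere
CONFIG = """
[paths]
train = ./corpus/train.spacy

[training]
train_corpus = ${paths:train}
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)

# The variable is resolved when the value is read
resolved = parser["training"]["train_corpus"]
print(resolved)

# Overwriting the path (as an override would) updates every reference to it
parser["paths"]["train"] = "./other/train.spacy"
overridden = parser["training"]["train_corpus"]
print(overridden)
```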
|
||||
|
||||
<Infobox title="Config format and settings" emoji="📖">
|
||||
|
||||
|
@@ -168,7 +170,7 @@ available for the different architectures are documented with the
|
|||
|
||||
</Infobox>
|
||||
|
||||
#### Overwriting config settings on the command line {#config-overrides}
|
||||
### Overwriting config settings on the command line {#config-overrides}
|
||||
|
||||
The config system means that you can define all settings **in one place** and in
|
||||
a consistent format. There are no command-line arguments that need to be set,
|
||||
|
@@ -192,7 +194,87 @@ of the training, the final filled `config.cfg` is exported with your model, so
|
|||
you'll always have a record of the settings that were used, including your
|
||||
overrides.
|
||||
|
||||
#### Using registered functions {#config-functions}
|
||||
### Defining pipeline components {#config-components}
|
||||
|
||||
When you train a model, you typically train a
|
||||
[pipeline](/usage/processing-pipelines) of **one or more components**. The
|
||||
`[components]` block in the config defines the available pipeline components and
|
||||
how they should be created – either by a built-in or custom
|
||||
[factory](/usage/processing-pipelines#built-in), or
|
||||
[sourced](/usage/processing-pipelines#sourced-components) from an existing
|
||||
pretrained model. For example, `[components.parser]` defines the component named
|
||||
`"parser"` in the pipeline. There are different ways you might want to treat
|
||||
your components during training, and the most common scenarios are:
|
||||
|
||||
1. Train a **new component** from scratch on your data.
|
||||
2. Update an existing **pretrained component** with more examples.
|
||||
3. Include an existing pretrained component without updating it.
|
||||
4. Include a non-trainable component, like a rule-based
|
||||
[`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
|
||||
fully [custom component](/usage/processing-pipelines#custom-components).
|
||||
|
||||
If a component block defines a `factory`, spaCy will look it up in the
|
||||
[built-in](/usage/processing-pipelines#built-in) or
|
||||
[custom](/usage/processing-pipelines#custom-components) components and create a
|
||||
new component from scratch. All settings defined in the config block will be
|
||||
passed to the component factory as arguments. This lets you configure the model
|
||||
settings and hyperparameters. If a component block defines a `source`, the
|
||||
component will be copied over from an existing pretrained model, with its
|
||||
existing weights. This lets you include an already trained component in your
|
||||
model pipeline, or update a pretrained component with more data specific to
|
||||
your use case.
|
||||
|
||||
```ini
|
||||
### config.cfg (excerpt)
|
||||
[components]
|
||||
|
||||
# "parser" and "ner" are sourced from pretrained model
|
||||
[components.parser]
|
||||
source = "en_core_web_sm"
|
||||
|
||||
[components.ner]
|
||||
source = "en_core_web_sm"
|
||||
|
||||
# "textcat" and "custom" are created blank from built-in / custom factory
|
||||
[components.textcat]
|
||||
factory = "textcat"
|
||||
|
||||
[components.custom]
|
||||
factory = "your_custom_factory"
|
||||
your_custom_setting = true
|
||||
```
|
||||
|
||||
The `pipeline` setting in the `[nlp]` block lists the components that are added
to the pipeline, in order. For example, `"parser"` here references
|
||||
`[components.parser]`. By default, spaCy will **update all components that can
|
||||
be updated**. Trainable components that are created from scratch are initialized
|
||||
with random weights. For sourced components, spaCy will keep the existing
|
||||
weights and [resume training](/api/language#resume_training).
|
||||
|
||||
If you don't want a component to be updated, you can **freeze** it by adding it
|
||||
to the `frozen_components` list in the `[training]` block. Frozen components are
|
||||
**not updated** during training and are included in the final trained model
|
||||
as-is.
|
||||
|
||||
> #### Note on frozen components
|
||||
>
|
||||
> Even though frozen components are not **updated** during training, they will
|
||||
> still **run** during training and evaluation. This is very important, because
|
||||
> they may still impact your model's performance – for instance, a sentence
|
||||
> boundary detector can impact what the parser or entity recognizer considers a
|
||||
> valid parse. So the evaluation results should always reflect what your model
|
||||
> will produce at runtime.
|
||||
|
||||
```ini
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["parser", "ner", "textcat", "custom"]
|
||||
|
||||
[training]
|
||||
frozen_components = ["parser", "custom"]
|
||||
```
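To make the interplay of `factory`, `source` and `frozen_components` concrete, the following sketch reads a config like the excerpts above with the standard library and reports how each component in the pipeline would be treated. It is an illustration only and does not use spaCy's config system; the `plan` dictionary and the `"sourced"` / `"factory"` / `"frozen"` / `"updated"` labels are made up for this example.

```python
import json
from configparser import ConfigParser

CONFIG = """
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true

[training]
frozen_components = ["parser", "custom"]
"""

parser = ConfigParser()
parser.read_string(CONFIG)

# Values in the config are written as JSON, so lists can be parsed with json
pipeline = json.loads(parser["nlp"]["pipeline"])
frozen = set(json.loads(parser["training"]["frozen_components"]))

plan = {}
for name in pipeline:
    block = parser[f"components.{name}"]
    # A block either copies a trained component or creates a new one
    origin = "sourced" if "source" in block else "factory"
    status = "frozen" if name in frozen else "updated"
    plan[name] = (origin, status)

for name, (origin, status) in plan.items():
    print(f"{name}: {origin}, {status}")
```

In this setup, `"ner"` is the only sourced component whose existing weights are updated further, while `"textcat"` is trained from scratch.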
|
||||
|
||||
### Using registered functions {#config-functions}
|
||||
|
||||
The training configuration defined in the config file doesn't have to consist
only of static values. Some settings can also be **functions**. For instance,
|
||||
|
@@ -373,7 +455,9 @@ In your config, you can now reference the schedule in the
|
|||
starting with an `@`, it's interpreted as a reference to a function. All other
|
||||
settings in the block will be passed to the function as keyword arguments. Keep
|
||||
in mind that the config shouldn't have any hidden defaults and all arguments on
|
||||
the functions need to be represented in the config.
|
||||
the functions need to be represented in the config. If your function defines
|
||||
**default argument values**, spaCy is able to auto-fill your config when you run
|
||||
[`init config`](/api/cli#init-config).
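As a rough illustration of what auto-filling from default argument values can look like, the sketch below reads the defaults from a function signature with `inspect` and adds any that are missing from a config block. This is an assumption about the general idea, not spaCy's implementation; `fill_defaults` and the `my_custom_schedule` function with its default values are hypothetical.

```python
import inspect


def my_custom_schedule(start: int = 1, factor: float = 1.001):
    # Hypothetical schedule: yields a slowly growing value
    while True:
        yield start
        start = start * factor


def fill_defaults(func, block):
    """Return a copy of block with the function's default values added."""
    filled = dict(block)
    for name, param in inspect.signature(func).parameters.items():
        has_default = param.default is not inspect.Parameter.empty
        if name not in filled and has_default:
            filled[name] = param.default
    return filled


# The config block only sets "start", so "factor" is filled from the default
filled = fill_defaults(my_custom_schedule, {"start": 2})
print(filled)
```

Once filled this way, the config states every argument explicitly, so it carries no hidden defaults.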
|
||||
|
||||
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
|
||||
|
||||
|
@@ -405,7 +489,7 @@ using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
|
|||
function provides type hints, the values that are passed in will be checked
|
||||
against the expected types. For example, `start: int` in the example above will
|
||||
ensure that the value received as the argument `start` is an integer. If the
|
||||
value can't be cast to an integer, spaCy will raise an error.
|
||||
value can't be coerced into an integer, spaCy will raise an error.
|
||||
`start: pydantic.StrictInt` will force the value to be an integer and raise an
|
||||
error if it's not – for instance, if your config defines a float.
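The difference between a plain `int` hint and `pydantic.StrictInt` can be tried out directly with pydantic, independently of spaCy. The following is a minimal sketch that assumes pydantic is installed; the `LaxArgs` and `StrictArgs` models are hypothetical stand-ins for the argument validation of a registered function. How a plain `int` field treats floats with a fractional part differs between pydantic versions, so the example uses a numeric string for the coercion case.

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LaxArgs(BaseModel):
    start: int  # values that can be coerced into an integer are accepted


class StrictArgs(BaseModel):
    start: StrictInt  # only real integers are accepted


# The string "10" is coerced into the integer 10
lax = LaxArgs(start="10")
print(lax.start)

# A float is rejected by the strict type, even if it has no fractional part
try:
    StrictArgs(start=1.0)
    strict_error = False
except ValidationError:
    strict_error = True
print("Strict validation failed:", strict_error)
```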
|
||||
|
||||