mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 09:57:26 +03:00 
			
		
		
		
	* Add info to sections * Update website/docs/usage/training.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
		
			
				
	
	
		
			
		
		
	
	
---
title: Training Pipelines & Models
teaser: Train and update components on your own data and integrate custom models
next: /usage/layers-architectures
menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
  - ['Data Utilities', 'data']
  - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']
---

## Introduction to training {id="basics",hidden="true"}

<Training101 />

<Infobox title="Tip: Try the Prodigy annotation tool">

<Image
  src="/images/prodigy.jpg"
  href="https://prodi.gy"
  alt="Prodigy: Radically efficient machine teaching"
/>

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy pipelines.

</Infobox>

## Quickstart {id="quickstart",tag="new"}

The recommended way to train your spaCy pipelines is via the
[`spacy train`](/api/cli#train) command on the command line. It only needs a
single [`config.cfg`](#config) **configuration file** that includes all settings
and hyperparameters. You can optionally [overwrite](#config-overrides) settings
on the command line, and load in a Python file to register
[custom functions](#custom-code) and architectures. This quickstart widget helps
you generate a starter config with the **recommended settings** for your
specific use case. It's also available in spaCy as the
[`init config`](/api/cli#init-config) command.

<Infobox variant="warning">

Upgrade to the [latest version of spaCy](/usage) to use the quickstart widget.
For earlier releases, follow the CLI instructions to generate a compatible
config.

</Infobox>

> #### Instructions: widget
>
> 1. Select your requirements and settings.
> 2. Use the buttons at the bottom to save the result to your clipboard or a
>    file `base_config.cfg`.
> 3. Run [`init fill-config`](/api/cli#init-fill-config) to create a full
>    config.
> 4. Run [`train`](/api/cli#train) with your config and data.
>
> #### Instructions: CLI
>
> 1. Run the [`init config`](/api/cli#init-config) command and specify your
>    requirements and settings as CLI arguments.
> 2. Run [`train`](/api/cli#train) with the exported config and data.

<QuickstartTraining />

After you've saved the starter config to a file `base_config.cfg`, you can use
the [`init fill-config`](/api/cli#init-fill-config) command to fill in the
remaining defaults. Training configs should always be **complete and without
hidden defaults**, to keep your experiments reproducible.

```bash
$ python -m spacy init fill-config base_config.cfg config.cfg
```

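Conceptually, filling a config means layering your partial settings on top of
the full set of defaults, so that every value ends up written out explicitly.
Here's a rough, standard-library-only sketch of that idea. The `fill_defaults`
helper and the example values are hypothetical and only meant as an
illustration. This is not how spaCy implements `init fill-config`, which also
validates the result.

```python
from copy import deepcopy


def fill_defaults(partial: dict, defaults: dict) -> dict:
    """Return a copy of `defaults` with the values from `partial` on top.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    filled = deepcopy(defaults)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(filled.get(key), dict):
            # Recurse into nested sections, e.g. [training.optimizer]
            filled[key] = fill_defaults(value, filled[key])
        else:
            filled[key] = deepcopy(value)
    return filled


# Made-up defaults and starter config, not spaCy's real values
defaults = {"training": {"dropout": 0.1, "max_steps": 20000, "seed": 0}}
base_config = {"training": {"dropout": 0.2}}

config = fill_defaults(base_config, defaults)
print(config)
```

The filled result contains every setting, including the ones the starter config
never mentioned, which is what makes a run reproducible from the file alone.
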
> #### Tip: Debug your data
>
> The [`debug data` command](/api/cli#debug-data) lets you analyze and validate
> your training and development data, get useful stats, and find problems like
> invalid entity annotations, cyclic dependencies, labels with too few examples
> and more.
>
> ```bash
> $ python -m spacy debug data config.cfg
> ```

Instead of exporting your starter config from the quickstart widget and
auto-filling it, you can also use the [`init config`](/api/cli#init-config)
command and specify your requirements and settings as CLI arguments. You can now
add your data and run [`train`](/api/cli#train) with your config. See the
[`convert`](/api/cli#convert) command for details on how to convert your data to
spaCy's binary `.spacy` format. You can either include the data paths in the
`[paths]` section of your config, or pass them in via the command line.

```bash
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

> #### Tip: Enable your GPU
>
> Use the `--gpu-id` option to select the GPU:
>
> ```bash
> $ python -m spacy train config.cfg --gpu-id 0
> ```

<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>

The recommended config settings generated by the quickstart widget and the
[`init config`](/api/cli#init-config) command are based on some general **best
practices** and things we've found to work well in our experiments. The goal is
to provide you with the most **useful defaults**.

Under the hood, the
[`quickstart_training.jinja`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training.jinja)
template defines the different combinations – for example, which parameters to
change if the pipeline should optimize for efficiency vs. accuracy. The file
[`quickstart_training_recommendations.yml`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training_recommendations.yml)
collects the recommended settings and available resources for each language
including the different transformer weights. For some languages, we include
different transformer recommendations, depending on whether you want the model
to be more efficient or more accurate. The recommendations will be **evolving**
as we run more experiments.

</Accordion>

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started is to clone a [project template](/usage/projects)
and run it – for example, this end-to-end template that lets you train a
**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
treebank.

</Project>

## Training config system {id="config"}

Training config files include all **settings and hyperparameters** for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
Under the hood, the training config uses the
[configuration system](https://thinc.ai/docs/usage-config) provided by our
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
  functions like [model architectures](/api/architectures),
  [optimizers](https://thinc.ai/docs/api-optimizers) or
  [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
  passed into them. You can also
  [register your own functions](#custom-functions) to define custom
  architectures or methods, reference them in your config and tweak their
  parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
  multiple components, define them once and reference them as
  [variables](#config-interpolation).
- **Reproducibility with no hidden defaults.** The config file is the "single
  source of truth" and includes all settings.
- **Automated checks and validation.** When you load a config, spaCy checks if
  the settings are complete and if all values have the correct types. This lets
  you catch potential mistakes early. In your custom architectures, you can use
  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

```ini
%%GITHUB_SPACY/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                                                                     |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names.                                           |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                                                                         |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI.          |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process.                                                                                                  |
| `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining).                                                |
| `initialize`  | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime).             |

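To make the mapping from dotted section names to a nested dictionary more
concrete, here's a minimal sketch that only uses Python's standard library. It
assumes that values can be read as JSON, and the `config_to_dict` and
`parse_value` helpers are hypothetical names for this illustration. spaCy
itself relies on Thinc's config system, which additionally handles variables,
registered functions and validation.

```python
import json
from configparser import ConfigParser

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["parser", "ner"]

[training]
seed = 0

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
"""


def parse_value(raw: str):
    """Interpret a raw config value as JSON, falling back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def config_to_dict(text: str) -> dict:
    """Hypothetical helper: turn dotted sections into a nested dict."""
    parser = ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep keys such as "L2" case-sensitive
    parser.read_string(text)
    result: dict = {}
    for section in parser.sections():
        node = result
        # "training.optimizer" becomes result["training"]["optimizer"]
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser.items(section):
            node[key] = parse_value(raw)
    return result


config = config_to_dict(CONFIG_TEXT)
print(config["training"]["optimizer"])
```

Note that the `@optimizers` key is kept as plain data here. Resolving it into
an actual optimizer object is a separate step that this sketch doesn't cover.
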
<Infobox title="Config format and settings" emoji="📖">

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
[schedules](https://thinc.ai/docs/api-schedules).

</Infobox>

<YouTube id="BWhh3r6W-qE"></YouTube>

### Config lifecycle at runtime and training {id="config-lifecycle"}

A pipeline's `config.cfg` is considered the "single source of truth", both at
**training** and **runtime**. Under the hood,
[`Language.from_config`](/api/language#from_config) takes care of constructing
the `nlp` object using the settings defined in the config. An `nlp` object's
config is available as [`nlp.config`](/api/language#config) and it includes all
information about the pipeline, as well as the settings used to train and
initialize it.

At runtime spaCy will only use the `[nlp]` and `[components]` blocks of the
config and load all data, including tokenization rules, model weights and other
resources from the pipeline directory. The `[training]` block contains the
settings for training the model and is only used during training. Similarly, the
`[initialize]` block defines how the initial `nlp` object should be set up
before training and whether it should be initialized with vectors or pretrained
tok2vec weights, or any other data needed by the components.

The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config – but without
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.

### Overwriting config settings on the command line {id="config-overrides"}

The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--paths.train ./corpus/train.spacy` sets the `train` value in the `[paths]`
block.

```bash
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128
```

Only existing sections and values in the config can be overwritten. At the end
of the training, the final filled `config.cfg` is exported with your pipeline,
so you'll always have a record of the settings that were used, including your
overrides. Overrides are added before [variables](#config-interpolation) are
resolved, by the way – so if you need to use a value in multiple places,
reference it across your config and override it on the CLI once.

> #### 💡 Tip: Verbose logging
>
> If you're using config overrides, you can set the `--verbose` flag on
> [`spacy train`](/api/cli#train) to make spaCy log more info, including which
> overrides were set via the CLI and environment variables.

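The following sketch illustrates how dotted override options can be mapped
onto a nested config, including the rule that only existing settings can be
overwritten. The `parse_overrides` and `apply_overrides` helpers are
hypothetical and written for this example only. They are not spaCy's internal
functions, and the JSON-based value parsing is an assumption.

```python
import json


def parse_overrides(args: list) -> dict:
    """Turn ["--paths.train", "./x"] into {"paths.train": "./x"}."""
    overrides = {}
    # Arguments come in pairs: "--section.key" followed by its value
    for name, raw in zip(args[0::2], args[1::2]):
        if not name.startswith("--"):
            raise ValueError(f"Expected an option starting with '--': {name!r}")
        try:
            value = json.loads(raw)  # numbers, booleans, lists, ...
        except json.JSONDecodeError:
            value = raw  # plain strings such as file paths
        overrides[name[2:]] = value
    return overrides


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Apply dotted overrides in place, rejecting unknown settings."""
    for dotted, value in overrides.items():
        *parents, key = dotted.split(".")
        node = config
        for part in parents:
            if part not in node:
                raise KeyError(f"Can't override unknown section: {dotted}")
            node = node[part]
        if key not in node:
            raise KeyError(f"Can't override unknown setting: {dotted}")
        node[key] = value
    return config


# Made-up config values for illustration
config = {"paths": {"train": None, "dev": None}, "training": {"batch_size": 1000}}
args = ["--paths.train", "./corpus/train.spacy", "--training.batch_size", "128"]
apply_overrides(config, parse_overrides(args))
print(config)
```
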
#### Adding overrides via environment variables {id="config-overrides-env"}

Instead of defining the overrides as CLI arguments, you can also use the
`SPACY_CONFIG_OVERRIDES` environment variable using the same argument syntax.
This is especially useful if you're training models as part of an automated
process. Environment variables **take precedence** over CLI overrides and values
defined in the config file.

```bash
$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.batch_size 128" ./your_script.sh
```

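Here's a small sketch of the precedence rule described above: overrides from
the environment variable are split like shell arguments and then layered on
top of the CLI overrides. The `collect_overrides` helper is a hypothetical name
for this illustration, and values are kept as plain strings to keep the example
short.

```python
import os
import shlex


def to_dict(args: list) -> dict:
    """Pair up ["--a.b", "value", ...] into {"a.b": "value", ...}."""
    return {name[2:]: value for name, value in zip(args[0::2], args[1::2])}


def collect_overrides(cli_args: list, env=os.environ) -> dict:
    """Hypothetical helper: merge CLI and environment overrides."""
    env_args = shlex.split(env.get("SPACY_CONFIG_OVERRIDES", ""))
    merged = to_dict(cli_args)
    # Environment variables take precedence over CLI overrides
    merged.update(to_dict(env_args))
    return merged


# Pass a plain dict instead of os.environ to keep the example reproducible
fake_env = {
    "SPACY_CONFIG_OVERRIDES": "--system.gpu_allocator pytorch --training.batch_size 128"
}
overrides = collect_overrides(["--training.batch_size", "64"], env=fake_env)
print(overrides)
```
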
### Reading from standard input {id="config-stdin"}

Setting the config path to `-` on the command line lets you read the config from
standard input and pipe it forward from a different process, like
[`init config`](/api/cli#init-config) or your own custom script. This is
especially useful for quick experiments, as it lets you generate a config on the
fly without having to save to and load from disk.

> #### 💡 Tip: Writing to stdout
>
> When you run `init config`, you can set the output path to `-` to write to
> stdout. In a custom script, you can print the string config, e.g.
> `print(nlp.config.to_str())`.

```bash
$ python -m spacy init config - --lang en --pipeline ner,textcat --optimize accuracy | python -m spacy train - --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

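If you want to support the same `-` convention in your own script, the logic
can be as small as the sketch below. The `read_config_text` function is a
hypothetical helper for illustration, and the in-memory stream only simulates
piped input.

```python
import io
import sys
from pathlib import Path


def read_config_text(path, stdin=None) -> str:
    """Return the config text from a file, or from stdin if the path is "-"."""
    if str(path) == "-":
        stream = stdin if stdin is not None else sys.stdin
        return stream.read()
    return Path(path).read_text(encoding="utf8")


# Simulate a config that's piped in from another process
piped = io.StringIO('[nlp]\nlang = "en"\n')
text = read_config_text("-", stdin=piped)
print(text)
```
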
### Using variable interpolation {id="config-interpolation"}

Another very useful feature of the config system is that it supports variable
interpolation for both **values and sections**. This means that you only need to
define a setting once and can reference it across your config using the
`${section.value}` syntax. In this example, the value of `seed` is reused within
the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`.

```ini {title="config.cfg (excerpt)",highlight="5,18"}
[system]
seed = 0

[training]
seed = ${system.seed}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8

[pretraining]
optimizer = ${training.optimizer}
```

You can also use variables inside strings. In that case, it works just like
f-strings in Python. If the value of a variable is not a string, it's converted
to a string.

```ini
[paths]
version = 5
root = "/Users/you/data"
train = "${paths.root}/train_${paths.version}.spacy"
# Result: /Users/you/data/train_5.spacy
```

<Infobox title="Tip: Override variables on the CLI" emoji="💡">

If you need to change certain values between training runs, you can define them
once, reference them as variables and then [override](#config-overrides) them on
the CLI. For example, `--paths.root /other/root` will change the value of `root`
in the block `[paths]` and the change will be reflected across all other values
that reference this variable.

</Infobox>

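To illustrate how `${section.value}` references behave, here's a simplified
resolver written in plain Python. It's only a sketch under a few assumptions:
the config is already a nested dict, there are no circular references, and the
`lookup` and `resolve` helpers are hypothetical names. The actual resolution is
handled by Thinc's config system.

```python
import re

VARIABLE = re.compile(r"\$\{([\w.]+)\}")


def lookup(config: dict, dotted: str):
    """Follow a dotted path like "paths.root" through the nested dict."""
    node = config
    for part in dotted.split("."):
        node = node[part]
    return node


def resolve(value, config: dict):
    """Hypothetical helper: replace ${...} references in values and sections."""
    if isinstance(value, dict):
        return {key: resolve(item, config) for key, item in value.items()}
    if not isinstance(value, str):
        return value
    whole = VARIABLE.fullmatch(value)
    if whole:
        # A single reference keeps the referenced type (number, section, ...)
        return resolve(lookup(config, whole.group(1)), config)
    # References inside a longer string are converted to strings
    return VARIABLE.sub(
        lambda match: str(resolve(lookup(config, match.group(1)), config)), value
    )


config = {
    "system": {"seed": 0},
    "training": {
        "seed": "${system.seed}",
        "optimizer": {"@optimizers": "Adam.v1", "L2": 0.01},
    },
    "pretraining": {"optimizer": "${training.optimizer}"},
    "paths": {
        "version": 5,
        "root": "/Users/you/data",
        "train": "${paths.root}/train_${paths.version}.spacy",
    },
}
resolved = resolve(config, config)
print(resolved["paths"]["train"])
```
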
## Preparing Training Data {id="training-data"}

Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.

When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.

Here's an example of creating a `.spacy` file from some NER annotations.

```python {title="preprocess.py"}
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        # char_span returns None if the offsets don't match token boundaries
        if span is None:
            print(f"Skipping misaligned entity ({start}, {end}, {label!r}): {text!r}")
            continue
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```

For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).

<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>

In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.

</Accordion>

## Customizing the pipeline and training {id="config-custom"}

### Defining pipeline components {id="config-components"}

You typically train a [pipeline](/usage/processing-pipelines) of **one or more
components**. The `[components]` block in the config defines the available
pipeline components and how they should be created – either by a built-in or
custom [factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
trained pipeline. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline. There are different ways you might want to treat
your components during training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **trained component** with more examples.
3. Include an existing trained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing trained pipeline, with its
existing weights. This lets you include an already trained component in your
pipeline, or update a trained component with more data specific to your use
case.

```ini {title="config.cfg (excerpt)"}
[components]

# "parser" and "ner" are sourced from a trained pipeline
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights. For sourced components, spaCy will keep the existing
weights and [resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained pipeline
as-is. They are also excluded when calling
[`nlp.initialize`](/api/language#initialize).

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during evaluation. This is very important, because they may
> still impact your model's performance – for instance, a sentence boundary
> detector can impact what the parser or entity recognizer considers a valid
> parse. So the evaluation results should always reflect what your pipeline will
> produce at runtime. If you want a frozen component to run (without updating)
> during training as well, so that downstream components can use its
> **predictions**, you should add it to the list of
> [`annotating_components`](/usage/training#annotating-components).

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

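As a quick way to reason about these settings, the sketch below walks over a
config that has been loaded as a plain dict and reports, for each component in
the pipeline, where it comes from and whether it would be frozen. The
`describe_components` helper is hypothetical and only reads the config. It
doesn't know whether a component is actually trainable, so non-frozen
components are reported as "updated if trainable".

```python
def describe_components(config: dict) -> dict:
    """Hypothetical helper: summarize origin and training status per component."""
    frozen = set(config["training"].get("frozen_components", []))
    plan = {}
    for name in config["nlp"]["pipeline"]:
        block = config["components"][name]
        if "source" in block:
            # Copied from an existing trained pipeline, with its weights
            origin = "sourced from " + block["source"]
        else:
            # Created from scratch by a built-in or custom factory
            origin = "created by factory " + block["factory"]
        status = "frozen" if name in frozen else "updated if trainable"
        plan[name] = (origin, status)
    return plan


config = {
    "nlp": {"lang": "en", "pipeline": ["parser", "ner", "textcat", "custom"]},
    "components": {
        "parser": {"source": "en_core_web_sm"},
        "ner": {"source": "en_core_web_sm"},
        "textcat": {"factory": "textcat"},
        "custom": {"factory": "your_custom_factory", "your_custom_setting": True},
    },
    "training": {"frozen_components": ["parser", "custom"]},
}
plan = describe_components(config)
for name, (origin, status) in plan.items():
    print(f"{name}: {origin}, {status}")
```
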
<Infobox variant="warning" title="Shared Tok2Vec listener layer" id="config-components-listeners">

When the components in your pipeline
[share an embedding layer](/usage/embeddings-transformers#embedding-layers), the
**performance** of your frozen component will be **degraded** if you continue
training other layers with the same underlying `Tok2Vec` instance. As a rule of
thumb, ensure that your frozen components are truly **independent** in the
pipeline.

To automatically replace a shared token-to-vector listener with an independent
copy of the token-to-vector layer, you can use the `replace_listeners` setting
of a sourced component, pointing to the listener layer(s) in the config. For
more details on how this works under the hood, see
[`Language.replace_listeners`](/api/language#replace_listeners).

```ini
[training]
frozen_components = ["tagger"]

[components.tagger]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
```

</Infobox>

### Using predictions from preceding components {id="annotating-components",version="3.1"}

By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline. A
component receives [`Example.predicted`](/api/example) as input and compares its
predictions to [`Example.reference`](/api/example) without saving its
annotations in the `predicted` doc.

Instead, if certain components should **set their annotations** during training,
use the setting `annotating_components` in the `[training]` block to specify a
list of components. For example, the feature `DEP` from the parser could be used
as a tagger feature by including `DEP` in the tok2vec `attrs` and including
`parser` in `annotating_components`:

```ini {title="config.cfg (excerpt)",highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```

Any component in the pipeline can be included as an annotating component,
 | 
						||
including frozen components. Frozen components can set annotations during
 | 
						||
training just as they would set annotations during evaluation or when the final
 | 
						||
pipeline is run. The config excerpt below shows how a frozen `ner` component and
 | 
						||
a `sentencizer` can provide the required `doc.sents` and `doc.ents` for the
 | 
						||
entity linker during training:
 | 
						||
 | 
						||
```ini {title="config.cfg (excerpt)"}
 | 
						||
[nlp]
 | 
						||
pipeline = ["sentencizer", "ner", "entity_linker"]
 | 
						||
 | 
						||
[components.ner]
 | 
						||
source = "en_core_web_sm"
 | 
						||
 | 
						||
[training]
 | 
						||
frozen_components = ["ner"]
 | 
						||
annotating_components = ["sentencizer", "ner"]
 | 
						||
```
 | 
						||
 | 
						||
Similarly, a pretrained `tok2vec` layer can be frozen and specified in the list
 | 
						||
of `annotating_components` to ensure that a downstream component can use the
 | 
						||
embedding layer without updating it.
 | 
						||
 | 
						||
<Infobox variant="warning" title="Training speed with annotating components" id="annotating-components-speed">
 | 
						||
 | 
						||
Be aware that non-frozen annotating components with statistical models will
 | 
						||
**run twice** on each batch, once to update the model and once to apply the
 | 
						||
now-updated model to the predicted docs.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
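The interaction between frozen, trainable and annotating components can be
pictured with a small plain-Python simulation. This is only a sketch of the
control flow described above, not spaCy's training loop: the function
`count_passes` and its `trainable` argument are made up for illustration, and
no real components are involved.

```python
from typing import Dict, List, Set


def count_passes(
    pipeline: List[str],
    trainable: Set[str],
    frozen_components: List[str],
    annotating_components: List[str],
) -> Dict[str, int]:
    """Count how often each component runs on one training batch."""
    passes = {name: 0 for name in pipeline}
    for name in pipeline:
        if name in trainable and name not in frozen_components:
            # One pass to compute the loss and update the model.
            passes[name] += 1
        if name in annotating_components:
            # One more pass to write predictions into the predicted docs.
            passes[name] += 1
    return passes


# First config excerpt: the parser is updated and also annotates.
first = count_passes(["parser", "tagger"], {"parser", "tagger"}, [], ["parser"])
# Second config excerpt: a rule-based sentencizer and a frozen ner annotate.
second = count_passes(
    ["sentencizer", "ner", "entity_linker"],
    {"ner", "entity_linker"},
    ["ner"],
    ["sentencizer", "ner"],
)
print(first)
print(second)
```

In the first case only the parser is counted twice, which is the extra cost the
warning above refers to. In the second case every component is counted once,
because the frozen `ner` and the rule-based `sentencizer` are never updated.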
### Using registered functions {id="config-functions"}

The training configuration defined in the config file doesn't have to consist
only of static values. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick
(see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).

```ini {title="With static value"}
[training]
batch_size = 128
```

To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax to specify the function and its arguments – in
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
defined in the [function registry](/api/top-level#registry). All other values
defined in the block are passed to the function as keyword arguments when it's
initialized. You can also use this mechanism to register
[custom implementations and architectures](#custom-functions) and reference them
from your configs.

> #### How the config is resolved
>
> The config file is parsed into a regular dictionary and is resolved and
> validated **bottom-up**. Arguments provided for registered functions are
> checked against the function's signature and type annotations. The return
> value of a registered function can also be passed into another function – for
> instance, a learning rate schedule can be provided as an argument of an
> optimizer.

```ini {title="With registered function"}
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```

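To make the `@` syntax and the bottom-up resolution more concrete, here is a toy
resolver in plain Python. It is a sketch under stated assumptions, not the
implementation used by spaCy or Thinc: `REGISTRY`, `register`, `resolve`,
`toy_compounding.v1` and `toy_optimizer.v1` are all hypothetical names, and the
schedule only imitates the documented idea of multiplying by `compound` until
`stop` is reached.

```python
from typing import Any, Callable, Dict, Iterator

# Toy stand-in for a function registry: registry name -> function name -> function.
REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {"schedules": {}, "optimizers": {}}


def register(registry_name: str, func_name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        REGISTRY[registry_name][func_name] = func
        return func

    return decorator


@register("schedules", "toy_compounding.v1")
def toy_compounding(start: float, stop: float, compound: float) -> Iterator[float]:
    # Multiply by `compound` on every step, never exceeding `stop`.
    value = start
    while True:
        yield min(value, stop)
        value *= compound


@register("optimizers", "toy_optimizer.v1")
def toy_optimizer(learn_rate: Iterator[float]) -> Dict[str, Any]:
    # A real optimizer would be an object; a dict is enough for the sketch.
    return {"learn_rate": learn_rate}


def resolve(block: Any) -> Any:
    """Resolve nested blocks first, then call the referenced function."""
    if not isinstance(block, dict):
        return block
    resolved = {key: resolve(value) for key, value in block.items()}
    refs = [key for key in resolved if key.startswith("@")]
    if not refs:
        return resolved
    func = REGISTRY[refs[0][1:]][resolved.pop(refs[0])]
    # All remaining values in the block become keyword arguments.
    return func(**resolved)


config = {
    "training": {
        "batch_size": {"@schedules": "toy_compounding.v1", "start": 100, "stop": 1000, "compound": 1.001},
        "optimizer": {
            "@optimizers": "toy_optimizer.v1",
            "learn_rate": {"@schedules": "toy_compounding.v1", "start": 0.001, "stop": 0.01, "compound": 1.5},
        },
    }
}
training = resolve(config)["training"]
print(next(training["batch_size"]), next(training["optimizer"]["learn_rate"]))
```

The nested `learn_rate` block is resolved before the optimizer block that
contains it, so the optimizer function receives a ready-made schedule rather
than a dictionary.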
### Model architectures {id="model-architectures"}

> #### 💡 Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like ~~Model[List[Doc],
> List[Floats2d]]~~. This so-called generic type describes the layer and its
> input and output type – in this case, it takes a list of `Doc` objects as the
> input and a list of 2-dimensional arrays of floats as the output. You can read
> more about defining Thinc models [here](https://thinc.ai/docs/usage-models).
> Also see the [type checking](https://thinc.ai/docs/usage-type-checking) guide
> for how to enable linting in your editor to see live feedback if your inputs
> and outputs don't match.

A **model architecture** is a function that wires up a Thinc
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
component or as a layer of a larger network. You can use Thinc as a thin
[wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as
PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc
[directly](https://thinc.ai/docs/usage-models). For more details and examples,
see the usage guide on [layers and architectures](/usage/layers-architectures).

spaCy's built-in components will never construct their `Model` instances
themselves, so you won't have to subclass the component to change its model
architecture. You can just **update the config** so that it refers to a
different registered function. Once the component has been created, its `Model`
instance has already been assigned, so you cannot change its model architecture.
The architecture is like a recipe for the network, and you can't change the
recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:

| Architecture | Description |
| --- | --- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ |

### Metrics, training output and weighted scores {id="metrics"}

When you train a pipeline using the [`spacy train`](/api/cli#train) command,
you'll see a table showing the metrics after each pass over the data. The
available metrics **depend on the pipeline components**. Pipeline components
also define which scores are shown and how they should be **weighted in the
final score** that determines the best model.

The `training.score_weights` setting in your `config.cfg` lets you customize the
scores shown in the table and how they should be weighted. In this example, the
labeled dependency accuracy and NER F-score count towards the final score with
40% each and the tagging accuracy makes up the remaining 20%. The tokenization
accuracy and speed are both shown in the table, but not counted towards the
score.

> #### Why do I need score weights?
>
> At the end of your training process, you typically want to select the **best
> model** – but what "best" means depends on the available components and your
> specific use case. For instance, you may prefer a pipeline with higher NER and
> lower POS tagging accuracy over a pipeline with lower NER and higher POS
> accuracy. You can express this preference in the score weights, e.g. by
> assigning `ents_f` (NER F-score) a higher weight.

```ini
[training.score_weights]
dep_las = 0.4
dep_uas = null
ents_f = 0.4
tag_acc = 0.2
token_acc = 0.0
speed = 0.0
```

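As a rough illustration of how these weights combine into one number, the
following plain-Python sketch multiplies each score by its weight and sums the
results. The function `weighted_score` is a hypothetical helper written for this
example, the evaluation results are made up, and `null` from the config is
represented as `None`.

```python
from typing import Dict, Optional


def weighted_score(scores: Dict[str, float], weights: Dict[str, Optional[float]]) -> float:
    """Sum each score multiplied by its weight; `None` weights are skipped."""
    total = 0.0
    for name, weight in weights.items():
        if weight is None:
            continue
        total += weight * scores.get(name, 0.0)
    return total


weights = {
    "dep_las": 0.4,
    "dep_uas": None,
    "ents_f": 0.4,
    "tag_acc": 0.2,
    "token_acc": 0.0,
    "speed": 0.0,
}
# Made-up evaluation results, for illustration only.
scores = {
    "dep_las": 0.80,
    "dep_uas": 0.85,
    "ents_f": 0.70,
    "tag_acc": 0.95,
    "token_acc": 1.0,
    "speed": 20000.0,
}
print(round(weighted_score(scores, weights), 4))
```

With these numbers the result is `0.4 * 0.80 + 0.4 * 0.70 + 0.2 * 0.95`. The
speed and tokenization accuracy contribute nothing because their weight is
`0.0`.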
The `score_weights` don't _have to_ sum to `1.0` – but it's recommended. When
you generate a config for a given pipeline, the score weights are generated by
combining and normalizing the default score weights of the pipeline components.
The default score weights are defined by each pipeline component via the
`default_score_weights` setting on the
[`@Language.factory`](/api/language#factory) decorator. By default, all pipeline
components are weighted equally. If a score weight is set to `null`, it will be
excluded from the logs and the score won't be weighted.

<Accordion title="Understanding the training output and score types" spaced id="score-types">

| Name | Description |
| --- | --- |
| **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. |
| **Precision** (P) | Percentage of predicted annotations that were correct. Should increase. |
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Speed** | Prediction speed in words per second (WPS). Should stay stable. |

Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization
errors are **excluded from the NER evaluation**. If your tokenization makes it
impossible for the model to predict 50% of your entities, your NER F-score might
still look good.

</Accordion>

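The precision, recall and F-score definitions from the score table can be
written out in a few lines of plain Python. This is a minimal sketch that treats
annotations as sets of `(start, end, label)` tuples, which is an assumption made
for the example, not a description of spaCy's scorer. The helper `prf` and the
sample spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def prf(predicted: Set[Span], reference: Set[Span]) -> Tuple[float, float, float]:
    """Return precision, recall and their harmonic mean (F-score)."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


reference = {(0, 2, "ORG"), (5, 6, "GPE"), (9, 10, "PERSON"), (12, 14, "DATE")}
predicted = {(0, 2, "ORG"), (5, 6, "PERSON"), (12, 14, "DATE")}
p, r, f = prf(predicted, reference)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Two of the three predictions match a reference span exactly, so precision is
2/3, while only two of the four reference spans were recovered, so recall is
1/2.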
## Custom functions {id="custom-functions"}

Registered functions in the training config files can refer to built-in
implementations, but you can also plug in fully **custom implementations**. All
you need to do is register your function using the `@spacy.registry` decorator
with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.

Each custom function can have any number of arguments that are passed in via the
[config](#config), just like the built-in functions. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.

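The role of default argument values can be illustrated with Python's `inspect`
module. The sketch below is not how `init fill-config` is implemented. It only
shows the general idea that arguments with defaults can be filled in
automatically, while arguments without defaults have to come from the config.
`fill_block` and `create_logger` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, List


def fill_block(func: Callable, block: Dict[str, Any]) -> Dict[str, Any]:
    """Add defaults for missing arguments; fail if a required one is missing."""
    filled = dict(block)
    missing: List[str] = []
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue
        if param.default is inspect.Parameter.empty:
            # No default value: the config has to set this explicitly.
            missing.append(name)
        else:
            filled[name] = param.default
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return filled


def create_logger(log_path: str, verbose: bool = False) -> None:
    """Hypothetical registered function with one required argument."""


print(fill_block(create_logger, {"log_path": "train.log"}))
```

Because `log_path` has no default, leaving it out of the block raises an error
in this sketch, whereas `verbose` is filled in silently.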
### Training with custom code {id="custom-code"}

> ```bash
> ### Training
> $ python -m spacy train config.cfg --code functions.py
> ```
>
> ```bash
> ### Packaging
> $ python -m spacy package ./model-best ./packages --code functions.py
> ```

The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
that can then be referenced from your `config.cfg`. This lets you train spaCy
pipelines with custom components, without having to re-implement the whole
training workflow. When you package your trained pipeline later using
[`spacy package`](/api/cli#package), you can provide one or more Python files to
be included in the package and imported in its `__init__.py`. This means that
any custom architectures, functions or
[components](/usage/processing-pipelines#custom-components) will be shipped with
your pipeline and registered when it's loaded. See the documentation on
[saving and loading pipelines](/usage/saving-loading#models-custom) for details.

<Infobox variant="warning">

Note that the unpackaged models produced by `spacy train` are data directories
that **do not include custom code**. You need to import the code in your script
before loading in unpackaged models. For more details, see
[`spacy.load`](/api/top-level#spacy.load).

</Infobox>

#### Example: Modifying the nlp object {id="custom-code-nlp-callbacks"}

For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch – it's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide five optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:

| Callback | Description |
| --- | --- |
| `nlp.before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults) aside from the tokenizer settings. |
| `nlp.after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. |
| `nlp.after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |
| `initialize.before_init` | Called before the pipeline components are initialized and receives the `nlp` object for in-place modification. Useful for modifying the tokenizer settings, similar to the v2 base model option. |
| `initialize.after_init` | Called after the pipeline components are initialized and receives the `nlp` object for in-place modification. |

The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a custom
stop word to the defaults:

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```

```python {title="functions.py",highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.add("good")
        return lang_cls

    return customize_language_data
```

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback** – it's
not the callback itself.

</Infobox>

Any registered function – in this case `create_callback` – can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```

```python {title="functions.py",highlight="5,7-9"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data
```

<Infobox title="Tip: Use Python type hints" emoji="💡">

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

</Infobox>

With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.

```bash
$ python -m spacy train config.cfg --output ./output --code ./functions.py
```

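To see the two-step nature of a registered callback without running a training
job, the plain-Python sketch below calls the factory and then the callback by
hand. `ToyLanguage` and `ToyDefaults` are stand-ins invented for this example
and are not spaCy classes. The registry decorator is left out so that the
snippet runs without spaCy installed.

```python
from typing import Callable, List, Set, Type


class ToyDefaults:
    stop_words: Set[str] = {"the", "a"}


class ToyLanguage:
    # Stand-in for a language subclass such as `English`.
    Defaults = ToyDefaults


def create_callback(extra_stop_words: List[str], debug: bool = False) -> Callable:
    def customize_language_data(lang_cls: Type[ToyLanguage]) -> Type[ToyLanguage]:
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data


# Step 1: the arguments from the config block go to the registered function.
callback = create_callback(extra_stop_words=["ooh", "aah"], debug=True)
# Step 2: the returned callback receives the class, not an instance.
lang_cls = callback(ToyLanguage)
print(sorted(lang_cls.Defaults.stop_words))
```

The settings from the config are consumed in step 1, while the language class
only appears in step 2, which is why the registered function has to return a
callback instead of doing the work itself.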
#### Example: Modifying tokenizer settings {id="custom-tokenizer"}

Use the `initialize.before_init` callback to modify the tokenizer settings when
training a new pipeline. Write a registered callback that modifies the tokenizer
settings and specify this callback in your config:

> #### config.cfg
>
> ```ini
> [initialize]
>
> [initialize.before_init]
> @callbacks = "customize_tokenizer"
> ```

```python {title="functions.py"}
from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # remove a suffix
        suffixes = list(nlp.Defaults.suffixes)
        suffixes.remove("\\[")
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search

        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
    return customize_tokenizer
```

When training, provide the function above with the `--code` option:

```bash
$ python -m spacy train config.cfg --code ./functions.py
```

Because this callback is only called in the one-time initialization step before
training, the callback code does not need to be packaged with the final pipeline
package. However, to make it easier for others to replicate your training setup,
you can choose to package the initialization callbacks with the pipeline package
or to publish them separately.

<Infobox variant="warning" title="nlp.before_creation vs. initialize.before_init">

- `nlp.before_creation` is the best place to modify language defaults other than
  the tokenizer settings.
- `initialize.before_init` is the best place to modify tokenizer settings when
  training a new pipeline.

Unlike the other language defaults, the tokenizer settings are saved with the
pipeline with `nlp.to_disk()`, so modifications made in `nlp.before_creation`
will be clobbered by the saved settings when the trained pipeline is loaded from
disk.

</Infobox>

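The effect of removing a suffix pattern in the tokenizer example can be tried
out with the standard `re` module alone. This sketch assumes that the suffix
patterns are combined into a single alternation in which every pattern is
anchored at the end of the string. `toy_compile_suffix_regex` is a hypothetical
stand-in for `compile_suffix_regex`, and the three-item suffix list is far
shorter than the real defaults.

```python
import re
from typing import Iterable, Pattern


def toy_compile_suffix_regex(entries: Iterable[str]) -> Pattern:
    # Anchor every pattern at the end of the string and join the alternatives.
    return re.compile("|".join(piece + "$" for piece in entries if piece.strip()))


suffixes = ["\\[", "\\]", "\\."]  # hypothetical, illustration only
before = toy_compile_suffix_regex(suffixes)

suffixes.remove("\\[")
after = toy_compile_suffix_regex(suffixes)

# A trailing "[" is only recognized as a suffix before the pattern is removed.
print(bool(before.search("value[")), bool(after.search("value[")))
# Other suffixes are unaffected.
print(bool(after.search("value]")))
```

Note that the string passed to `remove` has to match the list entry exactly,
including the escaped backslash, otherwise `list.remove` raises a `ValueError`.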
#### Example: Custom logging function {id="custom-logging"}

During training, the results of each step are passed to a logger function. By
default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
[`WandbLogger`](https://github.com/explosion/spacy-loggers#wandblogger). On each
step, the logger function receives a **dictionary** with the following keys:

| Key | Value |
| --- | --- |
| `epoch` | How many passes over the data have been completed. ~~int~~ |
| `step` | How many steps have been completed. ~~int~~ |
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
| `checkpoints` | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |

You can easily implement and plug in your own logger that records the training
results in a custom way, or sends them to an experiment management tracker of
your choice. In this example, the function `my_custom_logger.v1` writes the
tabular results to a file:

> ```ini
> ### config.cfg (excerpt)
> [training.logger]
> @loggers = "my_custom_logger.v1"
> log_path = "my_file.tab"
> ```

```python {title="functions.py"}
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path

@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO = sys.stdout,
        stderr: IO = sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        # Header row: step, score and one loss column per component
        log_file.write("step\t")
        log_file.write("score\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\t")
        log_file.write("\n")

        def log_step(info: Optional[Dict[str, Any]]):
            # Only write a row if results were passed in for this step
            if info:
                log_file.write(f"{info['step']}\t")
                log_file.write(f"{info['score']}\t")
                for pipe in nlp.pipe_names:
                    log_file.write(f"{info['losses'][pipe]}\t")
                log_file.write("\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger
```

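For orientation, this is roughly what a single `info` dictionary with the keys
from the table could look like, together with a small helper that turns it into
one tab-separated row. All values are made up, and `format_row` is a
hypothetical helper that is independent of spaCy. Unlike the logger above, it
does not write a trailing tab after the last column.

```python
from typing import Any, Dict, List


def format_row(info: Dict[str, Any], pipe_names: List[str]) -> str:
    """Build one tab-separated row: step, score, then one loss per component."""
    cells = [str(info["step"]), str(info["score"])]
    cells.extend(str(info["losses"][pipe]) for pipe in pipe_names)
    return "\t".join(cells) + "\n"


# Made-up values, using the keys listed in the table above.
info = {
    "epoch": 1,
    "step": 200,
    "score": 0.79,
    "other_scores": {"ents_f": 0.7, "tag_acc": 0.95},
    "losses": {"tagger": 12.5, "ner": 30.25},
    "checkpoints": [(0.75, 100), (0.79, 200)],
}
row = format_row(info, ["tagger", "ner"])
print(row, end="")
```

Keeping the formatting in a separate function like this makes it possible to
test the row layout without opening a file or starting a training run.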
#### Example: Custom batch size schedule {id="custom-code-schedule"}

You can also implement your own batch size schedule to use during training. The
`@spacy.registry.schedules` decorator lets you register that function in the
`schedules` [registry](/api/top-level#registry) and assign it a string name:

> #### Why the version in the name?
>
> A big benefit of the config system is that it makes your experiments
> reproducible. We recommend versioning the functions you register, especially
> if you expect them to change (like a new model architecture). This way, you
> know that a config referencing `v1` means a different function than a config
> referencing `v2`.

```python {title="functions.py"}
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    while True:
        yield start
        start = start * factor
```

In your config, you can now reference the schedule in the
`[training.batch_size]` block via `@schedules`. If a block contains a key
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config.

```ini {title="config.cfg (excerpt)"}
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
```

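Before using a schedule in training, it can help to look at the first few values
it produces. The sketch below repeats the generator logic from the example
without the registry decorator, so that it runs without spaCy, and uses
`itertools.islice` to take a finite sample with the settings from the config
excerpt.

```python
from itertools import islice
from typing import Iterator


def my_custom_schedule(start: int = 1, factor: float = 1.001) -> Iterator[float]:
    # Same logic as the registered function, minus the decorator.
    while True:
        yield start
        start = start * factor


# The schedule is infinite, so only take the first four values.
first_values = list(islice(my_custom_schedule(start=2, factor=1.005), 4))
print([round(value, 3) for value in first_values])
```

Every value after the first one is a float, because `start` is multiplied by a
float `factor`. If the code that consumes the schedule needs whole numbers, the
values would have to be rounded inside the generator.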
### Defining custom architectures {id="custom-architectures"}

Built-in pipeline components such as the tagger or named entity recognizer are
constructed with default neural network [models](/api/architectures). You can
change the model architecture entirely by implementing your own custom models
and providing those in the config when creating the pipeline component. See the
documentation on [layers and model architectures](/usage/layers-architectures)
for more details.

> ```ini
> ### config.cfg
> [components.tagger]
> factory = "tagger"
>
> [components.tagger.model]
> @architectures = "custom_neural_network.v1"
> output_width = 512
> ```

```python {title="functions.py"}
from typing import List
from thinc.types import Floats2d
from thinc.api import Model
import spacy
from spacy.tokens import Doc

@spacy.registry.architectures("custom_neural_network.v1")
def custom_neural_network(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    # create_model is a placeholder for your own code that builds the Thinc model
    return create_model(output_width)
```

## Customizing the initialization {id="initialization"}
 | 
						||
 | 
						||
When you start training a new model from scratch,
 | 
						||
[`spacy train`](/api/cli#train) will call
 | 
						||
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and load
 | 
						||
the required data. All settings for this are defined in the
 | 
						||
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
 | 
						||
you can keep track of how the initial `nlp` object was created. The
 | 
						||
initialization process typically includes the following:
 | 
						||
 | 
						||
> #### config.cfg (excerpt)
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [initialize]
 | 
						||
> vectors = ${paths.vectors}
 | 
						||
> init_tok2vec = ${paths.init_tok2vec}
 | 
						||
>
 | 
						||
> [initialize.components]
 | 
						||
> # Settings for components
 | 
						||
> ```
 | 
						||
 | 
						||
1. Load in **data resources** defined in the `[initialize]` config, including
 | 
						||
   **word vectors** and
 | 
						||
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
 | 
						||
   weights**.
 | 
						||
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
 | 
						||
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
 | 
						||
   access the training data, the current `nlp` object and any **custom
 | 
						||
   arguments** defined in the `[initialize]` config.
 | 
						||
3. In **pipeline components**: if needed, use the data to
 | 
						||
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
 | 
						||
   set up the label scheme if no labels are provided. Components may also load
 | 
						||
   other data like lookup tables or dictionaries.
 | 
						||
 | 
						||
The initialization step allows the config to define **all settings** required
 | 
						||
for the pipeline, while keeping a separation between settings and functions that
 | 
						||
should only be used **before training** to set up the initial pipeline, and
 | 
						||
logic and configuration that needs to be available **at runtime**. Without that
 | 
						||
separation, it would be very difficult to use the same, reproducible config file
 | 
						||
because the component settings required for training (load data from an external
 | 
						||
file) wouldn't match the component settings required at runtime (load what's
 | 
						||
included with the saved `nlp` object and don't depend on an external file).
 | 
						||
 | 
						||

 | 
						||
 | 
						||
<Infobox title="How components save and load data" emoji="📖">
 | 
						||
 | 
						||
For details and examples of how pipeline components can **save and load data
 | 
						||
assets** like model weights or lookup tables, and how the component
 | 
						||
initialization is implemented under the hood, see the usage guide on
 | 
						||
[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
#### Initializing labels {id="initialization-labels"}
 | 
						||
 | 
						||
Built-in pipeline components like the
 | 
						||
[`EntityRecognizer`](/api/entityrecognizer) or
 | 
						||
[`DependencyParser`](/api/dependencyparser) need to know their available labels
 | 
						||
and associated internal meta information to initialize their model weights.
 | 
						||
Using the `get_examples` callback provided on initialization, they're able to
 | 
						||
**read the labels off the training data** automatically, which is very
 | 
						||
convenient – but it can also slow down the training process to compute this
 | 
						||
information on every run.
 | 
						||
 | 
						||
The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
 | 
						||
files containing the label data for all supported components. You can then pass
 | 
						||
in the labels in the `[initialize]` settings for the respective components to
 | 
						||
allow them to initialize faster.
 | 
						||
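
The trade-off between inferring labels and providing them can be sketched in plain Python. `ToyLabelComponent` below is a hypothetical stand-in, not a spaCy class, and the training examples are simple dictionaries rather than `Example` objects. Only the shape of the `initialize` method (a `get_examples` callback plus optional `nlp` and `labels` arguments) follows the description above.

```python
from typing import Callable, Iterable, List, Optional


class ToyLabelComponent:
    """Hypothetical stand-in for a trainable component (not a spaCy class)."""

    def __init__(self):
        self.labels: List[str] = []

    def initialize(
        self,
        get_examples: Callable[[], Iterable[dict]],
        *,
        nlp=None,
        labels: Optional[List[str]] = None,
    ):
        if labels is not None:
            # Fast path: labels were provided, so the data isn't scanned
            self.labels = list(labels)
            return
        # Slow path: read the labels off the training data
        seen = set()
        for example in get_examples():
            seen.update(example["labels"])
        self.labels = sorted(seen)


def get_examples():
    # Toy training data; real callbacks return Example objects
    yield {"text": "Apple is a company", "labels": ["ORG"]}
    yield {"text": "Berlin is in Germany", "labels": ["GPE", "GPE"]}


inferred = ToyLabelComponent()
inferred.initialize(get_examples)

preset = ToyLabelComponent()
preset.initialize(get_examples, labels=["GPE", "ORG"])
print(inferred.labels, preset.labels)
```

Both components end up with the same labels, but only the first one had to iterate over the data.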
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [initialize.components.ner]
 | 
						||
>
 | 
						||
> [initialize.components.ner.labels]
 | 
						||
> @readers = "spacy.read_labels.v1"
 | 
						||
> path = "corpus/labels/ner.json"
 | 
						||
> ```
 | 
						||
 | 
						||
```bash
 | 
						||
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
 | 
						||
```
 | 
						||
 | 
						||
Under the hood, the command delegates to the `label_data` property of the
 | 
						||
pipeline components, for instance
 | 
						||
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
 | 
						||
 | 
						||
<Infobox variant="warning" title="Important note">
 | 
						||
 | 
						||
The JSON format differs for each component and some components need additional
 | 
						||
meta information about their labels. The format exported by
 | 
						||
[`init labels`](/api/cli#init-labels) matches what the components need, so you
 | 
						||
should always let spaCy **auto-generate the labels** for you.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Data utilities {id="data"}
 | 
						||
 | 
						||
spaCy includes various features and utilities to make it easy to train models
 | 
						||
using your own data, manage training and evaluation corpora, convert existing
 | 
						||
annotations and configure data augmentation strategies for more robust models.
 | 
						||
 | 
						||
### Converting existing corpora and annotations {id="data-convert"}
 | 
						||
 | 
						||
If you have training data in a standard format like `.conll` or `.conllu`, the
 | 
						||
easiest way to convert it for use with spaCy is to run
 | 
						||
[`spacy convert`](/api/cli#convert) and pass it a file and an output directory.
 | 
						||
By default, the command will pick the converter based on the file extension.
 | 
						||
 | 
						||
```bash
 | 
						||
$ python -m spacy convert ./train.gold.conll ./corpus
 | 
						||
```
 | 
						||
 | 
						||
> #### 💡 Tip: Converting from Prodigy
 | 
						||
>
 | 
						||
> If you're using the [Prodigy](https://prodi.gy) annotation tool to create
 | 
						||
> training data, you can run the
 | 
						||
> [`data-to-spacy` command](https://prodi.gy/docs/recipes#data-to-spacy) to
 | 
						||
> merge and export multiple datasets for use with
 | 
						||
> [`spacy train`](/api/cli#train). Different types of annotations on the same
 | 
						||
> text will be combined, giving you one corpus to train multiple components.
 | 
						||
 | 
						||
<Infobox title="Tip: Manage multi-step workflows with projects" emoji="💡">
 | 
						||
 | 
						||
Training workflows often consist of multiple steps, from preprocessing the data
 | 
						||
all the way to packaging and deploying the trained model.
 | 
						||
[spaCy projects](/usage/projects) let you define all steps in one file, manage
 | 
						||
data assets, track changes and share your end-to-end processes with your team.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
The binary `.spacy` format is a serialized [`DocBin`](/api/docbin) containing
 | 
						||
one or more [`Doc`](/api/doc) objects. It's extremely **efficient in storage**,
 | 
						||
especially when packing multiple documents together. You can also create `Doc`
 | 
						||
objects manually, so you can write your own custom logic to convert and store
 | 
						||
existing annotations for use in spaCy.
 | 
						||
 | 
						||
```python {title="Training data from Doc objects",highlight="6-9"}
 | 
						||
import spacy
 | 
						||
from spacy.tokens import Doc, DocBin
 | 
						||
 | 
						||
nlp = spacy.blank("en")
 | 
						||
docbin = DocBin()
 | 
						||
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
 | 
						||
spaces = [True, True, True, True, True, True, True, False]
 | 
						||
ents = ["B-ORG", "O", "O", "O", "O", "B-GPE", "O", "O"]
 | 
						||
doc = Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)
 | 
						||
docbin.add(doc)
 | 
						||
docbin.to_disk("./train.spacy")
 | 
						||
```
 | 
						||
 | 
						||
### Working with corpora {id="data-corpora"}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [corpora]
 | 
						||
>
 | 
						||
> [corpora.train]
 | 
						||
> @readers = "spacy.Corpus.v1"
 | 
						||
> path = ${paths.train}
 | 
						||
> gold_preproc = false
 | 
						||
> max_length = 0
 | 
						||
> limit = 0
 | 
						||
> augmenter = null
 | 
						||
>
 | 
						||
> [training]
 | 
						||
> train_corpus = "corpora.train"
 | 
						||
> ```
 | 
						||
 | 
						||
The [`[corpora]`](/api/data-formats#config-corpora) block in your config lets
 | 
						||
you define **data resources** to use for training, evaluation, pretraining or
 | 
						||
any other custom workflows. `corpora.train` and `corpora.dev` are used as
 | 
						||
conventions within spaCy's default configs, but you can also define any other
 | 
						||
custom blocks. Each section in the corpora config should resolve to a
 | 
						||
[`Corpus`](/api/corpus) – for example, using spaCy's built-in
 | 
						||
[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
 | 
						||
`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 | 
						||
[`[training]`](/api/data-formats#config-training) block specify where to find
 | 
						||
the corpus in your config. This makes it easy to **swap out** different corpora
 | 
						||
by only changing a single config setting.
 | 
						||
 | 
						||
Instead of making `[corpora]` a block with multiple subsections for each portion
 | 
						||
of the data, you can also use a single function that returns a dictionary of
 | 
						||
corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 | 
						||
especially useful if you need to split a single file into corpora for training
 | 
						||
and evaluation, without loading the same file twice.
 | 
						||
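
As a rough illustration of the single-function approach, the sketch below splits one list of records into `"train"` and `"dev"` readers. It is plain Python under stated assumptions: `create_split_corpora` is a hypothetical name, the records are `(text, annotations)` tuples instead of `Example` objects, and the registration in the `readers` registry is left out.

```python
import random
from typing import Callable, Dict, Iterator, List, Tuple

Record = Tuple[str, dict]


def create_split_corpora(
    records: List[Record], dev_fraction: float = 0.2, seed: int = 0
) -> Dict[str, Callable[[object], Iterator[Record]]]:
    # Shuffle a copy once so both corpora come from the same load of the data
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    dev, train = shuffled[:n_dev], shuffled[n_dev:]

    def make_reader(items: List[Record]):
        def read(nlp) -> Iterator[Record]:
            # A real reader would build Example objects with nlp.make_doc here
            yield from items

        return read

    return {"train": make_reader(train), "dev": make_reader(dev)}


records = [(f"text {i}", {"cats": {"EVEN": i % 2 == 0}}) for i in range(10)]
corpora = create_split_corpora(records, dev_fraction=0.2)
train_items = list(corpora["train"](None))
dev_items = list(corpora["dev"](None))
print(len(train_items), len(dev_items))
```

The returned dictionary is keyed by corpus name, so `"train"` and `"dev"` can be looked up the same way as separate subsections would be.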
 | 
						||
By default, the training data is loaded into memory and shuffled before each
 | 
						||
epoch. If the corpus is **too large to fit into memory** during training, stream
 | 
						||
the corpus using a custom reader as described in the next section.
 | 
						||
 | 
						||
### Custom data reading and batching {id="custom-code-readers-batchers"}
 | 
						||
 | 
						||
Some use cases require **streaming in data** or manipulating datasets on the
 | 
						||
fly, rather than generating all data beforehand and storing it to disk. Instead
 | 
						||
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 | 
						||
paths, you can create and register a custom function that generates
 | 
						||
[`Example`](/api/example) objects.
 | 
						||
 | 
						||
In the following example we assume a custom function `read_custom_data` which
 | 
						||
loads or generates texts with relevant text classification annotations. Then,
 | 
						||
small lexical variations of the input text are created before generating the
 | 
						||
final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
 | 
						||
lets you register the function creating the custom reader in the `readers`
 | 
						||
[registry](/api/top-level#registry) and assign it a string name, so it can be
 | 
						||
used in your config. All arguments on the registered function become available
 | 
						||
as **config settings** – in this case, `source`.
 | 
						||
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [corpora.train]
 | 
						||
> @readers = "corpus_variants.v1"
 | 
						||
> source = "s3://your_bucket/path/data.csv"
 | 
						||
> ```
 | 
						||
 | 
						||
```python {title="functions.py",highlight="7-8"}
 | 
						||
from typing import Callable, Iterator, List
 | 
						||
import spacy
 | 
						||
from spacy.training import Example
 | 
						||
from spacy.language import Language
 | 
						||
import random
 | 
						||
 | 
						||
@spacy.registry.readers("corpus_variants.v1")
 | 
						||
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
 | 
						||
    def generate_stream(nlp):
 | 
						||
        for text, cats in read_custom_data(source):
 | 
						||
            # Create a random variant of the example text
 | 
						||
            i = random.randint(0, len(text) - 1)
 | 
						||
            variant = text[:i] + text[i].upper() + text[i + 1:]
 | 
						||
            doc = nlp.make_doc(variant)
 | 
						||
            example = Example.from_dict(doc, {"cats": cats})
 | 
						||
            yield example
 | 
						||
 | 
						||
    return generate_stream
 | 
						||
```
 | 
						||
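
The `read_custom_data` helper is assumed rather than defined above. One possible shape for it, sketched under the assumption that `source` is a local CSV file with `text` and `label` columns (not an S3 URL as in the config excerpt), could look like this:

```python
import csv
import os
import tempfile
from typing import Dict, Iterator, Tuple


def read_custom_data(source: str) -> Iterator[Tuple[str, Dict[str, bool]]]:
    # Assumes a local CSV file with "text" and "label" columns
    with open(source, newline="", encoding="utf8") as file_:
        rows = list(csv.DictReader(file_))
    labels = sorted({row["label"] for row in rows})
    for row in rows:
        # One boolean per known label, in the style of the "cats" annotations
        yield row["text"], {label: label == row["label"] for label in labels}


# Write a tiny CSV file to try the helper out
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")
    with open(path, "w", newline="", encoding="utf8") as file_:
        file_.write("text,label\nGreat product,POSITIVE\nNever again,NEGATIVE\n")
    data = list(read_custom_data(path))
print(data)
```

This version reads all rows into memory first to collect the label set, which is a simplification and wouldn't suit very large files.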
 | 
						||
<Infobox variant="warning">
 | 
						||
 | 
						||
Remember that a registered function should always be a function that spaCy
 | 
						||
**calls to create something**. In this case, it **creates the reader function**
 | 
						||
– it's not the reader itself.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
If the corpus is **too large to load into memory** or the corpus reader is an
 | 
						||
**infinite generator**, use the setting `max_epochs = -1` to indicate that the
 | 
						||
train corpus should be streamed. With this setting the train corpus is merely
 | 
						||
streamed and batched, not shuffled, so any shuffling needs to be implemented in
 | 
						||
the corpus reader itself. In the example below, a corpus reader that generates
 | 
						||
sentences containing even or odd numbers is used with an unlimited number of
 | 
						||
examples for the train corpus and a limited number of examples for the dev
 | 
						||
corpus. The dev corpus should always be finite and fit in memory during the
 | 
						||
evaluation step. `max_steps` and/or `patience` are used to determine when the
 | 
						||
training should stop.
 | 
						||
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [corpora.dev]
 | 
						||
> @readers = "even_odd.v1"
 | 
						||
> limit = 100
 | 
						||
>
 | 
						||
> [corpora.train]
 | 
						||
> @readers = "even_odd.v1"
 | 
						||
> limit = -1
 | 
						||
>
 | 
						||
> [training]
 | 
						||
> max_epochs = -1
 | 
						||
> patience = 500
 | 
						||
> max_steps = 2000
 | 
						||
> ```
 | 
						||
 | 
						||
```python {title="functions.py"}
 | 
						||
from typing import Callable, Iterable, Iterator
 | 
						||
from spacy import util
 | 
						||
import random
 | 
						||
from spacy.training import Example
 | 
						||
from spacy import Language
 | 
						||
 | 
						||
 | 
						||
@util.registry.readers("even_odd.v1")
 | 
						||
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
 | 
						||
    return EvenOddCorpus(limit)
 | 
						||
 | 
						||
 | 
						||
class EvenOddCorpus:
 | 
						||
    def __init__(self, limit):
 | 
						||
        self.limit = limit
 | 
						||
 | 
						||
    def __call__(self, nlp: Language) -> Iterator[Example]:
 | 
						||
        i = 0
 | 
						||
        while i < self.limit or self.limit < 0:
 | 
						||
            r = random.randint(0, 1000)
 | 
						||
            cat = r % 2 == 0
 | 
						||
            text = "This is sentence " + str(r)
 | 
						||
            yield Example.from_dict(
 | 
						||
                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
 | 
						||
            )
 | 
						||
            i += 1
 | 
						||
```
 | 
						||
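
Since a streamed corpus isn't shuffled for you, one option is to shuffle inside the reader with a fixed-size buffer. The following is a minimal sketch in plain Python; `shuffle_stream` and its `buffer_size` and `seed` arguments are hypothetical names, and the items could be `Example` objects or anything else the reader produces.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_stream(
    stream: Iterable[T], buffer_size: int = 100, seed: int = 0
) -> Iterator[T]:
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit a random buffered item, keeping the buffer at a fixed size
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # For finite streams, flush what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_stream(range(20), buffer_size=5))
print(shuffled)
```

A larger buffer mixes the stream more thoroughly at the cost of holding more items in memory. Within a reader's `__call__` method, the generator of examples could be wrapped in such a function before it's returned.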
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [initialize.components.textcat.labels]
 | 
						||
> @readers = "spacy.read_labels.v1"
 | 
						||
> path = "labels/textcat.json"
 | 
						||
> require = true
 | 
						||
> ```
 | 
						||
 | 
						||
If the train corpus is streamed, the initialize step peeks at the first 100
 | 
						||
examples in the corpus to find the labels for each component. If this isn't
 | 
						||
sufficient, you'll need to [provide the labels](#initialization-labels) for each
 | 
						||
component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
 | 
						||
be used to generate JSON files in the correct format, which you can extend with
 | 
						||
the full label set.
 | 
						||
 | 
						||
We can also customize the **batching strategy** by registering a new batcher
 | 
						||
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 | 
						||
a stream of items into a stream of batches. spaCy has several useful built-in
 | 
						||
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
 | 
						||
also easy to implement your own. For instance, the following function takes the
 | 
						||
stream of generated [`Example`](/api/example) objects, and removes those which
 | 
						||
have the same underlying raw text, to avoid duplicates within each batch. Note
 | 
						||
that in a more realistic implementation, you'd also want to check whether the
 | 
						||
annotations are the same.
 | 
						||
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [training.batcher]
 | 
						||
> @batchers = "filtering_batch.v1"
 | 
						||
> size = 150
 | 
						||
> ```
 | 
						||
 | 
						||
```python {title="functions.py"}
 | 
						||
from typing import Callable, Iterable, Iterator, List
 | 
						||
import spacy
 | 
						||
from spacy.training import Example
 | 
						||
 | 
						||
@spacy.registry.batchers("filtering_batch.v1")
 | 
						||
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
 | 
						||
    def create_filtered_batches(examples):
 | 
						||
        batch = []
 | 
						||
        for eg in examples:
 | 
						||
            # Remove duplicate examples with the same text from batch
 | 
						||
            if eg.text not in [x.text for x in batch]:
 | 
						||
                batch.append(eg)
 | 
						||
            if len(batch) == size:
                yield batch
                batch = []
        # Yield the remaining examples as a final, smaller batch
        if batch:
            yield batch
 | 
						||
 | 
						||
    return create_filtered_batches
 | 
						||
```
 | 
						||
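
To see what such a batcher does to a stream, the sketch below re-implements the filtering logic without the registry decorator and runs it on stand-in objects. `FakeExample` is a hypothetical placeholder that only has a `text` attribute. Tracking the texts in a set and yielding the last, possibly smaller batch are choices made for this sketch.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FakeExample(NamedTuple):
    # Hypothetical stand-in exposing only the .text attribute used below
    text: str


def create_filtered_batches(
    examples: Iterable[FakeExample], size: int
) -> Iterator[List[FakeExample]]:
    batch, seen = [], set()
    for eg in examples:
        # Skip examples whose text is already in the current batch
        if eg.text not in seen:
            batch.append(eg)
            seen.add(eg.text)
        if len(batch) == size:
            yield batch
            batch, seen = [], set()
    if batch:
        # Yield the remaining examples as a final, smaller batch
        yield batch


stream = [FakeExample(text) for text in ["a", "a", "b", "c", "c"]]
batches = [
    [eg.text for eg in batch] for batch in create_filtered_batches(stream, size=2)
]
print(batches)
```

Duplicates are only removed within a batch, so the same text can still show up again in a later batch.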
 | 
						||
{/* TODO: Custom corpus class, Minibatching */}
 | 
						||
 | 
						||
### Data augmentation {id="data-augmentation"}
 | 
						||
 | 
						||
Data augmentation is the process of applying small **modifications** to the
 | 
						||
training data. It can be especially useful for punctuation and case replacement
 | 
						||
– for example, if your corpus only uses smart quotes and you want to include
 | 
						||
variations using regular quotes, or to make the model less sensitive to
 | 
						||
capitalization by including a mix of capitalized and lowercase examples.
 | 
						||
 | 
						||
The easiest way to use data augmentation during training is to provide an
 | 
						||
`augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
 | 
						||
your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
 | 
						||
augmenter creates a data augmentation callback that uses orth-variant
 | 
						||
replacement.
 | 
						||
 | 
						||
```ini {title="config.cfg (excerpt)",highlight="8,14"}
 | 
						||
[corpora.train]
 | 
						||
@readers = "spacy.Corpus.v1"
 | 
						||
path = ${paths.train}
 | 
						||
gold_preproc = false
 | 
						||
max_length = 0
 | 
						||
limit = 0
 | 
						||
 | 
						||
[corpora.train.augmenter]
 | 
						||
@augmenters = "spacy.orth_variants.v1"
 | 
						||
# Percentage of texts that will be augmented / lowercased
 | 
						||
level = 0.1
 | 
						||
lower = 0.5
 | 
						||
 | 
						||
[corpora.train.augmenter.orth_variants]
 | 
						||
@readers = "srsly.read_json.v1"
 | 
						||
path = "corpus/orth_variants.json"
 | 
						||
```
 | 
						||
 | 
						||
The `orth_variants` argument lets you pass in a dictionary of replacement rules,
 | 
						||
typically loaded from a JSON file. There are two types of orth variant rules:
 | 
						||
`"single"` for single tokens that should be replaced (e.g. hyphens) and
 | 
						||
`"paired"` for pairs of tokens (e.g. quotes).
 | 
						||
 | 
						||
```json {title="orth_variants.json"}
 | 
						||
{
 | 
						||
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
 | 
						||
  "paired": [
 | 
						||
    {
 | 
						||
      "tags": ["``", "''"],
 | 
						||
      "variants": [
 | 
						||
        ["'", "'"],
 | 
						||
        ["‘", "’"]
 | 
						||
      ]
 | 
						||
    }
 | 
						||
  ]
 | 
						||
}
 | 
						||
```
 | 
						||
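
The difference between the two rule types can be illustrated with a toy function that works on plain lists of words and tags. This is not how the built-in augmenter is implemented: `apply_orth_variants` is a hypothetical helper, and the matching logic is a simplified assumption based on the description above.

```python
import random
from typing import Dict, List

ORTH_VARIANTS = {
    "single": [{"tags": ["NFP"], "variants": ["…", "..."]}],
    "paired": [{"tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]]}],
}


def apply_orth_variants(
    words: List[str], tags: List[str], rules: Dict, rng: random.Random
) -> List[str]:
    result = list(words)
    for rule in rules.get("single", []):
        for i, (word, tag) in enumerate(zip(words, tags)):
            # Replace a matching single token with a randomly chosen variant
            if tag in rule["tags"] and word in rule["variants"]:
                result[i] = rng.choice(rule["variants"])
    for rule in rules.get("paired", []):
        open_tag, close_tag = rule["tags"]
        open_forms = [pair[0] for pair in rule["variants"]]
        close_forms = [pair[1] for pair in rule["variants"]]
        # Pick one pair so opening and closing tokens stay consistent
        new_open, new_close = rng.choice(rule["variants"])
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag == open_tag and word in open_forms:
                result[i] = new_open
            elif tag == close_tag and word in close_forms:
                result[i] = new_close
    return result


words = ["‘", "Hello", "’", "…"]
tags = ["``", "UH", "''", "NFP"]
augmented = apply_orth_variants(words, tags, ORTH_VARIANTS, random.Random(0))
print(augmented)
```

For `"paired"` rules, the opening and closing tokens are always replaced with forms from the same variant pair, while `"single"` tokens are replaced independently of each other.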
 | 
						||
<Accordion title="Full examples for English and German" spaced>
 | 
						||
 | 
						||
```json
 | 
						||
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
 | 
						||
```
 | 
						||
 | 
						||
```json
 | 
						||
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
 | 
						||
```
 | 
						||
 | 
						||
</Accordion>
 | 
						||
 | 
						||
<Infobox title="Important note" variant="warning">
 | 
						||
 | 
						||
When adding data augmentation, keep in mind that it typically only makes sense
 | 
						||
to apply it to the **training corpus**, not the development data.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
#### Writing custom data augmenters {id="data-augmentation-custom"}
 | 
						||
 | 
						||
Using the [`@spacy.registry.augmenters`](/api/top-level#registry) registry, you can also
 | 
						||
register your own data augmentation callbacks. The callback should be a function
 | 
						||
that takes the current `nlp` object and a training [`Example`](/api/example) and
 | 
						||
yields `Example` objects. Keep in mind that the augmenter should yield **all
 | 
						||
examples** you want to use in your corpus, not only the augmented examples
 | 
						||
(unless you want to augment all examples).
 | 
						||
 | 
						||
Here's an example of a custom augmentation callback that produces text variants
 | 
						||
in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
 | 
						||
registered function takes one argument `randomize` that can be set via the
 | 
						||
config and decides whether the uppercase/lowercase transformation is applied
 | 
						||
randomly or not. The augmenter yields two `Example` objects: the original
 | 
						||
example and the augmented example.
 | 
						||
 | 
						||
> #### config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [corpora.train.augmenter]
 | 
						||
> @augmenters = "spongebob_augmenter.v1"
 | 
						||
> randomize = false
 | 
						||
> ```
 | 
						||
 | 
						||
```python
 | 
						||
import spacy
 | 
						||
import random
 | 
						||
 | 
						||
@spacy.registry.augmenters("spongebob_augmenter.v1")
 | 
						||
def create_augmenter(randomize: bool = False):
 | 
						||
    def augment(nlp, example):
 | 
						||
        text = example.text
 | 
						||
        if randomize:
 | 
						||
            # Randomly uppercase/lowercase characters
 | 
						||
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
 | 
						||
        else:
 | 
						||
            # Uppercase followed by lowercase
 | 
						||
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
 | 
						||
        # Create augmented training example
 | 
						||
        example_dict = example.to_dict()
 | 
						||
        doc = nlp.make_doc("".join(chars))
 | 
						||
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
 | 
						||
        # Original example followed by augmented example
 | 
						||
        yield example
 | 
						||
        yield example.from_dict(doc, example_dict)
 | 
						||
 | 
						||
    return augment
 | 
						||
```
 | 
						||
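
To check what the two casing modes produce, the transformation can be tried on plain strings first. The sketch below is independent of spaCy: `spongebob_case` and `augment_texts` are hypothetical helpers that reuse the same list comprehensions and yield the original text followed by the variant.

```python
import random
from typing import Iterator


def spongebob_case(text: str, randomize: bool = False, seed: int = 0) -> str:
    if randomize:
        # Randomly uppercase/lowercase characters (seeded for repeatability)
        rng = random.Random(seed)
        chars = [c.lower() if rng.random() < 0.5 else c.upper() for c in text]
    else:
        # Even positions are uppercased, odd positions lowercased
        chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
    return "".join(chars)


def augment_texts(text: str, randomize: bool = False) -> Iterator[str]:
    # Yield the original first, then the variant, like the augmenter above
    yield text
    yield spongebob_case(text, randomize)


variants = list(augment_texts("hello world"))
print(variants)  # ['hello world', 'HeLlO WoRlD']
```

Whitespace counts as a position too, which is why the `w` in `world` ends up uppercased.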
 | 
						||
An easy way to create modified `Example` objects is to use the
 | 
						||
[`Example.from_dict`](/api/example#from_dict) method with a new reference
 | 
						||
[`Doc`](/api/doc) created from the modified text. In this case, only the
 | 
						||
capitalization changes, so only the `ORTH` values of the tokens will be
 | 
						||
different between the original and augmented examples.
 | 
						||
 | 
						||
Note that if your data augmentation strategy involves changing the tokenization
 | 
						||
(for instance, removing or adding tokens) and your training examples include
 | 
						||
token-based annotations like the dependency parse or entity labels, you'll need
 | 
						||
to take care to adjust the `Example` object so its annotations match and remain
 | 
						||
valid.
 | 
						||
 | 
						||
## Parallel & distributed training with Ray {id="parallel-training"}
 | 
						||
 | 
						||
> #### Installation
 | 
						||
>
 | 
						||
> ```bash
 | 
						||
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
 | 
						||
> # Check that the CLI is registered
 | 
						||
> $ python -m spacy ray --help
 | 
						||
> ```
 | 
						||
 | 
						||
[Ray](https://ray.io/) is a fast and simple framework for building and running
 | 
						||
**distributed applications**. You can use Ray to train spaCy on one or more
 | 
						||
remote machines, potentially speeding up your training process. Parallel
 | 
						||
training won't always be faster though – it depends on your batch size, models,
 | 
						||
and hardware.
 | 
						||
 | 
						||
<Infobox variant="warning">
 | 
						||
 | 
						||
To use Ray with spaCy, you need the
 | 
						||
[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed.
 | 
						||
Installing the package will automatically add the `ray` command to the spaCy
 | 
						||
CLI.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
The [`spacy ray train`](/api/cli#ray-train) command follows the same API as
 | 
						||
[`spacy train`](/api/cli#train), with a few extra options to configure the Ray
 | 
						||
setup. You can optionally set the `--address` option to point to your Ray
 | 
						||
cluster. If it's not set, Ray will run locally.
 | 
						||
 | 
						||
```bash
 | 
						||
$ python -m spacy ray train config.cfg --n-workers 2
 | 
						||
```
 | 
						||
 | 
						||
<Project id="integrations/ray">
 | 
						||
 | 
						||
Get started with parallel training using our project template. It trains a
 | 
						||
simple model on a Universal Dependencies Treebank and lets you parallelize the
 | 
						||
training with Ray.
 | 
						||
 | 
						||
</Project>
 | 
						||
 | 
						||
### How parallel training works {id="parallel-training-details"}
 | 
						||
 | 
						||
Each worker receives a shard of the **data** and builds a copy of the **model
 | 
						||
and optimizer** from the [`config.cfg`](#config). It also has a communication
 | 
						||
channel to **pass gradients and parameters** to the other workers. Additionally,
 | 
						||
each worker is given ownership of a subset of the parameter arrays. Every
 | 
						||
parameter array is owned by exactly one worker, and the workers are given a
 | 
						||
mapping so they know which worker owns which parameter.
 | 
						||
 | 
						||

 | 
						||
 | 
						||
As training proceeds, every worker will be computing gradients for **all** of
 | 
						||
the model parameters. When they compute gradients for parameters they don't own,
 | 
						||
they'll **send them to the worker** that does own that parameter, along with a
 | 
						||
version identifier so that the owner can decide whether to discard the gradient.
 | 
						||
Workers use the gradients they receive and the ones they compute locally to
 | 
						||
update the parameters they own, and then broadcast the updated array and a new
 | 
						||
version ID to the other workers.
 | 
						||
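
The ownership and versioning scheme can be pictured with a small, synchronous toy simulation. It is only a sketch under several assumptions: `ToyWorker` is hypothetical, parameters are single floats, the update is plain gradient descent, and a gradient is discarded whenever its version doesn't match exactly. The actual discard policy and the asynchronous communication via Ray are not modeled here.

```python
from typing import Dict, List


class ToyWorker:
    """Hypothetical worker holding a full copy of two scalar 'parameters'."""

    def __init__(self, name: str, owners: Dict[str, str], learn_rate: float = 0.1):
        self.name = name
        self.owners = owners  # maps parameter key -> name of the owning worker
        self.learn_rate = learn_rate
        self.params = {key: 1.0 for key in owners}
        self.versions = {key: 0 for key in owners}

    def receive_gradient(
        self, key: str, grad: float, version: int, peers: List["ToyWorker"]
    ) -> bool:
        # Only the owner applies gradients; stale versions are discarded here
        assert self.owners[key] == self.name
        if version != self.versions[key]:
            return False
        self.params[key] -= self.learn_rate * grad
        self.versions[key] += 1
        for peer in peers:
            # Broadcast the updated value and its new version ID
            peer.params[key] = self.params[key]
            peer.versions[key] = self.versions[key]
        return True


owners = {"W": "worker0", "b": "worker1"}
worker0, worker1 = ToyWorker("worker0", owners), ToyWorker("worker1", owners)
# worker1 computed a gradient for "W", which worker0 owns, so it sends it over
accepted = worker0.receive_gradient(
    "W", grad=2.0, version=worker1.versions["W"], peers=[worker1]
)
# A second gradient computed against the old version 0 arrives too late
stale = worker0.receive_gradient("W", grad=2.0, version=0, peers=[worker1])
print(accepted, stale, worker1.params["W"], worker1.versions["W"])
```

The first gradient is applied and broadcast, while the second one refers to an outdated version and is dropped by the owner.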
 | 
						||
This training procedure is **asynchronous** and **non-blocking**. Workers always
push their gradient increments and parameter updates; they never have to pull
them and block on the result, so the transfers can happen in the background,
 | 
						||
overlapped with the actual training work. The workers also do not have to stop
 | 
						||
and wait for each other ("synchronize") at the start of each batch. This is very
 | 
						||
useful for spaCy, because spaCy is often trained on long documents, which means
 | 
						||
**batches can vary in size** significantly. Uneven workloads make synchronous
 | 
						||
gradient descent inefficient, because if one batch is slow, all of the other
 | 
						||
workers are stuck waiting for it to complete before they can continue.
 | 
						||
 | 
						||
## Internal training API {id="api"}
 | 
						||
 | 
						||
<Infobox variant="danger">
 | 
						||
 | 
						||
spaCy gives you full control over the training loop. However, for most use
 | 
						||
cases, it's recommended to train your pipelines via the
 | 
						||
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
 | 
						||
track of your settings and hyperparameters, instead of writing your own training
 | 
						||
scripts from scratch. [Custom registered functions](#custom-code) should
 | 
						||
typically give you everything you need to train fully custom pipelines with
 | 
						||
[`spacy train`](/api/cli#train).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Training from a Python script {id="api-train",version="3.2"}
 | 
						||
 | 
						||
If you want to run the training from a Python script instead of using the
 | 
						||
[`spacy train`](/api/cli#train) CLI command, you can call into the
 | 
						||
[`train`](/api/cli#train-function) helper function directly. It takes the path
 | 
						||
to the config file, an optional output directory and an optional dictionary of
 | 
						||
[config overrides](#config-overrides).
 | 
						||
 | 
						||
```python
 | 
						||
from spacy.cli.train import train
 | 
						||
 | 
						||
train("./config.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})
 | 
						||
```
 | 
						||
 | 
						||
### Internal training loop API {id="api-loop"}
 | 
						||
 | 
						||
<Infobox variant="warning">
 | 
						||
 | 
						||
This section documents how the training loop and updates to the `nlp` object
 | 
						||
work internally. You typically shouldn't have to implement this in Python unless
 | 
						||
you're writing your own trainable components. To train a pipeline, use
 | 
						||
[`spacy train`](/api/cli#train) or the [`train`](/api/cli#train-function) helper
 | 
						||
function instead.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
The [`Example`](/api/example) object contains annotated training data, also
 | 
						||
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
 | 
						||
that will hold the predictions, and another `Doc` object that holds the
 | 
						||
gold-standard annotations. It also includes the **alignment** between those two
 | 
						||
documents if they differ in tokenization. The `Example` class ensures that spaCy
 | 
						||
can rely on one **standardized format** that's passed through the pipeline. For
 | 
						||
instance, let's say we want to define gold-standard part-of-speech tags:
 | 
						||
 | 
						||
```python
 | 
						||
import numpy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
 | 
						||
```
 | 
						||
 | 
						||
As this is quite verbose, there's an alternative way to create the reference
`Doc` with the gold-standard annotations.
[`Example.from_dict`](/api/example#from_dict) takes the predicted `Doc` and a
dictionary of annotations, with keys such as `tags` or `entities`. Using the
resulting `Example` object and its gold-standard annotations, the model can be
updated to learn a sentence of three words with their assigned part-of-speech
tags.

```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```

Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo): `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
a token inside an entity and `L` the last token of an entity.

```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

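To make the tagging scheme concrete, here is a minimal sketch that derives
BILUO tags from token-level spans. The `spans_to_biluo` helper is hypothetical
and written in plain Python for illustration. It assumes spans are given as
`(start, end, label)` token indices with an exclusive `end` and that they don't
overlap. For character offsets, spaCy provides its own utility,
`offsets_to_biluo_tags` in `spacy.training`.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans into BILUO tags.

    `start` is inclusive and `end` is exclusive, both as token indices.
    Hypothetical helper for illustration only; assumes non-overlapping spans.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A single-token entity is a "unit"
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
single = spans_to_biluo(len(words), [(0, 1, "ORG"), (2, 3, "TECHNOLOGY"), (4, 5, "DATE")])
print(single)  # ['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE']

words = ["She", "moved", "to", "New", "York", "City"]
multi = spans_to_biluo(len(words), [(3, 6, "GPE")])
print(multi)  # ['O', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE']
```

The second call shows the tags that only appear for multi-token entities: `B`
for the first token, `I` for tokens in between and `L` for the last one.
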
<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way: from a `Doc` and a dictionary of
annotations. For more details, see the
[migration guide](/usage/v3#migrating-training).

```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

</Infobox>

Of course, it's not enough to only show a model a single example once.
Especially if you only have a few examples, you'll want to train for a **number
of iterations**. At each iteration, the training data is **shuffled** to ensure
the model doesn't make any generalizations based on the order of examples.
Another technique to improve the learning results is to set a **dropout rate**,
the rate at which individual features and representations are randomly
"dropped". This makes it harder for the model to memorize the training data. For
example, a dropout rate of `0.25` means that each feature or internal
representation has a 1/4 likelihood of being dropped.

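The effect of a dropout rate can be sketched in a few lines of plain Python.
The `apply_dropout` helper below is hypothetical and only illustrates the idea
of dropping each value independently with a given probability. It is not the
implementation used by spaCy or Thinc, and it leaves out the rescaling of the
surviving values that implementations commonly apply.

```python
import random


def apply_dropout(features, rate, rng):
    """Zero out each feature independently with probability `rate`.

    Hypothetical helper for illustration only.
    """
    return [0.0 if rng.random() < rate else value for value in features]


rng = random.Random(0)  # fixed seed so the run is repeatable
features = [1.0] * 10_000
dropped = apply_dropout(features, rate=0.25, rng=rng)
fraction = dropped.count(0.0) / len(dropped)
print(f"dropped {fraction:.1%} of the features")
```

With a rate of `0.25`, roughly a quarter of the values should end up zeroed,
and a different subset is dropped on every call.
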
> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.
> - [`Example`](/api/example): Object holding predictions and gold-standard
>   annotations.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a
>   directory.

```python {title="Example training loop"}
import random
from spacy.training import Example

# assumes `nlp` is a pipeline with trainable components and `train_data` is a
# list of (text, entity_offsets) tuples
optimizer = nlp.initialize()
for itn in range(100):
    # reshuffle on every iteration so the order of examples carries no signal
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/output")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name       | Description                                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                           |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                 |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

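Since `examples` is a sequence, you can pass several examples per update
instead of one at a time. The sketch below shows the batching step on its own,
using plain strings as stand-ins for `Example` objects. The `make_batches`
helper is hypothetical and only for illustration. spaCy ships its own batching
utility, `spacy.util.minibatch`.

```python
import random


def make_batches(items, size):
    """Yield consecutive lists of up to `size` items.

    Hypothetical helper for illustration only.
    """
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Plain strings standing in for Example objects
examples = [f"example_{i}" for i in range(10)]
random.Random(0).shuffle(examples)
batches = list(make_batches(examples, size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

In a training loop, each batch would then be passed to `nlp.update` in a single
call, for example `nlp.update(batch, drop=0.25, sgd=optimizer)`.
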
<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class
and the "simple training style" of calling `nlp.update` with a text and a
dictionary of annotations. Updating your code to use the `Example` object should
be very straightforward: you can call
[`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the
dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

</Infobox>