---
title: Training Pipelines & Models
teaser: Train and update components on your own data and integrate custom models
next: /usage/layers-architectures
menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
  - ['Data Utilities', 'data']
  - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']
---

## Introduction to training {#basics hidden="true"}

import Training101 from 'usage/101/\_training.md'

<Training101 />

<Infobox title="Tip: Try the Prodigy annotation tool">

[![Prodigy: Radically efficient machine teaching](../images/prodigy.jpg)](https://prodi.gy)

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy pipelines.

</Infobox>

## Quickstart {#quickstart tag="new"}

The recommended way to train your spaCy pipelines is via the
[`spacy train`](/api/cli#train) command on the command line. It only needs a
single [`config.cfg`](#config) **configuration file** that includes all settings
and hyperparameters. You can optionally [overwrite](#config-overrides) settings
on the command line, and load in a Python file to register
[custom functions](#custom-code) and architectures. This quickstart widget helps
you generate a starter config with the **recommended settings** for your
specific use case. It's also available in spaCy as the
[`init config`](/api/cli#init-config) command.

> #### Instructions: widget
>
> 1. Select your requirements and settings.
> 2. Use the buttons at the bottom to save the result to your clipboard or a
>    file `base_config.cfg`.
> 3. Run [`init fill-config`](/api/cli#init-fill-config) to create a full
>    config.
> 4. Run [`train`](/api/cli#train) with your config and data.
>
> #### Instructions: CLI
>
> 1. Run the [`init config`](/api/cli#init-config) command and specify your
>    requirements and settings as CLI arguments.
> 2. Run [`train`](/api/cli#train) with the exported config and data.

import QuickstartTraining from 'widgets/quickstart-training.js'

<QuickstartTraining />

After you've saved the starter config to a file `base_config.cfg`, you can use
the [`init fill-config`](/api/cli#init-fill-config) command to fill in the
remaining defaults. Training configs should always be **complete and without
hidden defaults**, to keep your experiments reproducible.

```cli
$ python -m spacy init fill-config base_config.cfg config.cfg
```

> #### Tip: Debug your data
>
> The [`debug data` command](/api/cli#debug-data) lets you analyze and validate
> your training and development data, get useful stats, and find problems like
> invalid entity annotations, cyclic dependencies, low data labels and more.
>
> ```cli
> $ python -m spacy debug data config.cfg
> ```

Instead of exporting your starter config from the quickstart widget and
auto-filling it, you can also use the [`init config`](/api/cli#init-config)
command and specify your requirements and settings as CLI arguments. You can now
add your data and run [`train`](/api/cli#train) with your config. See the
[`convert`](/api/cli#convert) command for details on how to convert your data to
spaCy's binary `.spacy` format. You can either include the data paths in the
`[paths]` section of your config, or pass them in via the command line.

```cli
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

> #### Tip: Enable your GPU
>
> Use the `--gpu-id` option to select the GPU:
>
> ```cli
> $ python -m spacy train config.cfg --gpu-id 0
> ```

<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>

The recommended config settings generated by the quickstart widget and the
[`init config`](/api/cli#init-config) command are based on some general **best
practices** and things we've found to work well in our experiments. The goal is
to provide you with the most **useful defaults**.

Under the hood, the
[`quickstart_training.jinja`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training.jinja)
template defines the different combinations – for example, which parameters to
change if the pipeline should optimize for efficiency vs. accuracy. The file
[`quickstart_training_recommendations.yml`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training_recommendations.yml)
collects the recommended settings and available resources for each language
including the different transformer weights. For some languages, we include
different transformer recommendations, depending on whether you want the model
to be more efficient or more accurate. The recommendations will be **evolving**
as we run more experiments.

</Accordion>

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started is to clone a [project template](/usage/projects)
and run it – for example, this end-to-end template that lets you train a
**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
treebank.

</Project>

## Training config system {#config}

Training config files include all **settings and hyperparameters** for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
Under the hood, the training config uses the
[configuration system](https://thinc.ai/docs/usage-config) provided by our
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
  functions like [model architectures](/api/architectures),
  [optimizers](https://thinc.ai/docs/api-optimizers) or
  [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
  passed into them. You can also
  [register your own functions](#custom-functions) to define custom
  architectures or methods, reference them in your config and tweak their
  parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
  multiple components, define them once and reference them as
  [variables](#config-interpolation).
- **Reproducibility with no hidden defaults.** The config file is the "single
  source of truth" and includes all settings.
- **Automated checks and validation.** When you load a config, spaCy checks if
  the settings are complete and if all values have the correct types. This lets
  you catch potential mistakes early. In your custom architectures, you can use
  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

```ini
%%GITHUB_SPACY/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                                                                     |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names.                                           |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                                                                         |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI.          |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process.                                                                                                  |
| `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining).                                                |
| `initialize`  | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime).             |
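
To make the mapping from sections to a nested dictionary more concrete, here's a minimal sketch using only Python's standard library. It is **not** how spaCy or Thinc load configs in practice. The helper names `parse_value` and `config_to_dict` are hypothetical, and the sketch assumes that values can be read as JSON.

```python
import configparser
import json

CONFIG_TEXT = """
[training]
seed = 0
dropout = 0.1

[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
"""


def parse_value(raw):
    """Read a raw value as JSON and fall back to the plain string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def config_to_dict(text):
    """Hypothetical helper: turn dotted section names into nested dicts."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        node = result
        # "[training.batch_size]" becomes result["training"]["batch_size"]
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser.items(section):
            node[key] = parse_value(raw)
    return result


config = config_to_dict(CONFIG_TEXT)
print(config["training"]["batch_size"])
```

The subsection ends up as a dictionary nested under `training`, and the `@schedules` key is kept as a regular entry that a later resolution step can look up.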

<Infobox title="Config format and settings" emoji="📖">

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
[schedules](https://thinc.ai/docs/api-schedules).

</Infobox>

<YouTube id="BWhh3r6W-qE"></YouTube>

### Config lifecycle at runtime and training {#config-lifecycle}

A pipeline's `config.cfg` is considered the "single source of truth", both at
**training** and **runtime**. Under the hood,
[`Language.from_config`](/api/language#from_config) takes care of constructing
the `nlp` object using the settings defined in the config. An `nlp` object's
config is available as [`nlp.config`](/api/language#config) and it includes all
information about the pipeline, as well as the settings used to train and
initialize it.

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

At runtime spaCy will only use the `[nlp]` and `[components]` blocks of the
config and load all data, including tokenization rules, model weights and other
resources from the pipeline directory. The `[training]` block contains the
settings for training the model and is only used during training. Similarly, the
`[initialize]` block defines how the initial `nlp` object should be set up
before training and whether it should be initialized with vectors or pretrained
tok2vec weights, or any other data needed by the components.

The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config – but without
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.

### Overwriting config settings on the command line {#config-overrides}

The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--paths.train ./corpus/train.spacy` sets the `train` value in the `[paths]`
block.

```cli
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128
```

Only existing sections and values in the config can be overwritten. At the end
of the training, the final filled `config.cfg` is exported with your pipeline,
so you'll always have a record of the settings that were used, including your
overrides. Overrides are added before [variables](#config-interpolation) are
resolved, by the way – so if you need to use a value in multiple places,
reference it across your config and override it on the CLI once.
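
As a rough illustration of how dotted override names map onto config sections, here's a minimal plain-Python sketch. The helpers `parse_overrides` and `apply_overrides` are hypothetical and not part of spaCy's API. The sketch assumes that every option is followed by exactly one value and that values can be read as JSON.

```python
import json


def parse_overrides(args):
    """Turn ["--paths.train", "./train.spacy"] into {"paths.train": "./train.spacy"}."""
    overrides = {}
    items = list(args)
    while items:
        flag = items.pop(0)
        if not flag.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {flag!r}")
        if not items:
            raise ValueError(f"Missing value for {flag}")
        raw = items.pop(0)
        try:
            value = json.loads(raw)  # numbers, booleans, null, quoted strings
        except json.JSONDecodeError:
            value = raw  # anything else stays a plain string
        overrides[flag[2:]] = value
    return overrides


def apply_overrides(config, overrides):
    """Overwrite existing values only and reject unknown sections or keys."""
    for dotted, value in overrides.items():
        *parents, key = dotted.split(".")
        node = config
        for part in parents:
            if not isinstance(node.get(part), dict):
                raise KeyError(f"Unknown config section in override: {dotted}")
            node = node[part]
        if key not in node:
            raise KeyError(f"Unknown config value in override: {dotted}")
        node[key] = value
    return config


config = {"paths": {"train": None, "dev": None}, "training": {"batch_size": 1000}}
args = ["--paths.train", "./corpus/train.spacy", "--training.batch_size", "128"]
apply_overrides(config, parse_overrides(args))
print(config)
```

Rejecting unknown keys mirrors the rule that only existing sections and values can be overwritten, which helps catch typos in option names early.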

> #### 💡 Tip: Verbose logging
>
> If you're using config overrides, you can set the `--verbose` flag on
> [`spacy train`](/api/cli#train) to make spaCy log more info, including which
> overrides were set via the CLI and environment variables.

#### Adding overrides via environment variables {#config-overrides-env}

Instead of defining the overrides as CLI arguments, you can also use the
`SPACY_CONFIG_OVERRIDES` environment variable using the same argument syntax.
This is especially useful if you're training models as part of an automated
process. Environment variables **take precedence** over CLI overrides and values
defined in the config file.

```cli
$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.batch_size 128" ./your_script.sh
```
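
The precedence rule can be sketched in a few lines of plain Python. This is only an illustration of the behavior described above, not spaCy's implementation. `parse_pairs` and `collect_overrides` are hypothetical helpers, and the environment is passed in as a regular dictionary so the example doesn't depend on your shell.

```python
import json
import shlex

ENV_VAR = "SPACY_CONFIG_OVERRIDES"


def parse_pairs(args):
    """Pair up ["--section.key", "value", ...] into {"section.key": value}."""
    overrides = {}
    for flag, raw in zip(args[::2], args[1::2]):
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            value = raw
        overrides[flag[2:]] = value
    return overrides


def collect_overrides(cli_args, environ):
    """Merge CLI and environment overrides, letting the environment win."""
    env_args = shlex.split(environ.get(ENV_VAR, ""))
    # Later keys win in a dict merge, so env values replace CLI values.
    return {**parse_pairs(cli_args), **parse_pairs(env_args)}


environ = {ENV_VAR: "--system.gpu_allocator pytorch --training.batch_size 128"}
cli_args = ["--training.batch_size", "64", "--paths.train", "./train.spacy"]
merged = collect_overrides(cli_args, environ)
print(merged)
```

In a real script, you would pass `os.environ` instead of the hand-written dictionary. Here, the batch size from the environment variable replaces the one given on the command line, while the other settings are kept from both sources.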

### Reading from standard input {#config-stdin}

Setting the config path to `-` on the command line lets you read the config from
standard input and pipe it forward from a different process, like
[`init config`](/api/cli#init-config) or your own custom script. This is
especially useful for quick experiments, as it lets you generate a config on the
fly without having to save to and load from disk.

> #### 💡 Tip: Writing to stdout
>
> When you run `init config`, you can set the output path to `-` to write to
> stdout. In a custom script, you can print the string config, e.g.
> `print(nlp.config.to_str())`.

```cli
$ python -m spacy init config - --lang en --pipeline ner,textcat --optimize accuracy | python -m spacy train - --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

<!-- TODO: add reference to Prodigy's commands once Prodigy nightly is available -->

### Using variable interpolation {#config-interpolation}

Another very useful feature of the config system is that it supports variable
interpolation for both **values and sections**. This means that you only need to
define a setting once and can reference it across your config using the
`${section.value}` syntax. In this example, the value of `seed` is reused within
the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`.

```ini
### config.cfg (excerpt) {highlight="5,18"}
[system]
seed = 0

[training]
seed = ${system.seed}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8

[pretraining]
optimizer = ${training.optimizer}
```

You can also use variables inside strings. In that case, it works just like
f-strings in Python. If the value of a variable is not a string, it's converted
to a string.

```ini
[paths]
version = 5
root = "/Users/you/data"
train = "${paths.root}/train_${paths.version}.spacy"
# Result: /Users/you/data/train_5.spacy
```
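
To show what happens when variables are resolved, here's a minimal plain-Python sketch that operates on an already parsed config dictionary. The helpers `lookup` and `interpolate` are hypothetical, and the sketch doesn't try to detect circular references. It's meant to illustrate the idea, not to mirror the actual implementation.

```python
import re

VARIABLE = re.compile(r"\$\{([\w.]+)\}")


def lookup(config, dotted):
    """Follow a dotted path like "paths.root" through nested dicts."""
    node = config
    for part in dotted.split("."):
        node = node[part]
    return node


def interpolate(config, value):
    """Resolve ${section.value} references in a single config value."""
    if not isinstance(value, str):
        return value
    whole = VARIABLE.fullmatch(value)
    if whole:
        # The value is only a reference, so the original type is kept.
        return interpolate(config, lookup(config, whole.group(1)))
    # The reference is part of a larger string, so it's converted to a string.
    return VARIABLE.sub(
        lambda match: str(interpolate(config, lookup(config, match.group(1)))),
        value,
    )


config = {
    "system": {"seed": 0},
    "training": {"seed": "${system.seed}"},
    "paths": {
        "version": 5,
        "root": "/Users/you/data",
        "train": "${paths.root}/train_${paths.version}.spacy",
    },
}
train_path = interpolate(config, config["paths"]["train"])
seed = interpolate(config, config["training"]["seed"])
print(train_path, seed)
```

Note the difference between the two cases: a value that consists of a single reference keeps its type, so `seed` stays an integer, whereas `version` is converted to a string because it's embedded in a path.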

<Infobox title="Tip: Override variables on the CLI" emoji="💡">

If you need to change certain values between training runs, you can define them
once, reference them as variables and then [override](#config-overrides) them on
the CLI. For example, `--paths.root /other/root` will change the value of `root`
in the block `[paths]` and the change will be reflected across all other values
that reference this variable.

</Infobox>

## Customizing the pipeline and training {#config-custom}

### Defining pipeline components {#config-components}

You typically train a [pipeline](/usage/processing-pipelines) of **one or more
components**. The `[components]` block in the config defines the available
pipeline components and how they should be created – either by a built-in or
custom [factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
trained pipeline. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline. There are different ways you might want to treat
your components during training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **trained component** with more examples.
3. Include an existing trained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing trained pipeline, with its
existing weights. This lets you include an already trained component in your
pipeline, or update a trained component with more data specific to your use
case.

```ini
### config.cfg (excerpt)
[components]

# "parser" and "ner" are sourced from a trained pipeline
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights. For sourced components, spaCy will keep the existing
weights and [resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained pipeline
as-is. They are also excluded when calling
[`nlp.initialize`](/api/language#initialize).

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during training and evaluation. This is very important, because
> they may still impact your model's performance – for instance, a sentence
> boundary detector can impact what the parser or entity recognizer considers a
> valid parse. So the evaluation results should always reflect what your
> pipeline will produce at runtime.

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

<Infobox variant="warning" title="Shared Tok2Vec listener layer" id="config-components-listeners">

When the components in your pipeline
[share an embedding layer](/usage/embeddings-transformers#embedding-layers), the
**performance** of your frozen component will be **degraded** if you continue
training other layers with the same underlying `Tok2Vec` instance. As a rule of
thumb, ensure that your frozen components are truly **independent** in the
pipeline.

To automatically replace a shared token-to-vector listener with an independent
copy of the token-to-vector layer, you can use the `replace_listeners` setting
of a sourced component, pointing to the listener layer(s) in the config. For
more details on how this works under the hood, see
[`Language.replace_listeners`](/api/language#replace_listeners).

```ini
[training]
frozen_components = ["tagger"]

[components.tagger]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
```

</Infobox>

### Using registered functions {#config-functions}

The training configuration defined in the config file doesn't have to only
consist of static values. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick (see
[Smith et al., 2017](https://arxiv.org/abs/1711.00489)).

```ini
### With static value
[training]
batch_size = 128
```

To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax to specify the function and its arguments – in
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
defined in the [function registry](/api/top-level#registry). All other values
defined in the block are passed to the function as keyword arguments when it's
initialized. You can also use this mechanism to register
[custom implementations and architectures](#custom-functions) and reference them
from your configs.

> #### How the config is resolved
>
> The config file is parsed into a regular dictionary and is resolved and
> validated **bottom-up**. Arguments provided for registered functions are
> checked against the function's signature and type annotations. The return
> value of a registered function can also be passed into another function – for
> instance, a learning rate schedule can be provided as an argument of an
> optimizer.

```ini
### With registered function
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```
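
The following sketch illustrates, in plain Python, what resolving such a block roughly amounts to: nested blocks are resolved first, then the function named by the `@` key is looked up and called with the remaining values as keyword arguments. The `REGISTRY` dictionary, the `register` decorator and the `resolve` function are hypothetical stand-ins for the real function registry, and the `compounding` generator is a simplified re-implementation written for this example.

```python
import itertools

REGISTRY = {"schedules": {}}


def register(registry_name, func_name):
    """Hypothetical decorator that stores a function under a string name."""
    def decorator(func):
        REGISTRY[registry_name][func_name] = func
        return func
    return decorator


@register("schedules", "compounding.v1")
def compounding(start, stop, compound):
    """Yield values that grow by a constant factor and are capped at `stop`."""
    current = float(start)
    while True:
        yield min(current, stop) if start <= stop else max(current, stop)
        current *= compound


def resolve(block):
    """Resolve nested blocks first (bottom-up), then call the referenced function."""
    if not isinstance(block, dict):
        return block
    resolved = {key: resolve(value) for key, value in block.items()}
    refs = [key for key in resolved if key.startswith("@")]
    if not refs:
        return resolved
    registry_name = refs[0][1:]  # "@schedules" refers to the "schedules" registry
    func = REGISTRY[registry_name][resolved.pop(refs[0])]
    return func(**resolved)  # remaining values become keyword arguments


training = resolve({
    "batch_size": {
        "@schedules": "compounding.v1",
        "start": 100,
        "stop": 1000,
        "compound": 1.001,
    }
})
first_sizes = list(itertools.islice(training["batch_size"], 3))
print(first_sizes)
```

After resolution, `training["batch_size"]` is no longer a dictionary of settings but the generator returned by the schedule function. The batch size starts at `100`, is multiplied by `1.001` at each step and never exceeds `1000`.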

### Model architectures {#model-architectures}

> #### 💡 Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like ~~Model[List[Doc],
> List[Floats2d]]~~. This so-called generic type describes the layer and its
> input and output type – in this case, it takes a list of `Doc` objects as the
> input and list of 2-dimensional arrays of floats as the output. You can read
> more about defining Thinc models [here](https://thinc.ai/docs/usage-models).
> Also see the guide on [type checking](https://thinc.ai/docs/usage-type-checking) for
> how to enable linting in your editor to see live feedback if your inputs and
> outputs don't match.

A **model architecture** is a function that wires up a Thinc
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
component or as a layer of a larger network. You can use Thinc as a thin
[wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as
PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc
[directly](https://thinc.ai/docs/usage-models). For more details and examples,
see the usage guide on [layers and architectures](/usage/layers-architectures).

spaCy's built-in components will never construct their `Model` instances
themselves, so you won't have to subclass the component to change its model
architecture. You can just **update the config** so that it refers to a
different registered function. Once the component has been created, its `Model`
instance has already been assigned, so you cannot change its model architecture.
The architecture is like a recipe for the network, and you can't change the
recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:

| Architecture                                                      | Description                                                                                                                                                                                                                                               |
| ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN)                   | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~                                                                                    |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~  |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble)             | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~                                                   |

### Metrics, training output and weighted scores {#metrics}

When you train a pipeline using the [`spacy train`](/api/cli#train) command,
you'll see a table showing the metrics after each pass over the data. The
available metrics **depend on the pipeline components**. Pipeline components
also define which scores are shown and how they should be **weighted in the
final score** used to select the best model.

The `training.score_weights` setting in your `config.cfg` lets you customize the
scores shown in the table and how they should be weighted. In this example, the
labeled dependency accuracy and NER F-score count towards the final score with
40% each and the tagging accuracy makes up the remaining 20%. The tokenization
accuracy and speed are both shown in the table, but not counted towards the
score.

> #### Why do I need score weights?
>
> At the end of your training process, you typically want to select the **best
> model** – but what "best" means depends on the available components and your
> specific use case. For instance, you may prefer a pipeline with higher NER and
> lower POS tagging accuracy over a pipeline with lower NER and higher POS
> accuracy. You can express this preference in the score weights, e.g. by
> assigning `ents_f` (NER F-score) a higher weight.

```ini
[training.score_weights]
dep_las = 0.4
dep_uas = null
ents_f = 0.4
tag_acc = 0.2
token_acc = 0.0
speed = 0.0
```

The `score_weights` don't _have to_ sum to `1.0` – but it's recommended. When
you generate a config for a given pipeline, the score weights are generated by
combining and normalizing the default score weights of the pipeline components.
The default score weights are defined by each pipeline component via the
`default_score_weights` setting on the
[`@Language.factory`](/api/language#factory) decorator. By default, all pipeline
components are weighted equally. If a score weight is set to `null`, it will be
excluded from the logs and the score won't be weighted.
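
As a worked example, the final score for the weights above can be computed as a weighted sum. This is a minimal sketch: the `weighted_score` helper is hypothetical, and the evaluation scores are made-up numbers chosen only to illustrate the arithmetic.

```python
def weighted_score(scores, weights):
    """Sum each score multiplied by its weight, skipping weights set to None."""
    return sum(
        scores.get(name, 0.0) * weight
        for name, weight in weights.items()
        if weight is not None
    )


# Mirrors the [training.score_weights] block above (null becomes None)
weights = {
    "dep_las": 0.4,
    "dep_uas": None,
    "ents_f": 0.4,
    "tag_acc": 0.2,
    "token_acc": 0.0,
    "speed": 0.0,
}
# Hypothetical evaluation results for one training step
scores = {
    "dep_las": 0.80,
    "dep_uas": 0.85,
    "ents_f": 0.75,
    "tag_acc": 0.95,
    "token_acc": 1.0,
    "speed": 12000.0,
}
final = weighted_score(scores, weights)
print(round(final, 4))
```

With these numbers, the final score works out to `0.4 * 0.80 + 0.4 * 0.75 + 0.2 * 0.95`, which is about `0.81`. The speed and tokenization accuracy have a weight of `0.0`, so they don't change the result no matter how large they are.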

<Accordion title="Understanding the training output and score types" spaced>

| Name                       | Description                                                                                                             |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Loss**                   | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`.      |
| **Precision** (P)          | Percentage of predicted annotations that were correct. Should increase.                                                 |
| **Recall** (R)             | Percentage of reference annotations recovered. Should increase.                                                         |
| **F-Score** (F)            | Harmonic mean of precision and recall. Should increase.                                                                 |
| **UAS** / **LAS**          | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable.                                                               |

Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization
errors are **excluded from the NER evaluation**. If your tokenization makes it
impossible for the model to predict 50% of your entities, your NER F-score might
still look good.

</Accordion>
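
To make the relationship between precision, recall and F-score more concrete, here's a small plain-Python sketch that compares predicted and reference entity spans. The `prf` helper and the example spans are hypothetical, and the function returns fractions between `0` and `1` rather than percentages.

```python
def prf(predicted, reference):
    """Compute precision, recall and F-score for two collections of annotations."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Each span is (start token, end token, label)
reference = {(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "GPE")}
predicted = {(0, 2, "PERSON"), (5, 6, "GPE")}
precision, recall, f_score = prf(predicted, reference)
print(precision, recall, f_score)
```

Only one of the two predictions matches a reference span exactly, so the precision is one half. Since only one of the three reference spans was recovered, the recall is one third, and the F-score ends up between the two.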

## Custom functions {#custom-functions}

Registered functions in the training config files can refer to built-in
implementations, but you can also plug in fully **custom implementations**. All
you need to do is register your function using the `@spacy.registry` decorator
with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.

Each custom function can have any number of arguments that are passed in via the
[config](#config), just like the built-in functions. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.
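
The effect of default argument values on auto-filling can be sketched with Python's `inspect` module. This is only an illustration of the idea, not spaCy's implementation. Both `fill_defaults` and the example function `my_schedule` are hypothetical.

```python
import inspect


def my_schedule(start: float, factor: float = 1.005):
    """Hypothetical custom function: `start` is required, `factor` has a default."""
    current = start
    while True:
        yield current
        current *= factor


def fill_defaults(func, block):
    """Add missing arguments that have defaults and report the ones that don't."""
    filled = dict(block)
    missing = []
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue
        if param.default is inspect.Parameter.empty:
            missing.append(name)  # no default, so it must be set explicitly
        else:
            filled[name] = param.default
    if missing:
        raise ValueError(f"Missing required settings: {missing}")
    return filled


filled = fill_defaults(my_schedule, {"start": 1.0})
print(filled)
```

Because `factor` defines a default, it can be added to the block automatically. Leaving out `start` raises an error instead, which is the behavior you want for settings that should always be set explicitly in the config.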

### Training with custom code {#custom-code}

> ```cli
> ### Training
> $ python -m spacy train config.cfg --code functions.py
> ```
>
> ```cli
> ### Packaging
> $ python -m spacy package ./model-best ./packages --code functions.py
> ```

The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
that can then be referenced from your `config.cfg`. This lets you train spaCy
pipelines with custom components, without having to re-implement the whole
training workflow. When you package your trained pipeline later using
[`spacy package`](/api/cli#package), you can provide one or more Python files to
be included in the package and imported in its `__init__.py`. This means that
any custom architectures, functions or
[components](/usage/processing-pipelines#custom-components) will be shipped with
your pipeline and registered when it's loaded. See the documentation on
[saving and loading pipelines](/usage/saving-loading#models-custom) for details.

#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}

For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch – it's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide five optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:

| Callback                      | Description                                                                                                                                                                                                                |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp.before_creation`         | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults) aside from the tokenizer settings. |
| `nlp.after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object.                                                                                |
| `nlp.after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components.                                                                                  |
| `initialize.before_init`      | Called before the pipeline components are initialized and receives the `nlp` object for in-place modification. Useful for modifying the tokenizer settings, similar to the v2 base model option.                           |
| `initialize.after_init`       | Called after the pipeline components are initialized and receives the `nlp` object for in-place modification.                                                                                                              |

The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a custom
stop word to the defaults:

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```

```python
### functions.py {highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.add("good")
        return lang_cls

    return customize_language_data
```

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback** – it's
not the callback itself.

</Infobox>

Any registered function – in this case `create_callback` – can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```

```python
### functions.py {highlight="5,7-9"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data
```
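
To make the relationship between the config block and the registered function
more concrete, here is a toy resolver written in plain Python. It assumes a much
simpler mechanism than spaCy's real registry and config system: `CALLBACKS`,
`register`, `resolve_block` and `FakeLanguage` are all hypothetical stand-ins.
The point is only that the `@callbacks` key selects the registered function, the
remaining keys become keyword arguments, and the result of the call is the
callback.

```python
from typing import Any, Callable, Dict, Sequence

CALLBACKS: Dict[str, Callable] = {}  # toy stand-in for a function registry


def register(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        CALLBACKS[name] = func
        return func

    return decorator


@register("customize_language_data")
def create_callback(extra_stop_words: Sequence[str] = (), debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data


def resolve_block(block: Dict[str, Any]) -> Callable:
    """Look up the function named by "@callbacks" and call it with the other keys."""
    kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
    return CALLBACKS[block["@callbacks"]](**kwargs)


class FakeLanguage:  # stand-in for a language class with stop words
    stop_words = {"the", "a"}


block = {
    "@callbacks": "customize_language_data",
    "extra_stop_words": ["ooh", "aah"],
    "debug": True,
}
callback = resolve_block(block)  # the registered function creates the callback
callback(FakeLanguage)  # the callback modifies the class in place
print(sorted(FakeLanguage.stop_words))
```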

<Infobox title="Tip: Use Python type hints" emoji="💡">

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

</Infobox>

With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.

```cli
$ python -m spacy train config.cfg --output ./output --code ./functions.py
```

#### Example: Modifying tokenizer settings {#custom-tokenizer}

Use the `initialize.before_init` callback to modify the tokenizer settings when
training a new pipeline. Write a registered callback that modifies the tokenizer
settings and specify this callback in your config:

> #### config.cfg
>
> ```ini
> [initialize]
>
> [initialize.before_init]
> @callbacks = "customize_tokenizer"
> ```

```python
### functions.py
from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # remove a suffix
        suffixes = list(nlp.Defaults.suffixes)
        suffixes.remove("\\[")
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search

        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
    return customize_tokenizer
```

When training, provide the function above with the `--code` option:

```cli
$ python -m spacy train config.cfg --code ./functions.py
```

Because this callback is only called in the one-time initialization step before
training, the callback code does not need to be packaged with the final pipeline
package. However, to make it easier for others to replicate your training setup,
you can choose to package the initialization callbacks with the pipeline package
or to publish them separately.

<Infobox variant="warning" title="nlp.before_creation vs. initialize.before_init">

- `nlp.before_creation` is the best place to modify language defaults other than
  the tokenizer settings.
- `initialize.before_init` is the best place to modify tokenizer settings when
  training a new pipeline.

Unlike the other language defaults, the tokenizer settings are saved with the
pipeline with `nlp.to_disk()`, so modifications made in `nlp.before_creation`
will be clobbered by the saved settings when the trained pipeline is loaded from
disk.

</Infobox>

#### Example: Custom logging function {#custom-logging}

During training, the results of each step are passed to a logger function. By
default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
[`WandbLogger`](/api/top-level#WandbLogger). On each step, the logger function
receives a **dictionary** with the following keys:

| Key            | Value                                                                                                 |
| -------------- | ----------------------------------------------------------------------------------------------------- |
| `epoch`        | How many passes over the data have been completed. ~~int~~                                            |
| `step`         | How many steps have been completed. ~~int~~                                                           |
| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                           |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~                |
| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                        |
| `checkpoints`  | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |

You can easily implement and plug in your own logger that records the training
results in a custom way, or sends them to an experiment management tracker of
your choice. In this example, the function `my_custom_logger.v1` writes the
tabular results to a file:

> ```ini
> ### config.cfg (excerpt)
> [training.logger]
> @loggers = "my_custom_logger.v1"
> log_path = "my_file.tab"
> ```

```python
### functions.py
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path

@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO=sys.stdout,
        stderr: IO=sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        log_file.write("step\t")
        log_file.write("score\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\t")
        log_file.write("\n")

        def log_step(info: Optional[Dict[str, Any]]):
            if info:
                log_file.write(f"{info['step']}\t")
                log_file.write(f"{info['score']}\t")
                for pipe in nlp.pipe_names:
                    log_file.write(f"{info['losses'][pipe]}\t")
                log_file.write("\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger
```
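
To see what a single logged row could look like, the sketch below formats one
`info` dictionary with the keys from the table above into a tab-separated line.
The values and the component names `tagger` and `ner` are made up for
illustration, and `format_row` is a hypothetical helper. Using `.get()` for the
losses is a design choice in this sketch, so that a component without a recorded
loss doesn't raise a `KeyError`.

```python
from typing import Any, Dict, List


def format_row(info: Dict[str, Any], pipe_names: List[str]) -> str:
    """Format one training step as a tab-separated row."""
    cells = [str(info["step"]), str(info["score"])]
    # Fall back to 0.0 if a component has no loss recorded for this step
    cells.extend(str(info["losses"].get(name, 0.0)) for name in pipe_names)
    return "\t".join(cells) + "\n"


# Made-up example values following the documented keys
info = {
    "epoch": 1,
    "step": 200,
    "score": 0.82,
    "other_scores": {"tag_acc": 0.91},
    "losses": {"tagger": 12.5},
    "checkpoints": [(0.82, 200)],
}
print(format_row(info, ["tagger", "ner"]), end="")
```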

#### Example: Custom batch size schedule {#custom-code-schedule}

You can also implement your own batch size schedule to use during training. The
`@spacy.registry.schedules` decorator lets you register that function in the
`schedules` [registry](/api/top-level#registry) and assign it a string name:

> #### Why the version in the name?
>
> A big benefit of the config system is that it makes your experiments
> reproducible. We recommend versioning the functions you register, especially
> if you expect them to change (like a new model architecture). This way, you
> know that a config referencing `v1` means a different function than a config
> referencing `v2`.

```python
### functions.py
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    while True:
        yield start
        start = start * factor
```

In your config, you can now reference the schedule in the
`[training.batch_size]` block via `@schedules`. If a block contains a key
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config.

```ini
### config.cfg (excerpt)
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
```
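
The mapping from config block to function call can be sketched in plain Python:
the settings other than `@schedules` are passed as keyword arguments, and the
resulting generator is consumed one value at a time. The registry decorator is
left out here so the snippet runs on its own. Note that with a float `factor`,
the yielded values are floats after the first step.

```python
from itertools import islice
from typing import Iterator


def my_custom_schedule(start: int = 1, factor: float = 1.001) -> Iterator[float]:
    while True:
        yield start
        start = start * factor


# Settings from the [training.batch_size] block, minus the "@schedules" key
settings = {"start": 2, "factor": 1.005}
schedule = my_custom_schedule(**settings)
first_values = list(islice(schedule, 3))
print(first_values)
```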

### Defining custom architectures {#custom-architectures}

Built-in pipeline components such as the tagger or named entity recognizer are
constructed with default neural network [models](/api/architectures). You can
change the model architecture entirely by implementing your own custom models
and providing those in the config when creating the pipeline component. See the
documentation on [layers and model architectures](/usage/layers-architectures)
for more details.

> ```ini
> ### config.cfg
> [components.tagger]
> factory = "tagger"
>
> [components.tagger.model]
> @architectures = "custom_neural_network.v1"
> output_width = 512
> ```

```python
### functions.py
from typing import List
from thinc.types import Floats2d
from thinc.api import Model
import spacy
from spacy.tokens import Doc

@spacy.registry.architectures("custom_neural_network.v1")
def custom_neural_network(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    # create_model is a placeholder for your own code that builds the Thinc model
    return create_model(output_width)
```

## Customizing the initialization {#initialization}

When you start training a new model from scratch,
[`spacy train`](/api/cli#train) will call
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and load
the required data. All settings for this are defined in the
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
you can keep track of how the initial `nlp` object was created. The
initialization process typically includes the following:

> #### config.cfg (excerpt)
>
> ```ini
> [initialize]
> vectors = ${paths.vectors}
> init_tok2vec = ${paths.init_tok2vec}
>
> [initialize.components]
> # Settings for components
> ```

1. Load in **data resources** defined in the `[initialize]` config, including
   **word vectors** and
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
   weights**.
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
   access the training data, the current `nlp` object and any **custom
   arguments** defined in the `[initialize]` config.
3. In **pipeline components**: if needed, use the data to
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
   set up the label scheme if no labels are provided. Components may also load
   other data like lookup tables or dictionaries.

The initialization step allows the config to define **all settings** required
for the pipeline, while keeping a separation between settings and functions that
should only be used **before training** to set up the initial pipeline, and
logic and configuration that needs to be available **at runtime**. Without that
separation, it would be very difficult to use the same, reproducible config file
because the component settings required for training (load data from an external
file) wouldn't match the component settings required at runtime (load what's
included with the saved `nlp` object and don't depend on an external file).

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

<Infobox title="How components save and load data" emoji="📖">

For details and examples of how pipeline components can **save and load data
assets** like model weights or lookup tables, and how the component
initialization is implemented under the hood, see the usage guide on
[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).

</Infobox>

#### Initializing labels {#initialization-labels}

Built-in pipeline components like the
[`EntityRecognizer`](/api/entityrecognizer) or
[`DependencyParser`](/api/dependencyparser) need to know their available labels
and associated internal meta information to initialize their model weights.
Using the `get_examples` callback provided on initialization, they're able to
**read the labels off the training data** automatically, which is very
convenient – but it can also slow down the training process to compute this
information on every run.

The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
files containing the label data for all supported components. You can then pass
in the labels in the `[initialize]` settings for the respective components to
allow them to initialize faster.

> #### config.cfg
>
> ```ini
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```

```cli
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
```

Under the hood, the command delegates to the `label_data` property of the
pipeline components, for instance
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).

<Infobox variant="warning" title="Important note">

The JSON format differs for each component and some components need additional
meta information about their labels. The format exported by
[`init labels`](/api/cli#init-labels) matches what the components need, so you
should always let spaCy **auto-generate the labels** for you.

</Infobox>

## Data utilities {#data}

spaCy includes various features and utilities to make it easy to train models
using your own data, manage training and evaluation corpora, convert existing
annotations and configure data augmentation strategies for more robust models.

### Converting existing corpora and annotations {#data-convert}

If you have training data in a standard format like `.conll` or `.conllu`, the
easiest way to convert it for use with spaCy is to run
[`spacy convert`](/api/cli#convert) and pass it a file and an output directory.
By default, the command will pick the converter based on the file extension.

```cli
$ python -m spacy convert ./train.gold.conll ./corpus
```

> #### 💡 Tip: Converting from Prodigy
>
> If you're using the [Prodigy](https://prodi.gy) annotation tool to create
> training data, you can run the
> [`data-to-spacy` command](https://prodi.gy/docs/recipes#data-to-spacy) to
> merge and export multiple datasets for use with
> [`spacy train`](/api/cli#train). Different types of annotations on the same
> text will be combined, giving you one corpus to train multiple components.

<Infobox title="Tip: Manage multi-step workflows with projects" emoji="💡">

Training workflows often consist of multiple steps, from preprocessing the data
all the way to packaging and deploying the trained model.
[spaCy projects](/usage/projects) let you define all steps in one file, manage
data assets, track changes and share your end-to-end processes with your team.

</Infobox>

The binary `.spacy` format is a serialized [`DocBin`](/api/docbin) containing
one or more [`Doc`](/api/doc) objects. It's extremely **efficient in storage**,
especially when packing multiple documents together. You can also create `Doc`
objects manually, so you can write your own custom logic to convert and store
existing annotations for use in spaCy.

```python
### Training data from Doc objects {highlight="6-9"}
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
docbin = DocBin()
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, True, False]
ents = ["B-ORG", "O", "O", "O", "O", "B-GPE", "O", "O"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)
docbin.add(doc)
docbin.to_disk("./train.spacy")
```
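
If your existing annotations are stored as character offsets rather than
per-token tags, you first need to map them onto tokens. The sketch below is one
simple way to produce the per-token `B-`/`I-`/`O` tags used for `ents` in the
example above, assuming each span lines up exactly with token boundaries.
`offsets_to_bio` is a hypothetical helper, not part of spaCy.

```python
from typing import List, Tuple


def offsets_to_bio(
    words: List[str], spaces: List[bool], spans: List[Tuple[int, int, str]]
) -> List[str]:
    """Convert (start_char, end_char, label) spans to per-token BIO tags."""
    # Compute the character offsets of each token from the words and spaces
    offsets = []
    position = 0
    for word, space in zip(words, spaces):
        offsets.append((position, position + len(word)))
        position += len(word) + (1 if space else 0)
    tags = ["O"] * len(words)
    for start, end, label in spans:
        # Tokens that are only partially covered by a span are left as "O"
        covered = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        for n, i in enumerate(covered):
            tags[i] = ("B-" if n == 0 else "I-") + label
    return tags


words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, True, False]
spans = [(0, 5, "ORG"), (27, 31, "GPE")]
print(offsets_to_bio(words, spaces, spans))
```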

### Working with corpora {#data-corpora}

> #### Example
>
> ```ini
> [corpora]
>
> [corpora.train]
> @readers = "spacy.Corpus.v1"
> path = ${paths.train}
> gold_preproc = false
> max_length = 0
> limit = 0
> augmenter = null
>
> [training]
> train_corpus = "corpora.train"
> ```

The [`[corpora]`](/api/data-formats#config-corpora) block in your config lets
you define **data resources** to use for training, evaluation, pretraining or
any other custom workflows. `corpora.train` and `corpora.dev` are used as
conventions within spaCy's default configs, but you can also define any other
custom blocks. Each section in the corpora config should resolve to a
[`Corpus`](/api/corpus) – for example, using spaCy's built-in
[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
file. The `train_corpus` and `dev_corpus` fields in the
[`[training]`](/api/data-formats#config-training) block specify where to find
the corpus in your config. This makes it easy to **swap out** different corpora
by only changing a single config setting.

Instead of making `[corpora]` a block with multiple subsections for each portion
of the data, you can also use a single function that returns a dictionary of
corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
especially useful if you need to split a single file into corpora for training
and evaluation, without loading the same file twice.
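
The splitting logic for such a function might look like the following sketch.
It's written in plain Python so it runs on its own: the registry decorator is
omitted, the records are simple strings rather than `Example` objects, and
`load_records` and `create_split_corpora` are hypothetical names. The data is
loaded once and shared by both readers.

```python
import random
from typing import Callable, Dict, Iterator, List


def load_records(path: str) -> List[str]:
    # Hypothetical loader: stands in for reading and parsing a single file
    return [f"record {i} from {path}" for i in range(10)]


def create_split_corpora(
    path: str, dev_fraction: float = 0.2, seed: int = 0
) -> Dict[str, Callable[[object], Iterator[str]]]:
    records = load_records(path)  # loaded only once
    random.Random(seed).shuffle(records)
    n_dev = int(len(records) * dev_fraction)
    dev, train = records[:n_dev], records[n_dev:]

    def read_train(nlp) -> Iterator[str]:
        yield from train

    def read_dev(nlp) -> Iterator[str]:
        yield from dev

    # One reader per corpus, keyed by corpus name
    return {"train": read_train, "dev": read_dev}


corpora = create_split_corpora("data.jsonl")
print(len(list(corpora["train"](None))), len(list(corpora["dev"](None))))
```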

### Custom data reading and batching {#custom-code-readers-batchers}

Some use cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to file. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
training on such a stream, you can rely on stopping criteria such as a maximum
number of steps, or stopping once the loss no longer decreases.

In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.

> #### config.cfg
>
> ```ini
> [corpora.train]
> @readers = "corpus_variants.v1"
> source = "s3://your_bucket/path/data.csv"
> ```

```python
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.training import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream
```

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates the reader function**
– it's not the reader itself.

</Infobox>
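
The example above leaves `read_custom_data` undefined. One possible shape for
it, assuming a local CSV file with `text` and `label` columns (an assumption of
this sketch, as are the label names), is shown below. Reading from a remote
location such as the S3 path in the config would need additional code that isn't
shown here.

```python
import csv
from typing import Dict, Iterator, Tuple

LABELS = ("POSITIVE", "NEGATIVE")  # hypothetical label scheme


def read_custom_data(source: str) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (text, cats) pairs from a local CSV file with text and label columns."""
    with open(source, newline="", encoding="utf8") as file_:
        for row in csv.DictReader(file_):
            if not row["text"]:
                continue  # skip empty texts, which can't be turned into variants
            # One score per label: 1.0 for the annotated label, 0.0 otherwise
            cats = {label: float(label == row["label"]) for label in LABELS}
            yield row["text"], cats
```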

We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the same underlying raw text, to avoid duplicates within each batch. Note
that in a more realistic implementation, you'd also want to check whether the
annotations are the same.

> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```

```python
### functions.py
from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.training import Example

@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
        # Yield the remaining examples as a final, smaller batch
        if batch:
            yield batch

    return create_filtered_batches
```
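
The same batching idea can be tried out without spaCy by running it over plain
strings. This sketch is a simplified variant, not the function above: it tracks
the texts seen in the current batch in a set for faster lookups, and it yields
the final, possibly smaller batch so that no items are dropped at the end of a
finite stream. `batch_unique` is a hypothetical name.

```python
from typing import Iterable, Iterator, List


def batch_unique(texts: Iterable[str], size: int) -> Iterator[List[str]]:
    batch: List[str] = []
    seen = set()
    for text in texts:
        if text in seen:
            continue  # skip duplicates within the current batch
        seen.add(text)
        batch.append(text)
        if len(batch) == size:
            yield batch
            batch, seen = [], set()
    if batch:
        yield batch  # final partial batch


stream = ["a", "b", "a", "c", "a", "d"]
print(list(batch_unique(stream, size=3)))
```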

<!-- TODO:
* Custom corpus class
* Minibatching
-->

### Data augmentation {#data-augmentation}

Data augmentation is the process of applying small **modifications** to the
training data. It can be especially useful for punctuation and case replacement
– for example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples.

The easiest way to use data augmentation during training is to provide an
`augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
augmenter creates a data augmentation callback that uses orth-variant
replacement.

```ini
### config.cfg (excerpt) {highlight="8,14"}
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0

[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Proportion of texts that will be augmented / lowercased
level = 0.1
lower = 0.5

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
```

The `orth_variants` argument lets you pass in a dictionary of replacement rules,
typically loaded from a JSON file. There are two types of orth variant rules:
`"single"` for single tokens that should be replaced (e.g. hyphens) and
`"paired"` for pairs of tokens (e.g. quotes).

<!-- prettier-ignore -->
```json
### orth_variants.json
{
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
  "paired": [{ "tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]] }]
}
```
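
As a rough mental model of a `"single"` rule, the sketch below replaces a token
with a randomly chosen variant if its tag is listed in the rule and its text is
one of the rule's variants. This is a simplified illustration and not spaCy's
implementation, which also covers paired rules and lowercasing.
`apply_single_rules` is a hypothetical helper, and tokens are represented as
plain `(text, tag)` tuples.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (text, tag)


def apply_single_rules(
    tokens: List[Token], rules: List[Dict[str, List[str]]], rng: random.Random
) -> List[Token]:
    result = []
    for text, tag in tokens:
        for rule in rules:
            if tag in rule["tags"] and text in rule["variants"]:
                # Swap in one of the equivalent spellings (possibly the same one)
                text = rng.choice(rule["variants"])
                break
        result.append((text, tag))
    return result


rules = [{"tags": ["NFP"], "variants": ["…", "..."]}]
tokens = [("Well", "UH"), ("...", "NFP")]
print(apply_single_rules(tokens, rules, random.Random(0)))
```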

<Accordion title="Full examples for English and German" spaced>

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
```

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
```

</Accordion>

<Infobox title="Important note" variant="warning">

When adding data augmentation, keep in mind that it typically only makes sense
to apply it to the **training corpus**, not the development data.

</Infobox>

#### Writing custom data augmenters {#data-augmentation-custom}

Using the [`@spacy.registry.augmenters`](/api/top-level#registry) registry, you can also
register your own data augmentation callbacks. The callback should be a function
that takes the current `nlp` object and a training [`Example`](/api/example) and
yields `Example` objects. Keep in mind that the augmenter should yield **all
examples** you want to use in your corpus, not only the augmented examples
(unless you want to augment all examples).

Here's an example of a custom augmentation callback that produces text variants
in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
registered function takes one argument `randomize` that can be set via the
config and decides whether the uppercase/lowercase transformation is applied
randomly or not. The augmenter yields two `Example` objects: the original
example and the augmented example.

> #### config.cfg
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spongebob_augmenter.v1"
> randomize = false
> ```

```python
import spacy
import random

@spacy.registry.augmenters("spongebob_augmenter.v1")
def create_augmenter(randomize: bool = False):
    def augment(nlp, example):
        text = example.text
        if randomize:
            # Randomly uppercase/lowercase characters
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
        else:
            # Uppercase followed by lowercase
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        # Create augmented training example
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(chars))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        # Original example followed by augmented example
        yield example
        yield example.from_dict(doc, example_dict)

    return augment
```

An easy way to create modified `Example` objects is to use the
[`Example.from_dict`](/api/example#from_dict) method with a new reference
[`Doc`](/api/doc) created from the modified text. In this case, only the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.

Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.
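
Even a pure case transformation can change the text in unexpected ways: in
Python, uppercasing some characters produces more than one character (for
example, `"ß".upper()` is `"SS"`), which shifts character offsets. A small guard
like the one sketched below, using the hypothetical helpers `alternate_case` and
`keeps_length`, can be used to skip such variants. It only checks the text
length and doesn't validate token-level annotations.

```python
def alternate_case(text: str) -> str:
    """Uppercase characters at even positions and lowercase the rest."""
    return "".join(c.lower() if i % 2 else c.upper() for i, c in enumerate(text))


def keeps_length(text: str) -> bool:
    """Check that the variant has the same number of characters as the original."""
    return len(alternate_case(text)) == len(text)


for text in ["hello", "straße"]:
    print(text, alternate_case(text), keeps_length(text))
```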

## Parallel & distributed training with Ray {#parallel-training}

> #### Installation
>
> ```cli
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
> # Check that the CLI is registered
> $ python -m spacy ray --help
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process. Parallel
training won't always be faster though – it depends on your batch size, models,
and hardware.

<Infobox variant="warning">

To use Ray with spaCy, you need the
[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed.
Installing the package will automatically add the `ray` command to the spaCy
CLI.

</Infobox>

The [`spacy ray train`](/api/cli#ray-train) command follows the same API as
[`spacy train`](/api/cli#train), with a few extra options to configure the Ray
setup. You can optionally set the `--address` option to point to your Ray
cluster. If it's not set, Ray will run locally.

```cli
$ python -m spacy ray train config.cfg --n-workers 2
```

<Project id="integrations/ray">

Get started with parallel training using our project template. It trains a
simple model on a Universal Dependencies Treebank and lets you parallelize the
training with Ray.

</Project>

### How parallel training works {#parallel-training-details}

Each worker receives a shard of the **data** and builds a copy of the **model
and optimizer** from the [`config.cfg`](#config). It also has a communication
channel to **pass gradients and parameters** to the other workers. Additionally,
each worker is given ownership of a subset of the parameter arrays. Every
parameter array is owned by exactly one worker, and the workers are given a
mapping so they know which worker owns which parameter.

![Illustration of setup](../images/spacy-ray.svg)

As training proceeds, every worker will be computing gradients for **all** of
the model parameters. When they compute gradients for parameters they don't own,
they'll **send them to the worker** that does own that parameter, along with a
version identifier so that the owner can decide whether to discard the gradient.
Workers use the gradients they receive and the ones they compute locally to
update the parameters they own, and then broadcast the updated array and a new
version ID to the other workers.
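
The ownership and versioning scheme described above can be sketched with a toy
example. This is a minimal illustration under simplifying assumptions, not the
`spacy-ray` implementation: parameters are single floats, owners are assigned
round-robin, and a gradient is discarded whenever its version doesn't match the
owner's current version. `assign_owners`, `OwnedParam` and the parameter names
are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_owners(param_keys: List[str], n_workers: int) -> Dict[str, int]:
    """Give each parameter array exactly one owning worker (round-robin)."""
    return {key: i % n_workers for i, key in enumerate(param_keys)}


class OwnedParam:
    """A parameter as seen by the worker that owns it."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.version = 0
        self.pending_grad = 0.0

    def receive_gradient(self, grad: float, version: int) -> bool:
        # Assumption of this sketch: gradients computed against an outdated
        # version of the parameter are discarded
        if version != self.version:
            return False
        self.pending_grad += grad
        return True

    def update(self, learn_rate: float) -> Tuple[float, int]:
        # Apply the accumulated gradients, then return the new value and
        # version ID that would be broadcast to the other workers
        self.value -= learn_rate * self.pending_grad
        self.pending_grad = 0.0
        self.version += 1
        return self.value, self.version


owners = assign_owners(["tok2vec.W", "tagger.W", "tagger.b"], n_workers=2)
param = OwnedParam(1.0)
accepted = param.receive_gradient(0.5, version=0)  # matches the current version
value, version = param.update(learn_rate=0.1)
stale = param.receive_gradient(0.5, version=0)  # computed before the update
print(owners, accepted, round(value, 2), version, stale)
```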

This training procedure is **asynchronous** and **non-blocking**. Workers always
push their gradient increments and parameter updates. They never have to pull
them and block on the result, so the transfers can happen in the background,
overlapped with the actual training work. The workers also do not have to stop
and wait for each other ("synchronize") at the start of each batch. This is very
useful for spaCy, because spaCy is often trained on long documents, which means
**batches can vary in size** significantly. Uneven workloads make synchronous
gradient descent inefficient, because if one batch is slow, all of the other
workers are stuck waiting for it to complete before they can continue.
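
The ownership scheme can be illustrated with a toy, single-process simulation. This is only a sketch of the idea described above, not the `spacy-ray` implementation: the `Worker` class, its method names and the plain gradient-descent update are all hypothetical, each gradient is applied immediately instead of being accumulated, and "messages" are delivered by direct method calls rather than over a real communication channel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Worker:
    """Toy worker: holds a full copy of the parameters but owns only some keys."""

    name: str
    owner_of: Dict[str, str]  # parameter key -> name of the owning worker
    params: Dict[str, float] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)
    peers: List["Worker"] = field(default_factory=list)
    learn_rate: float = 0.5

    def push_gradient(self, key: str, grad: float) -> None:
        """Send a locally computed gradient to whichever worker owns `key`."""
        candidates = [self, *self.peers]
        owner = next(w for w in candidates if w.name == self.owner_of[key])
        owner.receive_gradient(key, self.versions[key], grad)

    def receive_gradient(self, key: str, version: int, grad: float) -> bool:
        """Owner side: apply the gradient unless it is based on stale parameters."""
        if version != self.versions[key]:
            return False  # computed against an outdated copy, so discard it
        self.params[key] -= self.learn_rate * grad
        self.versions[key] += 1
        for peer in self.peers:  # broadcast the updated value and new version ID
            peer.params[key] = self.params[key]
            peer.versions[key] = self.versions[key]
        return True


owner_of = {"W": "a", "b": "b"}
a = Worker("a", owner_of, {"W": 1.0, "b": 0.0}, {"W": 0, "b": 0})
b = Worker("b", owner_of, {"W": 1.0, "b": 0.0}, {"W": 0, "b": 0})
a.peers, b.peers = [b], [a]

b.push_gradient("W", 1.0)  # b does not own "W", so the gradient goes to a
stale = a.receive_gradient("W", 0, 1.0)  # version 0 is outdated by now
```

After the first push, both workers hold the updated value of `"W"` with version `1`, and the second gradient is rejected because it still refers to version `0`.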

## Internal training API {#api}

<Infobox variant="warning">

spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your pipelines via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
scripts from scratch. [Custom registered functions](#custom-code) should
typically give you everything you need to train fully custom pipelines with
[`spacy train`](/api/cli#train).

</Infobox>

The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. It also includes the **alignment** between those two
documents if they differ in tokenization. The `Example` class ensures that spaCy
can rely on one **standardized format** that's passed through the pipeline. For
instance, let's say we want to define gold-standard part-of-speech tags:

```python
import numpy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```

As this is quite verbose, there's an alternative way to create the reference
`Doc` with the gold-standard annotations. The function `Example.from_dict` takes
a dictionary with keyword arguments specifying the annotations, like `tags` or
`entities`. Using the resulting `Example` object and its gold-standard
annotations, the model can be updated to learn a sentence of three words with
their assigned part-of-speech tags.

```python
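from spacy.tokens import Doc
from spacy.training import Example

# assumes `nlp` is an existing pipeline, e.g. created with spacy.blank("en")
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})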
```

Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
a token inside an entity and `L` the last token of an entity.

```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
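
To make the scheme concrete, here's a minimal pure-Python sketch that produces BILUO tags from token-level spans. The helper `spans_to_biluo` and its `(start, end, label)` span format (token indices, end exclusive) are made up for illustration and are not part of spaCy.

```python
from typing import List, Tuple


def spans_to_biluo(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) to BILUO tags."""
    tags = ["O"] * n_tokens  # every token starts outside of any entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens in the middle
            tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
single = spans_to_biluo(len(words), [(0, 1, "ORG"), (2, 3, "TECHNOLOGY"), (4, 5, "DATE")])
# single == ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]

words = ["I", "moved", "to", "New", "York", "City"]
multi = spans_to_biluo(len(words), [(3, 6, "GPE")])
# multi == ["O", "O", "O", "B-GPE", "I-GPE", "L-GPE"]
```

The first call reproduces the tags used in the example above. The second shows how a three-token entity is split into `B`, `I` and `L` tags.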

<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way – from a `Doc` and a dictionary of
annotations. For more details, see the
[migration guide](/usage/v3#migrating-training).

```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

</Infobox>

Of course, it's not enough to only show a model a single example once.
Especially if you only have a few examples, you'll want to train for a **number of
iterations**. At each iteration, the training data is **shuffled** to ensure the
model doesn't make any generalizations based on the order of examples. Another
technique to improve the learning results is to set a **dropout rate**, a rate
at which to randomly "drop" individual features and representations. This makes
it harder for the model to memorize the training data. For example, a `0.25`
dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.
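
As a rough illustration of what a dropout rate means, the sketch below applies a random mask to a list of feature values using only the standard library. The `apply_dropout` helper is hypothetical and is not how spaCy or Thinc implement dropout internally. It simply zeroes out values, whereas real implementations typically also rescale the values that are kept.

```python
import random
from typing import List


def apply_dropout(features: List[float], rate: float, rng: random.Random) -> List[float]:
    """Zero out each feature independently with probability `rate`."""
    return [0.0 if rng.random() < rate else value for value in features]


rng = random.Random(0)  # fixed seed so the run is repeatable
features = [1.0] * 10_000
dropped = apply_dropout(features, rate=0.25, rng=rng)
fraction_dropped = dropped.count(0.0) / len(dropped)
# fraction_dropped comes out close to 0.25, i.e. roughly one feature in four
```

Because a different subset of features is dropped on every update, the model can't rely on any single feature always being present.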

> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.
> - [`Example`](/api/example): Object holding predictions and gold-standard
>   annotations.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a
>   directory.

```python
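### Example training loop
import random
from spacy.training import Example

# Assumes `nlp` is a pipeline with the components to train and `train_data`
# is a list of (text, entity offsets) tuples
optimizer = nlp.initialize()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/output")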
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name       | Description                                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                           |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                 |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |
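
Because `examples` is a sequence, a common pattern is to shuffle the data once per iteration and pass it to `nlp.update` in batches rather than one example at a time. The sketch below shows only the batching logic in plain Python: `iter_batches` is a hypothetical stand-in (spaCy provides its own `minibatch` helper in `spacy.util`), the strings stand in for `Example` objects, and `update` is a placeholder for the real `nlp.update` call.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def update(batch: List[str], drop: float) -> None:
    """Placeholder for `nlp.update(batch, drop=drop, sgd=optimizer)`."""
    seen_batch_sizes.append(len(batch))


seen_batch_sizes: List[int] = []
examples = [f"example {i}" for i in range(10)]  # stand-ins for Example objects
random.shuffle(examples)  # a new order on every iteration
for batch in iter_batches(examples, size=4):
    update(batch, drop=0.25)
# seen_batch_sizes == [4, 4, 2]
```

The last batch is smaller than the others whenever the number of examples isn't a multiple of the batch size.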

<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class
and the "simple training style" of calling `nlp.update` with a text and a
dictionary of annotations. Updating your code to use the `Example` object should
be very straightforward: you can call
[`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the
dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

</Infobox>
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								title: Training Pipelines & Models
 								teaser: Train and update components on your own data and integrate custom models
-												Update docs [ci skip]

											
										
										
											2020-08-21 17:21:55 +03:00
+								next: /usage/layers-architectures
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								menu:
-												Update v3 docs

											
										
										
											2020-07-03 17:48:21 +03:00
+								  - ['Introduction', 'basics']
-												Update docs [ci skip]

											
										
										
											2020-07-31 14:26:39 +03:00
+								  - ['Quickstart', 'quickstart']
 								  - ['Config System', 'config']
-												Update docs [ci skip]

											
										
										
											2020-09-30 16:16:00 +03:00
+								  - ['Custom Training', 'config-custom']
-												rename "custom models" to "custom functions"

											
										
										
											2020-08-19 17:53:51 +03:00
+								  - ['Custom Functions', 'custom-functions']
-												Update docs [ci skip]

											
										
										
											2020-10-04 15:14:55 +03:00
+								  - ['Initialization', 'initialization']
-												Update docs [ci skip]

											
										
										
											2020-09-30 16:16:00 +03:00
+								  - ['Data Utilities', 'data']
-												Update docs [ci skip]

											
										
										
											2020-09-13 23:30:33 +03:00
+								  - ['Parallel Training', 'parallel-training']
-												Update v3 docs

											
										
										
											2020-07-03 17:48:21 +03:00
+								  - ['Internal API', 'api']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								## Introduction to training {#basics hidden="true"}
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
 								import Training101 from 'usage/101/\_training.md'
 								<Training101 />
-												Update v3 docs

											
										
										
											2020-07-03 17:48:21 +03:00
+								<Infobox title="Tip: Try the Prodigy annotation tool">
-												Document debug-data [ci skip]

											
										
										
											2019-09-12 16:26:20 +03:00
-												Update v3 docs

											
										
										
											2020-07-03 17:48:21 +03:00
+								[![Prodigy: Radically efficient machine teaching](../images/prodigy.jpg)](https://prodi.gy)
-												Document debug-data [ci skip]

											
										
										
											2019-09-12 16:26:20 +03:00
-												Update v3 docs

											
										
										
											2020-07-03 17:48:21 +03:00
+								If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
 								new, active learning-powered annotation tool we've developed. Prodigy is fast
 								and extensible, and comes with a modern **web application** that helps you
 								collect training data faster. It integrates seamlessly with spaCy, pre-selects
 								the **most relevant examples** for annotation, and lets you train and evaluate
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								ready-to-use spaCy pipelines.
-												Document debug-data [ci skip]

											
										
										
											2019-09-12 16:26:20 +03:00
 								</Infobox>
-												Update docs [ci skip]

											
										
										
											2020-08-18 01:49:19 +03:00
+								## Quickstart {#quickstart tag="new"}
-												Update v3 docs [ci skip]

											
										
										
											2020-07-05 17:11:16 +03:00
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								The recommended way to train your spaCy pipelines is via the
-												Update docs

											
										
										
											2020-08-06 20:30:43 +03:00
+								[`spacy train`](/api/cli#train) command on the command line. It only needs a
 								single [`config.cfg`](#config) **configuration file** that includes all settings
-												small fixes

											
										
										
											2020-08-21 19:02:20 +03:00
+								and hyperparameters. You can optionally [overwrite](#config-overrides) settings
 								on the command line, and load in a Python file to register
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								[custom functions](#custom-code) and architectures. This quickstart widget helps
 								you generate a starter config with the **recommended settings** for your
 								specific use case. It's also available in spaCy as the
 								[`init config`](/api/cli#init-config) command.
-												Add table explaining training metrics [closes #2644]

											
										
										
											2019-02-25 12:03:43 +03:00
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								> #### Instructions: widget
-												Update docs [ci skip]

											
										
										
											2020-07-31 14:26:39 +03:00
+								>
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								> 1. Select your requirements and settings.
-												Update docs [ci skip]

											
										
										
											2020-07-31 14:26:39 +03:00
+								> 2. Use the buttons at the bottom to save the result to your clipboard or a
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								>    file `base_config.cfg`.
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								> 3. Run [`init fill-config`](/api/cli#init-fill-config) to create a full
 								>    config.
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								> 4. Run [`train`](/api/cli#train) with your config and data.
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								>
 								> #### Instructions: CLI
 								>
 								> 1. Run the [`init config`](/api/cli#init-config) command and specify your
 								>    requirements and settings as CLI arguments.
 								> 2. Run [`train`](/api/cli#train) with the exported config and data.
-												Add table explaining training metrics [closes #2644]

											
										
										
											2019-02-25 12:03:43 +03:00
-												Update docs [ci skip]

											
										
										
											2020-07-31 14:26:39 +03:00
+								import QuickstartTraining from 'widgets/quickstart-training.js'
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								<QuickstartTraining />
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
 								After you've saved the starter config to a file `base_config.cfg`, you can use
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								the [`init fill-config`](/api/cli#init-fill-config) command to fill in the
 								remaining defaults. Training configs should always be **complete and without
 								hidden defaults**, to keep your experiments reproducible.
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
-												Update docs [ci skip]

											
										
										
											2020-08-19 01:28:37 +03:00
+								```cli
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								$ python -m spacy init fill-config base_config.cfg config.cfg
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								```
 								> #### Tip: Debug your data
 								>
-												Update docs

											
										
										
											2020-08-06 20:30:43 +03:00
+								> The [`debug data` command](/api/cli#debug-data) lets you analyze and validate
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								> your training and development data, get useful stats, and find problems like
 								> invalid entity annotations, cyclic dependencies, low data labels and more.
 								>
-												Update docs [ci skip]

											
										
										
											2020-08-19 01:28:37 +03:00
+								> ```cli
 								> $ python -m spacy debug data config.cfg
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								> ```
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								Instead of exporting your starter config from the quickstart widget and
 								auto-filling it, you can also use the [`init config`](/api/cli#init-config)
-												small fixes

											
										
										
											2020-08-21 19:02:20 +03:00
+								command and specify your requirement and settings as CLI arguments. You can now
-												Update quickstart, template and docs

											
										
										
											2020-08-15 15:50:29 +03:00
+								add your data and run [`train`](/api/cli#train) with your config. See the
 								[`convert`](/api/cli#convert) command for details on how to convert your data to
 								spaCy's binary `.spacy` format. You can either include the data paths in the
 								`[paths]` section of your config, or pass them in via the command line.
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
-												Update docs [ci skip]

											
										
										
											2020-08-19 01:28:37 +03:00
+								```cli
-												Update docs

											
										
										
											2020-08-06 20:30:43 +03:00
+								$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
-												Add init CLI and init config (#5854)

* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
											
										
										
											2020-08-02 16:18:30 +03:00
+								```
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												Add tip about --gpu-id to training quickstart

											
										
										
											2021-02-19 16:07:51 +03:00
+								> #### Tip: Enable your GPU
 								>
 								> Use the `--gpu-id` option to select the GPU:
 								>
 								> ```cli
 								> $ python -m spacy train config.cfg --gpu-id 0
 								> ```
-												Update docs [ci skip]

											
										
										
											2020-09-20 18:44:58 +03:00
+								<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>
-												Update docs [ci skip]

											
										
										
											2020-09-14 00:09:19 +03:00
 								The recommended config settings generated by the quickstart widget and the
 								[`init config`](/api/cli#init-config) command are based on some general **best
 								practices** and things we've found to work well in our experiments. The goal is
 								to provide you with the most **useful defaults**.
 								Under the hood, the
 								[`quickstart_training.jinja`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training.jinja)
 								template defines the different combinations – for example, which parameters to
 								change if the pipeline should optimize for efficiency vs. accuracy. The file
 								[`quickstart_training_recommendations.yml`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training_recommendations.yml)
 								collects the recommended settings and available resources for each language
 								including the different transformer weights. For some languages, we include
 								different transformer recommendations, depending on whether you want the model
 								to be more efficient or more accurate. The recommendations will be **evolving**
 								as we run more experiments.
 								</Accordion>
-												Update docs [ci skip]

											
										
										
											2020-09-20 18:44:58 +03:00
+								<Project id="pipelines/tagger_parser_ud">
 								The easiest way to get started is to clone a [project template](/usage/projects)
 								and run it – for example, this end-to-end template that lets you train a
 								**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
 								treebank.
 								</Project>
-												Update docs [ci skip]

											
										
										
## Training config system {#config}

Training config files include all **settings and hyperparameters** for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
Under the hood, the training config uses the
[configuration system](https://thinc.ai/docs/usage-config) provided by our
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
  functions like [model architectures](/api/architectures),
  [optimizers](https://thinc.ai/docs/api-optimizers) or
  [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
  passed into them. You can also
  [register your own functions](#custom-functions) to define custom
  architectures or methods, reference them in your config and tweak their
  parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
  multiple components, define them once and reference them as
  [variables](#config-interpolation).
- **Reproducibility with no hidden defaults.** The config file is the "single
  source of truth" and includes all settings.
- **Automated checks and validation.** When you load a config, spaCy checks if
  the settings are complete and if all values have the correct types. This lets
  you catch potential mistakes early. In your custom architectures, you can use
  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

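To give a rough idea of how type hints can drive validation, here is a minimal sketch that uses only the Python standard library. It is not how spaCy or Thinc implement their checks. The function `compounding_sketch` and the helper `check_settings` are hypothetical names made up for this illustration, and only plain classes such as `int` or `float` are checked.

```python
import inspect
from typing import Any, Callable, Dict, List, get_type_hints


def compounding_sketch(start: int, stop: int, compound: float) -> List[float]:
    """Hypothetical registered function, used only to demonstrate validation."""
    return [float(start), float(stop), compound]


def check_settings(func: Callable, settings: Dict[str, Any]) -> List[str]:
    """Return the problems found when matching settings to func's signature."""
    hints = get_type_hints(func)
    params = inspect.signature(func).parameters
    problems = []
    for name, param in params.items():
        if name not in settings:
            # A parameter without a default has to be present in the block.
            if param.default is inspect.Parameter.empty:
                problems.append(f"{name}: missing value")
            continue
        expected = hints.get(name)
        # Only plain classes are checked; generic hints like List[int] are skipped.
        if isinstance(expected, type) and not isinstance(settings[name], expected):
            got = type(settings[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    for name in settings:
        if name not in params:
            problems.append(f"{name}: unexpected setting")
    return problems


# "stop" has the wrong type, "compound" is missing and "extra" is unknown.
problems = check_settings(compounding_sketch, {"start": 100, "stop": "1000", "extra": 1})
print(problems)
```

The point of the sketch is that an incomplete or mistyped block is reported before any training starts, which is the behavior the last bullet describes.
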
```ini
%%GITHUB_SPACY/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                                                                     |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names.                                           |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                                                                         |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI.          |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process.                                                                                                  |
| `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining).                                                |
| `initialize`  | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime).             |

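The mapping from bracketed section names to a nested dictionary can be sketched with the standard library's `configparser`. This is an illustration under simplifying assumptions, not spaCy's actual loader: the helper `parse_nested` is a hypothetical name, values are decoded as JSON, and interpolation and registered functions are ignored.

```python
import json
from configparser import ConfigParser
from typing import Any, Dict

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components.ner]
factory = "ner"

[training]
seed = 0
"""


def parse_nested(text: str) -> Dict[str, Any]:
    """Turn dot-separated section names into nested dictionaries."""
    parser = ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(text)
    result: Dict[str, Any] = {}
    for section in parser.sections():
        node = result
        # "components.ner" becomes result["components"]["ner"].
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser.items(section):
            node[key] = json.loads(value)
    return result


config = parse_nested(CONFIG_TEXT)
print(config["components"]["ner"]["factory"])
```
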
<Infobox title="Config format and settings" emoji="📖">

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
[schedules](https://thinc.ai/docs/api-schedules).

</Infobox>

<YouTube id="BWhh3r6W-qE"></YouTube>

### Config lifecycle at runtime and training {#config-lifecycle}

A pipeline's `config.cfg` is considered the "single source of truth", both at
**training** and **runtime**. Under the hood,
[`Language.from_config`](/api/language#from_config) takes care of constructing
the `nlp` object using the settings defined in the config. An `nlp` object's
config is available as [`nlp.config`](/api/language#config) and it includes all
information about the pipeline, as well as the settings used to train and
initialize it.

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

At runtime, spaCy will only use the `[nlp]` and `[components]` blocks of the
config and load all data, including tokenization rules, model weights and other
resources from the pipeline directory. The `[training]` block contains the
settings for training the model and is only used during training. Similarly, the
`[initialize]` block defines how the initial `nlp` object should be set up
before training and whether it should be initialized with vectors or pretrained
tok2vec weights, or any other data needed by the components.

The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config – but without
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.

### Overwriting config settings on the command line {#config-overrides}

The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--paths.train ./corpus/train.spacy` sets the `train` value in the `[paths]`
block.

```cli
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128
```

Only existing sections and values in the config can be overwritten. At the end
of the training, the final filled `config.cfg` is exported with your pipeline,
so you'll always have a record of the settings that were used, including your
overrides. Overrides are added before [variables](#config-interpolation) are
resolved, by the way – so if you need to use a value in multiple places,
reference it across your config and override it on the CLI once.

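The override rules described above can be sketched in a few lines of standard-library Python. This is only an illustration, not spaCy's implementation: `apply_overrides` is a hypothetical helper, and decoding values as JSON with a fallback to plain strings is an assumption made for this example.

```python
import copy
import json
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], args: List[str]) -> Dict[str, Any]:
    """Apply '--section.key value' pairs to a copy of a nested config dict."""
    if len(args) % 2 != 0:
        raise ValueError("Overrides must come in '--section.key value' pairs")
    result = copy.deepcopy(config)
    for flag, raw in zip(args[0::2], args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {flag!r}")
        *sections, key = flag[2:].split(".")
        node = result
        for part in sections:
            if not isinstance(node.get(part), dict):
                raise KeyError(f"Unknown config section in {flag!r}")
            node = node[part]
        # Only values that already exist in the config can be overwritten.
        if key not in node:
            raise KeyError(f"Unknown config value in {flag!r}")
        try:
            node[key] = json.loads(raw)
        except json.JSONDecodeError:
            node[key] = raw  # not valid JSON, so keep it as a plain string
    return result


base = {"paths": {"train": None, "dev": None}, "training": {"batch_size": 64}}
cli_args = ["--paths.train", "./corpus/train.spacy", "--training.batch_size", "128"]
updated = apply_overrides(base, cli_args)
print(updated)
```

Because the overrides are applied to the config before variables are resolved, a single override of a value like `paths.train` also reaches every place that references it.
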
> #### 💡 Tip: Verbose logging
>
> If you're using config overrides, you can set the `--verbose` flag on
> [`spacy train`](/api/cli#train) to make spaCy log more info, including which
> overrides were set via the CLI and environment variables.

#### Adding overrides via environment variables {#config-overrides-env}

Instead of defining the overrides as CLI arguments, you can also use the
`SPACY_CONFIG_OVERRIDES` environment variable using the same argument syntax.
This is especially useful if you're training models as part of an automated
process. Environment variables **take precedence** over CLI overrides and values
defined in the config file.

```cli
$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.batch_size 128" ./your_script.sh
```

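As a rough sketch of the precedence rule, the following standard-library example merges CLI overrides with overrides read from `SPACY_CONFIG_OVERRIDES`, letting the environment variable win. The helpers `parse_pairs` and `collect_overrides` are hypothetical names, and the environment is passed in as a plain mapping so the example does not depend on the real process environment.

```python
import shlex
from typing import Dict, List, Mapping

ENV_VAR = "SPACY_CONFIG_OVERRIDES"


def parse_pairs(args: List[str]) -> Dict[str, str]:
    """Convert ['--a.b', 'value', ...] into {'a.b': 'value', ...}."""
    if len(args) % 2 != 0:
        raise ValueError("Overrides must come in '--section.key value' pairs")
    pairs = {}
    for flag, value in zip(args[0::2], args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {flag!r}")
        pairs[flag[2:]] = value
    return pairs


def collect_overrides(cli_args: List[str], environ: Mapping[str, str]) -> Dict[str, str]:
    """Merge CLI and environment overrides; the environment takes precedence."""
    env_args = shlex.split(environ.get(ENV_VAR, ""))
    overrides = parse_pairs(cli_args)
    overrides.update(parse_pairs(env_args))  # later update wins
    return overrides


fake_environ = {ENV_VAR: "--system.gpu_allocator pytorch --training.batch_size 128"}
merged = collect_overrides(["--training.batch_size", "64"], fake_environ)
print(merged)
```
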
### Reading from standard input {#config-stdin}

Setting the config path to `-` on the command line lets you read the config from
standard input and pipe it forward from a different process, like
[`init config`](/api/cli#init-config) or your own custom script. This is
especially useful for quick experiments, as it lets you generate a config on the
fly without having to save to and load from disk.

> #### 💡 Tip: Writing to stdout
>
> When you run `init config`, you can set the output path to `-` to write to
> stdout. In a custom script, you can print the string config, e.g.
> `print(nlp.config.to_str())`.

```cli
$ python -m spacy init config - --lang en --pipeline ner,textcat --optimize accuracy | python -m spacy train - --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

<!-- TODO: add reference to Prodigy's commands once Prodigy nightly is available -->

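If you want to support the same `-` convention in your own script, a minimal sketch could look like this. The function `read_config_text` is a hypothetical helper, not part of spaCy, and an in-memory stream stands in for the real standard input.

```python
import io
import sys
from typing import Optional, TextIO


def read_config_text(path: str, stdin: Optional[TextIO] = None) -> str:
    """Read config text from a file, or from standard input if path is '-'."""
    if path == "-":
        stream = stdin if stdin is not None else sys.stdin
        return stream.read()
    with open(path, encoding="utf8") as file_:
        return file_.read()


# Simulate a config being piped in from another process.
piped = io.StringIO('[nlp]\nlang = "en"\n')
text = read_config_text("-", stdin=piped)
print(text)
```
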
### Using variable interpolation {#config-interpolation}

Another very useful feature of the config system is that it supports variable
interpolation for both **values and sections**. This means that you only need to
define a setting once and can reference it across your config using the
`${section.value}` syntax. In this example, the value of `seed` is reused within
the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`.

```ini
### config.cfg (excerpt) {highlight="5,18"}
[system]
seed = 0

[training]
seed = ${system.seed}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8

[pretraining]
optimizer = ${training.optimizer}
```

You can also use variables inside strings. In that case, it works just like
f-strings in Python. If the value of a variable is not a string, it's converted
to a string.

```ini
[paths]
version = 5
root = "/Users/you/data"
train = "${paths.root}/train_${paths.version}.spacy"
# Result: /Users/you/data/train_5.spacy
```

<Infobox title="Tip: Override variables on the CLI" emoji="💡">

If you need to change certain values between training runs, you can define them
once, reference them as variables and then [override](#config-overrides) them on
the CLI. For example, `--paths.root /other/root` will change the value of `root`
in the block `[paths]` and the change will be reflected across all other values
that reference this variable.

</Infobox>

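The two kinds of interpolation, whole values or sections and variables inside strings, can be sketched on a plain nested dictionary. This is a simplified illustration and not Thinc's implementation: `lookup` and `interpolate` are hypothetical helpers, and circular references are not detected.

```python
import re
from typing import Any, Dict

VARIABLE = re.compile(r"\$\{([^}]+)\}")


def lookup(config: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted name like 'training.optimizer' into the config."""
    node: Any = config
    for part in dotted.split("."):
        node = node[part]
    return node


def interpolate(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with all ${section.value} references resolved."""

    def resolve(value: Any) -> Any:
        if isinstance(value, dict):
            return {key: resolve(item) for key, item in value.items()}
        if not isinstance(value, str):
            return value
        whole = VARIABLE.fullmatch(value)
        if whole:
            # The entire value is one variable: keep its type (or whole section).
            return resolve(lookup(config, whole.group(1)))
        # Variables inside a longer string are converted to strings.
        return VARIABLE.sub(lambda m: str(resolve(lookup(config, m.group(1)))), value)

    return resolve(config)


raw = {
    "system": {"seed": 0},
    "training": {
        "seed": "${system.seed}",
        "optimizer": {"@optimizers": "Adam.v1", "L2": 0.01},
    },
    "pretraining": {"optimizer": "${training.optimizer}"},
    "paths": {
        "version": 5,
        "root": "/Users/you/data",
        "train": "${paths.root}/train_${paths.version}.spacy",
    },
}
resolved = interpolate(raw)
print(resolved["paths"]["train"])
```
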
## Customizing the pipeline and training {#config-custom}

### Defining pipeline components {#config-components}

You typically train a [pipeline](/usage/processing-pipelines) of **one or more
components**. The `[components]` block in the config defines the available
pipeline components and how they should be created – either by a built-in or
custom [factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
trained pipeline. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline. There are different ways you might want to treat
your components during training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **trained component** with more examples.
3. Include an existing trained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing trained pipeline, with its
existing weights. This lets you include an already trained component in your
pipeline, or update a trained component with more data specific to your use
case.

```ini
### config.cfg (excerpt)
[components]

# "parser" and "ner" are sourced from a trained pipeline
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

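The difference between `factory` and `source` blocks can be sketched as a simple dispatch. This is a conceptual illustration only and not how spaCy builds its pipeline: `create_component` is a hypothetical helper, and plain dictionaries and strings stand in for factories, trained pipelines and components.

```python
from typing import Any, Callable, Dict


def create_component(
    name: str,
    block: Dict[str, Any],
    factories: Dict[str, Callable[..., Any]],
    trained_pipelines: Dict[str, Dict[str, Any]],
) -> Any:
    """Create a component from a config block that defines 'source' or 'factory'."""
    if "source" in block:
        # Sourced: reuse the existing component (and its weights) as-is.
        return trained_pipelines[block["source"]][name]
    if "factory" in block:
        # Factory: all other settings in the block are passed as arguments.
        settings = {key: value for key, value in block.items() if key != "factory"}
        return factories[block["factory"]](**settings)
    raise ValueError(f"Component {name!r} needs a 'factory' or a 'source'")


def your_custom_factory(your_custom_setting: bool = False) -> Dict[str, Any]:
    """Stand-in factory that returns a dict instead of a real component."""
    return {"kind": "custom", "your_custom_setting": your_custom_setting}


factories = {"your_custom_factory": your_custom_factory}
trained_pipelines = {"en_core_web_sm": {"parser": "<trained parser>", "ner": "<trained ner>"}}
blocks = {
    "parser": {"source": "en_core_web_sm"},
    "custom": {"factory": "your_custom_factory", "your_custom_setting": True},
}
built = {
    name: create_component(name, block, factories, trained_pipelines)
    for name, block in blocks.items()
}
print(built)
```
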
The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights. For sourced components, spaCy will keep the existing
weights and [resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained pipeline
as-is. They are also excluded when calling
[`nlp.initialize`](/api/language#initialize).

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during training and evaluation. This is very important, because
> they may still impact your model's performance – for instance, a sentence
> boundary detector can impact what the parser or entity recognizer considers a
> valid parse. So the evaluation results should always reflect what your
> pipeline will produce at runtime.

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

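To make the distinction between running and updating explicit, here is a small sketch that derives a per-component plan from the two settings above. `training_plan` is a hypothetical helper for illustration. It only reflects the freezing rule and does not check whether a component is actually trainable.

```python
from typing import Dict, List


def training_plan(pipeline: List[str], frozen_components: List[str]) -> Dict[str, Dict[str, bool]]:
    """Describe, per component, whether it runs and whether it may be updated."""
    unknown = [name for name in frozen_components if name not in pipeline]
    if unknown:
        raise ValueError(f"Frozen components not in the pipeline: {unknown}")
    return {
        # Every component runs; frozen ones are just never updated.
        name: {"runs": True, "may_be_updated": name not in frozen_components}
        for name in pipeline
    }


plan = training_plan(["parser", "ner", "textcat", "custom"], ["parser", "custom"])
for name, status in plan.items():
    print(name, status)
```
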
<Infobox variant="warning" title="Shared Tok2Vec listener layer" id="config-components-listeners">

When the components in your pipeline
[share an embedding layer](/usage/embeddings-transformers#embedding-layers), the
**performance** of your frozen component will be **degraded** if you continue
training other layers with the same underlying `Tok2Vec` instance. As a rule of
thumb, ensure that your frozen components are truly **independent** in the
pipeline.

To automatically replace a shared token-to-vector listener with an independent
copy of the token-to-vector layer, you can use the `replace_listeners` setting
of a sourced component, pointing to the listener layer(s) in the config. For
more details on how this works under the hood, see
[`Language.replace_listeners`](/api/language#replace_listeners).

```ini
[training]
frozen_components = ["tagger"]

[components.tagger]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
```

</Infobox>

### Using registered functions {#config-functions}

The training configuration defined in the config file doesn't have to only
consist of static values. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick
(see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).

											2020-07-06 23:22:37 +03:00
```ini
### With static value
[training]
batch_size = 128
```

To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax to specify the function and its arguments – in
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
defined in the [function registry](/api/top-level#registry). All other values
defined in the block are passed to the function as keyword arguments when it's
initialized. You can also use this mechanism to register
[custom implementations and architectures](#custom-functions) and reference them
from your configs.

> #### How the config is resolved
>
> The config file is parsed into a regular dictionary and is resolved and
> validated **bottom-up**. Arguments provided for registered functions are
> checked against the function's signature and type annotations. The return
> value of a registered function can also be passed into another function – for
> instance, a learning rate schedule can be provided as an argument of an
> optimizer.

```ini
### With registered function
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```

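To make the bottom-up resolution more concrete, here is a minimal plain-Python sketch of the idea. It is not spaCy's or Thinc's implementation: the `SCHEDULES` dictionary, the `register` helper, the `resolve` function and the name `toy_compounding.v1` are hypothetical stand-ins, and the toy schedule assumes `compound` is greater than `1` and `start` is below `stop`.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterator

# Hypothetical stand-in for a function registry: maps string names to functions.
SCHEDULES: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable], Callable]:
    """Store a function in the toy registry under the given name."""
    def decorator(func: Callable) -> Callable:
        SCHEDULES[name] = func
        return func
    return decorator


@register("toy_compounding.v1")
def toy_compounding(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield start, start * compound, ... and never exceed stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def resolve(block: Any) -> Any:
    """Resolve nested blocks first, then call the function named by "@schedules"."""
    if not isinstance(block, dict):
        return block
    resolved = {key: resolve(value) for key, value in block.items()}
    if "@schedules" in resolved:
        func = SCHEDULES[resolved.pop("@schedules")]
        # All remaining values in the block become keyword arguments.
        return func(**resolved)
    return resolved


# Dictionary equivalent of the [training.batch_size] block above.
config = {
    "training": {
        "batch_size": {
            "@schedules": "toy_compounding.v1",
            "start": 100,
            "stop": 1000,
            "compound": 1.001,
        }
    }
}
resolved = resolve(config)
first_sizes = list(islice(resolved["training"]["batch_size"], 3))
print(first_sizes)
```

After resolving, `resolved["training"]["batch_size"]` is no longer a block of settings but the object the function returned, here a generator of batch sizes.
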
### Model architectures {#model-architectures}

> #### 💡 Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like ~~Model[List[Doc],
> List[Floats2d]]~~. This so-called generic type describes the layer and its
> input and output type – in this case, it takes a list of `Doc` objects as the
> input and a list of 2-dimensional arrays of floats as the output. You can read
> more about defining Thinc models [here](https://thinc.ai/docs/usage-models).
> Also see the [type checking](https://thinc.ai/docs/usage-type-checking) guide
> for how to enable linting in your editor to see live feedback if your inputs
> and outputs don't match.

A **model architecture** is a function that wires up a Thinc
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
component or as a layer of a larger network. You can use Thinc as a thin
[wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as
PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc
[directly](https://thinc.ai/docs/usage-models). For more details and examples,
see the usage guide on [layers and architectures](/usage/layers-architectures).

spaCy's built-in components will never construct their `Model` instances
themselves, so you won't have to subclass the component to change its model
architecture. You can just **update the config** so that it refers to a
different registered function. Once the component has been created, its `Model`
instance has already been assigned, so you cannot change its model architecture.
The architecture is like a recipe for the network, and you can't change the
recipe once the dish has already been prepared. You have to make a new one.

spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:

| Architecture                                                      | Description |
| ----------------------------------------------------------------- | ----------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN)                   | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble)             | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ |

### Metrics, training output and weighted scores {#metrics}

When you train a pipeline using the [`spacy train`](/api/cli#train) command,
you'll see a table showing the metrics after each pass over the data. The
available metrics **depend on the pipeline components**. Pipeline components
also define which scores are shown and how they should be **weighted in the
final score** that determines the best model.

The `training.score_weights` setting in your `config.cfg` lets you customize the
scores shown in the table and how they should be weighted. In this example, the
labeled dependency accuracy and NER F-score count towards the final score with
40% each and the tagging accuracy makes up the remaining 20%. The tokenization
accuracy and speed are both shown in the table, but not counted towards the
score.

> #### Why do I need score weights?
>
> At the end of your training process, you typically want to select the **best
> model** – but what "best" means depends on the available components and your
> specific use case. For instance, you may prefer a pipeline with higher NER and
> lower POS tagging accuracy over a pipeline with lower NER and higher POS
> accuracy. You can express this preference in the score weights, e.g. by
> assigning `ents_f` (NER F-score) a higher weight.

```ini
[training.score_weights]
dep_las = 0.4
dep_uas = null
ents_f = 0.4
tag_acc = 0.2
token_acc = 0.0
speed = 0.0
```

The `score_weights` don't _have to_ sum to `1.0` – but it's recommended. When
you generate a config for a given pipeline, the score weights are generated by
combining and normalizing the default score weights of the pipeline components.
The default score weights are defined by each pipeline component via the
`default_score_weights` setting on the
[`@Language.factory`](/api/language#factory) decorator. By default, all pipeline
components are weighted equally. If a score weight is set to `null`, the score
is excluded from the logs and doesn't count towards the weighted final score.

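As a rough illustration of how such weights could be combined, the following plain-Python sketch computes a weighted sum and normalizes a set of weights. The helper names `weighted_score` and `normalize_weights` and the score values are made up for this example; this is not spaCy's internal scoring code.

```python
from typing import Dict, Optional

Weights = Dict[str, Optional[float]]


def normalize_weights(weights: Weights) -> Weights:
    """Rescale non-null weights so they sum to 1.0; null entries stay None."""
    total = sum(w for w in weights.values() if w is not None)
    if total == 0:
        return dict(weights)
    return {k: (None if w is None else w / total) for k, w in weights.items()}


def weighted_score(scores: Dict[str, float], weights: Weights) -> float:
    """Weighted sum over all scores whose weight is not null."""
    return sum(
        scores.get(name, 0.0) * weight
        for name, weight in weights.items()
        if weight is not None
    )


# Same weights as in the config above; None plays the role of `null`.
weights = {
    "dep_las": 0.4,
    "dep_uas": None,
    "ents_f": 0.4,
    "tag_acc": 0.2,
    "token_acc": 0.0,
    "speed": 0.0,
}
# Illustrative scores only, not real training results.
scores = {
    "dep_las": 0.80,
    "dep_uas": 0.85,
    "ents_f": 0.90,
    "tag_acc": 0.95,
    "token_acc": 1.0,
    "speed": 12000.0,
}
final = weighted_score(scores, weights)
print(round(final, 4))
```

Scores with a weight of `0.0` stay visible but contribute nothing to the sum, while the entry set to `None` is skipped entirely.
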
<Accordion title="Understanding the training output and score types" spaced>

| Name                       | Description                                                                                                             |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Loss**                   | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`.      |
| **Precision** (P)          | Percentage of predicted annotations that were correct. Should increase.                                                 |
| **Recall** (R)             | Percentage of reference annotations recovered. Should increase.                                                         |
| **F-Score** (F)            | Harmonic mean of precision and recall. Should increase.                                                                 |
| **UAS** / **LAS**          | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable.                                                               |

Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization
errors are **excluded from the NER evaluation**. If your tokenization makes it
impossible for the model to predict 50% of your entities, your NER F-score might
still look good.

</Accordion>

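For reference, precision, recall and F-score can be computed from counts of correct, spurious and missed annotations. The sketch below uses a hypothetical helper, `precision_recall_f`, with made-up counts, and returns fractions between `0` and `1` rather than percentages.

```python
from typing import Tuple


def precision_recall_f(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Compute precision, recall and their harmonic mean (F-score)."""
    predicted = true_pos + false_pos  # everything the model predicted
    reference = true_pos + false_neg  # everything in the reference annotations
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / reference if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up counts: 8 correct predictions, 2 wrong predictions, 8 missed annotations.
p, r, f = precision_recall_f(true_pos=8, false_pos=2, false_neg=8)
print(p, r, round(f, 4))
```
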
## Custom functions {#custom-functions}

Registered functions in the training config files can refer to built-in
implementations, but you can also plug in fully **custom implementations**. All
you need to do is register your function using the `@spacy.registry` decorator
with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.

Each custom function can have any number of arguments that are passed in via the
[config](#config), just like the built-in functions. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.

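The connection between default argument values and auto-filling can be sketched with Python's `inspect` module. This is only an illustration of the idea, not how `init fill-config` is implemented; `fill_defaults` and `make_schedule` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict


def fill_defaults(func: Callable, block: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of block with missing arguments filled from func's defaults."""
    filled = dict(block)
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue
        if param.default is inspect.Parameter.empty:
            # No default available, so the value has to be set explicitly.
            raise ValueError(f"Missing required argument: {name}")
        filled[name] = param.default
    return filled


def make_schedule(start: float, stop: float, compound: float = 1.001):
    """Hypothetical registered function: only `compound` has a default."""
    return (start, stop, compound)


print(fill_defaults(make_schedule, {"start": 100, "stop": 1000}))
```

Because `start` and `stop` have no defaults, leaving either of them out of the block raises an error instead of being filled in silently.
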
### Training with custom code {#custom-code}

> ```cli
> ### Training
> $ python -m spacy train config.cfg --code functions.py
> ```
>
> ```cli
> ### Packaging
> $ python -m spacy package ./model-best ./packages --code functions.py
> ```

The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
that can then be referenced from your `config.cfg`. This lets you train spaCy
pipelines with custom components, without having to re-implement the whole
training workflow. When you package your trained pipeline later using
[`spacy package`](/api/cli#package), you can provide one or more Python files to
be included in the package and imported in its `__init__.py`. This means that
any custom architectures, functions or
[components](/usage/processing-pipelines#custom-components) will be shipped with
your pipeline and registered when it's loaded. See the documentation on
[saving and loading pipelines](/usage/saving-loading#models-custom) for details.

#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}

For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch – it's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide five optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:

| Callback                      | Description |
| ----------------------------- | ----------- |
| `nlp.before_creation`         | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults) aside from the tokenizer settings. |
| `nlp.after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. |
| `nlp.after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |
| `initialize.before_init`      | Called before the pipeline components are initialized and receives the `nlp` object for in-place modification. Useful for modifying the tokenizer settings, similar to the v2 base model option. |
| `initialize.after_init`       | Called after the pipeline components are initialized and receives the `nlp` object for in-place modification. |

The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a custom
stop word to the defaults:

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```

```python
### functions.py {highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.add("good")
        return lang_cls

    return customize_language_data
```

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback** – it's
not the callback itself.

</Infobox>

Any registered function – in this case `create_callback` – can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```

```python
### functions.py {highlight="5,7-9"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data
```

<Infobox title="Tip: Use Python type hints" emoji="💡">

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

</Infobox>

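The difference between a coercing check and a strict check can be illustrated without any dependencies. The two helpers below, `coerce_bool` and `strict_bool`, are hypothetical and only mimic the general behavior described above; the exact set of values that `pydantic` accepts for a plain `bool` is defined by `pydantic` itself and may differ.

```python
from typing import Any


def coerce_bool(value: Any) -> bool:
    """Lenient check: accept booleans and a few common boolean-like values."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.strip().lower()
    if value in (1, "1", "true", "yes", "on"):
        return True
    if value in (0, "0", "false", "no", "off"):
        return False
    raise TypeError(f"Can't interpret {value!r} as a boolean")


def strict_bool(value: Any) -> bool:
    """Strict check: accept only real booleans."""
    if not isinstance(value, bool):
        raise TypeError(f"Expected a boolean, got {type(value).__name__}")
    return value


print(coerce_bool(1))     # lenient: 1 is treated as True
print(strict_bool(True))  # strict: only True or False pass
```
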
With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.

```cli
$ python -m spacy train config.cfg --output ./output --code ./functions.py
```

#### Example: Modifying tokenizer settings {#custom-tokenizer}
 								Use the `initialize.before_init` callback to modify the tokenizer settings when
 								training a new pipeline. Write a registered callback that modifies the tokenizer
 								settings and specify this callback in your config:
 								> #### config.cfg
 								>
 								> ```ini
 								> [initialize]
 								>
 								> [initialize.before_init]
 								> @callbacks = "customize_tokenizer"
 								> ```
 								```python
 								### functions.py
 								from spacy.util import registry, compile_suffix_regex
 								@registry.callbacks("customize_tokenizer")
 								def make_customize_tokenizer():
 								    def customize_tokenizer(nlp):
 								        # remove a suffix
 								        suffixes = list(nlp.Defaults.suffixes)
 								        suffixes.remove("\\[")
 								        suffix_regex = compile_suffix_regex(suffixes)
 								        nlp.tokenizer.suffix_search = suffix_regex.search
 								        # add a special case
 								        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
 								    return customize_tokenizer
 								```
 								When training, provide the function above with the `--code` option:
 								```cli
 								$ python -m spacy train config.cfg --code ./functions.py
 								```
 								Because this callback is only called in the one-time initialization step before
 								training, the callback code does not need to be packaged with the final pipeline
 								package. However, to make it easier for others to replicate your training setup,
 								you can choose to package the initialization callbacks with the pipeline package
 								or to publish them separately.
<Infobox variant="warning" title="nlp.before_creation vs. initialize.before_init">

- `nlp.before_creation` is the best place to modify language defaults other than
  the tokenizer settings.
- `initialize.before_init` is the best place to modify tokenizer settings when
  training a new pipeline.

Unlike the other language defaults, the tokenizer settings are saved with the
pipeline with `nlp.to_disk()`, so modifications made in `nlp.before_creation`
will be clobbered by the saved settings when the trained pipeline is loaded from
disk.

</Infobox>
#### Example: Custom logging function {#custom-logging}

During training, the results of each step are passed to a logger function. By
default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
[`WandbLogger`](/api/top-level#WandbLogger). On each step, the logger function
receives a **dictionary** with the following keys:
| Key            | Value                                                                                                 |
| -------------- | ----------------------------------------------------------------------------------------------------- |
| `epoch`        | How many passes over the data have been completed. ~~int~~                                            |
| `step`         | How many steps have been completed. ~~int~~                                                           |
| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                           |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~                |
| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                        |
| `checkpoints`  | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |
You can easily implement and plug in your own logger that records the training
results in a custom way, or sends them to an experiment management tracker of
your choice. In this example, the function `my_custom_logger.v1` writes the
tabular results to a file:
> ```ini
> ### config.cfg (excerpt)
> [training.logger]
> @loggers = "my_custom_logger.v1"
> log_path = "my_file.tab"
> ```
```python
### functions.py
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path

@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO=sys.stdout,
        stderr: IO=sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        # Write a tab-separated header row: step, score, one loss per component
        log_file.write("step\t")
        log_file.write("score\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\t")
        log_file.write("\n")

        def log_step(info: Optional[Dict[str, Any]]):
            # info can be None, so only write a row when results are available
            if info:
                log_file.write(f"{info['step']}\t")
                log_file.write(f"{info['score']}\t")
                for pipe in nlp.pipe_names:
                    log_file.write(f"{info['losses'][pipe]}\t")
                log_file.write("\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger
```
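The `setup_logger` → `(log_step, finalize)` contract can also be tried outside
of training, which can be handy for checking a logger before a long run. The
following is only a sketch under a few assumptions: `jsonl_logger` is a
hypothetical, unregistered factory that writes one JSON object per step, and a
`SimpleNamespace` with a `pipe_names` attribute stands in for the `nlp` object.
To reference something like it from a config, you would still register it with
`@spacy.registry.loggers`.

```python
import json
import sys
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import IO, Any, Callable, Dict, List, Optional, Tuple


def jsonl_logger(log_path: str) -> Callable:
    """Hypothetical logger factory: writes one JSON line per logged step."""

    def setup_logger(
        nlp: Any, stdout: IO = sys.stdout, stderr: IO = sys.stderr
    ) -> Tuple[Callable, Callable]:
        log_file = Path(log_path).open("w", encoding="utf8")

        def log_step(info: Optional[Dict[str, Any]]) -> None:
            # info is typed as Optional in the example above, so skip None.
            if info is None:
                return
            record = {
                "step": info["step"],
                "score": info["score"],
                # Components without a recorded loss are written as null.
                "losses": {name: info["losses"].get(name) for name in nlp.pipe_names},
            }
            log_file.write(json.dumps(record) + "\n")

        def finalize() -> None:
            log_file.close()

        return log_step, finalize

    return setup_logger


def demo() -> List[str]:
    """Drive the logger by hand and return the lines it wrote."""
    # Stand-in for a pipeline: this logger only reads `pipe_names`.
    fake_nlp = SimpleNamespace(pipe_names=["tagger", "ner"])
    with tempfile.TemporaryDirectory() as tmp_dir:
        log_path = Path(tmp_dir) / "train_log.jsonl"
        log_step, finalize = jsonl_logger(str(log_path))(fake_nlp)
        log_step(None)
        log_step({"step": 100, "score": 0.5, "losses": {"tagger": 12.5, "ner": 30.0}})
        finalize()
        return log_path.read_text(encoding="utf8").splitlines()
```

Calling `demo()` returns the lines written to the log file, so you can inspect
the format without running `spacy train`.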
#### Example: Custom batch size schedule {#custom-code-schedule}

You can also implement your own batch size schedule to use during training. The
`@spacy.registry.schedules` decorator lets you register that function in the
`schedules` [registry](/api/top-level#registry) and assign it a string name:
 								> #### Why the version in the name?
 								>
 								> A big benefit of the config system is that it makes your experiments
 								> reproducible. We recommend versioning the functions you register, especially
 								> if you expect them to change (like a new model architecture). This way, you
 								> know that a config referencing `v1` means a different function than a config
 								> referencing `v2`.
```python
### functions.py
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    while True:
        yield start
        start = start * factor
```
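Because the schedule is an infinite generator, it can help to preview its first
few values before training. The sketch below repeats the generator without the
registry decorator so it runs with the standard library alone; `preview` is a
hypothetical helper, not part of spaCy.

```python
from itertools import islice
from typing import Iterator, List


def my_custom_schedule(start: int = 1, factor: float = 1.001) -> Iterator[float]:
    # Same logic as the registered function above, minus the decorator.
    while True:
        yield start
        start = start * factor


def preview(schedule: Iterator[float], n: int) -> List[float]:
    """Return the first n values without exhausting the infinite generator."""
    return list(islice(schedule, n))


first_values = preview(my_custom_schedule(start=2, factor=1.005), 3)
print(first_values)
```

Note that every value after the first is a float, since `start` is multiplied
by a float `factor` on each step.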
In your config, you can now reference the schedule in the
`[training.batch_size]` block via `@schedules`. If a block contains a key
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults: every argument of
the function needs to be represented in the config.
 								```ini
 								### config.cfg (excerpt)
 								[training.batch_size]
 								@schedules = "my_custom_schedule.v1"
 								start = 2
 								factor = 1.005
 								```
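To make the "`@` key names the function, everything else becomes keyword
arguments" rule concrete, here is a toy resolver. It is only an illustration of
the lookup step, not Thinc's implementation: `SCHEDULES`, `register` and
`resolve_block` are hypothetical names, and validation, variable interpolation
and nested blocks are all left out.

```python
import configparser
import json
from typing import Any, Callable, Dict, Iterator

# Toy registry: maps string names to functions.
SCHEDULES: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        SCHEDULES[name] = func
        return func

    return decorator


@register("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001) -> Iterator[float]:
    while True:
        yield start
        start = start * factor


CONFIG_TEXT = """
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
"""


def resolve_block(config_text: str, section: str) -> Any:
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Treat each value as a JSON literal, so "..." is a string and 2 is an int.
    values = {key: json.loads(value) for key, value in parser[section].items()}
    # The key starting with "@" names the function; the rest are keyword arguments.
    func_name = values.pop("@schedules")
    return SCHEDULES[func_name](**values)


schedule = resolve_block(CONFIG_TEXT, "training.batch_size")
print(next(schedule))
```

In this sketch, leaving `start` or `factor` out of the block would silently fall
back to the Python defaults, which is exactly the kind of hidden default the
config is meant to avoid.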
### Defining custom architectures {#custom-architectures}
 								Built-in pipeline components such as the tagger or named entity recognizer are
 								constructed with default neural network [models](/api/architectures). You can
 								change the model architecture entirely by implementing your own custom models
 								and providing those in the config when creating the pipeline component. See the
 								documentation on [layers and model architectures](/usage/layers-architectures)
 								for more details.
 								> ```ini
 								> ### config.cfg
 								> [components.tagger]
 								> factory = "tagger"
 								>
 								> [components.tagger.model]
 								> @architectures = "custom_neural_network.v1"
 								> output_width = 512
 								> ```
```python
### functions.py
from typing import List
from thinc.types import Floats2d
from thinc.api import Model
import spacy
from spacy.tokens import Doc

@spacy.registry.architectures("custom_neural_network.v1")
def custom_neural_network(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    # create_model is a placeholder for your own code that builds the Thinc model
    return create_model(output_width)
```
## Customizing the initialization {#initialization}

When you start training a new model from scratch,
[`spacy train`](/api/cli#train) will call
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and load
the required data. All settings for this are defined in the
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
you can keep track of how the initial `nlp` object was created. The
initialization process typically includes the following:
 								> #### config.cfg (excerpt)
 								>
 								> ```ini
 								> [initialize]
 								> vectors = ${paths.vectors}
 								> init_tok2vec = ${paths.init_tok2vec}
 								>
 								> [initialize.components]
 								> # Settings for components
 								> ```
1. Load in **data resources** defined in the `[initialize]` config, including
   **word vectors** and
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
   weights**.
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
   access the training data, the current `nlp` object and any **custom
   arguments** defined in the `[initialize]` config.
3. In **pipeline components**: if needed, use the data to
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
   set up the label scheme if no labels are provided. Components may also load
   other data like lookup tables or dictionaries.
The initialization step allows the config to define **all settings** required
for the pipeline, while keeping a separation between settings and functions that
should only be used **before training** to set up the initial pipeline, and
logic and configuration that needs to be available **at runtime**. Without that
separation, it would be very difficult to use the same, reproducible config file
because the component settings required for training (load data from an external
file) wouldn't match the component settings required at runtime (load what's
included with the saved `nlp` object and don't depend on an external file).
![Illustration of pipeline lifecycle](../images/lifecycle.svg)

<Infobox title="How components save and load data" emoji="📖">
 								For details and examples of how pipeline components can **save and load data
 								assets** like model weights or lookup tables, and how the component
 								initialization is implemented under the hood, see the usage guide on
 								[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).
 								</Infobox>
#### Initializing labels {#initialization-labels}
 								Built-in pipeline components like the
 								[`EntityRecognizer`](/api/entityrecognizer) or
 								[`DependencyParser`](/api/dependencyparser) need to know their available labels
 								and associated internal meta information to initialize their model weights.
 								Using the `get_examples` callback provided on initialization, they're able to
 								**read the labels off the training data** automatically, which is very
 								convenient – but it can also slow down the training process to compute this
 								information on every run.
 								The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
 								files containing the label data for all supported components. You can then pass
 								in the labels in the `[initialize]` settings for the respective components to
 								allow them to initialize faster.
> #### config.cfg
>
> ```ini
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```
 								```cli
 								$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
 								```
 								Under the hood, the command delegates to the `label_data` property of the
 								pipeline components, for instance
 								[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
 								<Infobox variant="warning" title="Important note">
 								The JSON format differs for each component and some components need additional
 								meta information about their labels. The format exported by
 								[`init labels`](/api/cli#init-labels) matches what the components need, so you
 								should always let spaCy **auto-generate the labels** for you.
</Infobox>

## Data utilities {#data}

spaCy includes various features and utilities to make it easy to train models
using your own data, manage training and evaluation corpora, convert existing
annotations and configure data augmentation strategies for more robust models.
 								### Converting existing corpora and annotations {#data-convert}
 								If you have training data in a standard format like `.conll` or `.conllu`, the
 								easiest way to convert it for use with spaCy is to run
 								[`spacy convert`](/api/cli#convert) and pass it a file and an output directory.
 								By default, the command will pick the converter based on the file extension.
 								```cli
 								$ python -m spacy convert ./train.gold.conll ./corpus
 								```
> #### 💡 Tip: Converting from Prodigy
>
> If you're using the [Prodigy](https://prodi.gy) annotation tool to create
> training data, you can run the
> [`data-to-spacy` command](https://prodi.gy/docs/recipes#data-to-spacy) to
> merge and export multiple datasets for use with
> [`spacy train`](/api/cli#train). Different types of annotations on the same
> text will be combined, giving you one corpus to train multiple components.
<Infobox title="Tip: Manage multi-step workflows with projects" emoji="💡">
 								Training workflows often consist of multiple steps, from preprocessing the data
 								all the way to packaging and deploying the trained model.
 								[spaCy projects](/usage/projects) let you define all steps in one file, manage
 								data assets, track changes and share your end-to-end processes with your team.
 								</Infobox>
The binary `.spacy` format is a serialized [`DocBin`](/api/docbin) containing
one or more [`Doc`](/api/doc) objects. It's extremely **efficient in storage**,
especially when packing multiple documents together. You can also create `Doc`
objects manually, so you can write your own custom logic to convert and store
existing annotations for use in spaCy.
```python
### Training data from Doc objects {highlight="6-9"}
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
docbin = DocBin()
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, True, False]
ents = ["B-ORG", "O", "O", "O", "O", "B-GPE", "O", "O"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)
docbin.add(doc)
docbin.to_disk("./train.spacy")
```
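Existing annotations are often stored as character offsets rather than one tag
per token. As a sketch of the kind of custom conversion logic mentioned above,
the following standard-library code derives per-token IOB tags like the `ents`
list from `words`, `spaces` and a list of spans. The helper names
`token_offsets` and `spans_to_iob` and the `(start_char, end_char, label)` span
format are assumptions made for this illustration.

```python
from typing import List, Tuple


def token_offsets(words: List[str], spaces: List[bool]) -> List[Tuple[int, int]]:
    """Character (start, end) offsets per token, given trailing-space flags."""
    offsets = []
    position = 0
    for word, space in zip(words, spaces):
        start, end = position, position + len(word)
        offsets.append((start, end))
        position = end + (1 if space else 0)
    return offsets


def spans_to_iob(
    words: List[str], spaces: List[bool], spans: List[Tuple[int, int, str]]
) -> List[str]:
    """Convert character-offset spans into one IOB tag per token."""
    tags = ["O"] * len(words)
    offsets = token_offsets(words, spaces)
    for span_start, span_end, label in spans:
        inside = False
        for i, (start, end) in enumerate(offsets):
            # Only tokens that lie fully inside the span are tagged.
            if start >= span_start and end <= span_end:
                tags[i] = f"{'I' if inside else 'B'}-{label}"
                inside = True
    return tags


words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, True, False]
# Hypothetical source annotations: (start_char, end_char, label)
spans = [(0, 5, "ORG"), (27, 31, "GPE")]
ents = spans_to_iob(words, spaces, spans)
```

The resulting `ents` list has one tag per entry in `words`, so it can be passed
to `Doc` in the same way as the hand-written list in the example above. Spans
that don't align with token boundaries are only partially tagged or skipped in
this sketch, so real data may need an extra alignment check.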
### Working with corpora {#data-corpora}
 								> #### Example
 								>
 								> ```ini
 								> [corpora]
 								>
 								> [corpora.train]
 								> @readers = "spacy.Corpus.v1"
 								> path = ${paths.train}
 								> gold_preproc = false
 								> max_length = 0
 								> limit = 0
 								> augmenter = null
 								>
 								> [training]
 								> train_corpus = "corpora.train"
 								> ```
 								The [`[corpora]`](/api/data-formats#config-corpora) block in your config lets
 								you define **data resources** to use for training, evaluation, pretraining or
 								any other custom workflows. `corpora.train` and `corpora.dev` are used as
 								conventions within spaCy's default configs, but you can also define any other
 								custom blocks. Each section in the corpora config should resolve to a
 								[`Corpus`](/api/corpus) – for example, using spaCy's built-in
 								[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
 								file. The `train_corpus` and `dev_corpus` fields in the
 								[`[training]`](/api/data-formats#config-training) block specify where to find
 								the corpus in your config. This makes it easy to **swap out** different corpora
 								by only changing a single config setting.
 								Instead of making `[corpora]` a block with multiple subsections for each portion
 								of the data, you can also use a single function that returns a dictionary of
 								corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 								especially useful if you need to split a single file into corpora for training
 								and evaluation, without loading the same file twice.
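The single-function variant can be sketched as follows. This is an illustration
under assumptions rather than spaCy's API: `split_corpora` is a hypothetical
name, the records are plain strings, and the registry decorator is left out so
the code runs with the standard library alone. In a real project, the outer
function would be registered with `@spacy.registry.readers` and each reader
would yield [`Example`](/api/example) objects.

```python
import random
from typing import Any, Callable, Dict, Iterator, List


def split_corpora(
    records: List[Any], dev_fraction: float = 0.2, seed: int = 0
) -> Dict[str, Callable[[Any], Iterator[Any]]]:
    """Split one dataset into "train" and "dev" readers, loading it only once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    dev, train = shuffled[:n_dev], shuffled[n_dev:]

    def make_reader(items: List[Any]) -> Callable[[Any], Iterator[Any]]:
        def read(nlp: Any) -> Iterator[Any]:
            # A real reader would turn each record into an Example using `nlp`.
            yield from items

        return read

    return {"train": make_reader(train), "dev": make_reader(dev)}


corpora = split_corpora([f"text {i}" for i in range(10)])
train_items = list(corpora["train"](None))
dev_items = list(corpora["dev"](None))
```

The fixed `seed` keeps the split stable between runs, which matters if you want
the dev set to stay comparable across experiments.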
 								### Custom data reading and batching {#custom-code-readers-batchers}
Some use cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to file. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using such a dataset for training, you can rely on stopping criteria such as a
maximum number of steps, or stop when the loss no longer decreases.
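The interaction between an infinite stream and the two stopping criteria can be
sketched in plain Python. This is not spaCy's training loop: `infinite_stream`,
`run_steps` and the `get_loss` callback (a stand-in for one training update) are
hypothetical names used only for this illustration.

```python
import itertools
from typing import Callable, Iterable, Iterator, List


def infinite_stream(texts: List[str]) -> Iterator[str]:
    """Cycle over the data forever; the consumer decides when to stop."""
    return itertools.cycle(texts)


def run_steps(
    stream: Iterable[str],
    get_loss: Callable[[int, str], float],
    max_steps: int,
    patience: int,
) -> int:
    """Consume the stream until max_steps is hit or the loss stops improving."""
    best_loss = float("inf")
    steps_without_improvement = 0
    step = 0
    for item in stream:
        if step >= max_steps:
            break
        step += 1
        loss = get_loss(step, item)
        if loss < best_loss:
            best_loss = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
        if steps_without_improvement >= patience:
            break
    return step
```

For example, `run_steps(infinite_stream(["a", "b"]), lambda step, item: 1.0,
max_steps=50, patience=3)` stops early because a constant loss never improves,
while a steadily decreasing loss runs until `max_steps`.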
In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.
> #### config.cfg
>
> ```ini
> [corpora.train]
> @readers = "corpus_variants.v1"
> source = "s3://your_bucket/path/data.csv"
> ```
```python
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.training import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        # read_custom_data is assumed to be defined elsewhere and to yield
        # (text, cats) pairs
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream
```
 								<Infobox variant="warning">
 								Remember that a registered function should always be a function that spaCy
 								**calls to create something**. In this case, it **creates the reader function**
 								– it's not the reader itself.
 								</Infobox>
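
The reader above calls `read_custom_data`, which the snippet doesn't define. As
a rough sketch, assuming the source is a local CSV file with hypothetical `text`
and `label` columns, such a helper could look like the following. The column
names, the `LABELS` tuple and the `parse_rows` helper are assumptions made for
this illustration, and a remote path like the `s3://` one in the config would
need a different way of opening the file.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Tuple

# Hypothetical label set; a real project would derive this from its data
LABELS = ("POSITIVE", "NEGATIVE")


def parse_rows(handle: TextIO) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (text, cats) pairs from CSV rows with 'text' and 'label' columns."""
    for row in csv.DictReader(handle):
        text = row["text"].strip()
        if not text:
            # Skip empty texts, since the reader indexes into the text
            continue
        cats = {label: float(label == row["label"]) for label in LABELS}
        yield text, cats


def read_custom_data(source: str) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Open a local CSV file and stream its (text, cats) pairs."""
    with open(source, newline="", encoding="utf8") as handle:
        yield from parse_rows(handle)


# Try the parsing step on an in-memory file
sample = io.StringIO("text,label\nI love this,POSITIVE\n,NEGATIVE\nNot great,NEGATIVE\n")
rows = list(parse_rows(sample))
```

Skipping empty texts matters here because the reader's
`random.randint(0, len(text) - 1)` call would fail on an empty string.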

We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the same underlying raw text, to avoid duplicates within each batch. Note
that in a more realistic implementation, you'd also want to check whether the
annotations are the same.

> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```

```python
### functions.py
from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.training import Example

@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []

    return create_filtered_batches
```
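
As a rough, spaCy-independent sketch of the more realistic variant mentioned
above, the following generator treats two items as duplicates only if both
their text and their annotations match. It works on plain `(text, annotations)`
tuples rather than `Example` objects, and the name `filtered_batches` is made up
for this illustration. Unlike the registered batcher above, it also yields the
final, smaller batch; whether you want that is a design choice.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# (text, annotations): a simple stand-in for an Example object
Item = Tuple[str, Dict[str, Any]]


def filtered_batches(items: Iterable[Item], size: int) -> Iterator[List[Item]]:
    batch: List[Item] = []
    for item in items:
        # Tuples compare element-wise, so this checks text and annotations
        if item not in batch:
            batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the leftover items instead of silently dropping them
        yield batch


data = [
    ("good", {"cats": {"POS": 1.0}}),
    ("good", {"cats": {"POS": 1.0}}),  # exact duplicate, dropped
    ("good", {"cats": {"POS": 0.0}}),  # same text, different annotation, kept
    ("bad", {"cats": {"POS": 0.0}}),
]
batches = list(filtered_batches(data, size=2))
```

As in the original function, duplicates are only detected within the batch that
is currently being filled, not across the whole stream.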

<!-- TODO:
* Custom corpus class
* Minibatching
-->

### Data augmentation {#data-augmentation}

Data augmentation is the process of applying small **modifications** to the
training data. It can be especially useful for punctuation and case replacement
– for example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples.

The easiest way to use data augmentation during training is to provide an
`augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
augmenter creates a data augmentation callback that uses orth-variant
replacement.

```ini
### config.cfg (excerpt) {highlight="8,14"}
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0

[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Percentage of texts that will be augmented / lowercased
level = 0.1
lower = 0.5

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
```

The `orth_variants` argument lets you pass in a dictionary of replacement rules,
typically loaded from a JSON file. There are two types of orth variant rules:
`"single"` for single tokens that should be replaced (e.g. hyphens) and
`"paired"` for pairs of tokens (e.g. quotes).

<!-- prettier-ignore -->
```json
### orth_variants.json
{
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
  "paired": [{ "tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]] }]
}
```

<Accordion title="Full examples for English and German" spaced>

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
```

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
```

</Accordion>
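
The sketch below shows, in plain Python, one way the `"single"` rules and the
`level` and `lower` settings could interact. It is a simplified illustration
rather than spaCy's implementation: the function name `apply_single_variants`,
the `(word, tag)` input format and the choice to lowercase only texts that are
also being augmented are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

RULES = {"single": [{"tags": ["NFP"], "variants": ["…", "..."]}]}


def apply_single_variants(
    tokens: Sequence[Tuple[str, str]],
    rules: Dict[str, list],
    level: float,
    lower: float,
    rng: random.Random,
) -> List[str]:
    """Return token texts, possibly with variants swapped in and lowercased."""
    words = [word for word, _ in tokens]
    if rng.random() >= level:
        # Most texts stay untouched: only a `level` share is augmented
        return words
    if rng.random() < lower:
        words = [word.lower() for word in words]
    for rule in rules.get("single", []):
        # Pick one variant per rule so the text stays internally consistent
        replacement = rng.choice(rule["variants"])
        for i, (word, tag) in enumerate(tokens):
            if tag in rule["tags"] and word in rule["variants"]:
                words[i] = replacement
    return words


tokens = [("Wait", "UH"), ("...", "NFP"), ("What", "WP")]
untouched = apply_single_variants(tokens, RULES, level=0.0, lower=0.5, rng=random.Random(0))
augmented = apply_single_variants(tokens, RULES, level=1.0, lower=1.0, rng=random.Random(0))
```

With `level=0.0` the text is always returned unchanged, while `level=1.0` and
`lower=1.0` force both the variant replacement and the lowercasing.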

<Infobox title="Important note" variant="warning">

When adding data augmentation, keep in mind that it typically only makes sense
to apply it to the **training corpus**, not the development data.

</Infobox>

#### Writing custom data augmenters {#data-augmentation-custom}

Using the [`@spacy.augmenters`](/api/top-level#registry) registry, you can also
register your own data augmentation callbacks. The callback should be a function
that takes the current `nlp` object and a training [`Example`](/api/example) and
yields `Example` objects. Keep in mind that the augmenter should yield **all
examples** you want to use in your corpus, not only the augmented examples
(unless you want to augment all examples).

Here's an example of a custom augmentation callback that produces text variants
in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
registered function takes one argument `randomize` that can be set via the
config and decides whether the uppercase/lowercase transformation is applied
randomly or not. The augmenter yields two `Example` objects: the original
example and the augmented example.

> #### config.cfg
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spongebob_augmenter.v1"
> randomize = false
> ```

```python
import spacy
import random

@spacy.registry.augmenters("spongebob_augmenter.v1")
def create_augmenter(randomize: bool = False):
    def augment(nlp, example):
        text = example.text
        if randomize:
            # Randomly uppercase/lowercase characters
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
        else:
            # Uppercase followed by lowercase
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        # Create augmented training example
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(chars))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        # Original example followed by augmented example
        yield example
        yield example.from_dict(doc, example_dict)

    return augment
```

An easy way to create modified `Example` objects is to use the
[`Example.from_dict`](/api/example#from_dict) method with a new reference
[`Doc`](/api/doc) created from the modified text. In this case, only the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.

Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.
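
The casing logic itself can be tried without spaCy. The sketch below
reimplements the two branches as a standalone function (the name
`spongebob_case` and the `seed` parameter are assumptions for this illustration)
and points out one edge case: `str.upper()` can change the length of a string
for a few characters such as `ß`, so character offsets are not guaranteed to
stay aligned for every language.

```python
import random
from typing import Optional


def spongebob_case(text: str, randomize: bool = False, seed: Optional[int] = None) -> str:
    rng = random.Random(seed)
    if randomize:
        # Randomly uppercase/lowercase characters
        chars = [c.lower() if rng.random() < 0.5 else c.upper() for c in text]
    else:
        # Uppercase followed by lowercase
        chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
    return "".join(chars)


alternating = spongebob_case("hello world")  # "HeLlO WoRlD"
shuffled = spongebob_case("hello world", randomize=True, seed=1)

# Ignoring case, the text is unchanged in both modes
same_letters = shuffled.lower() == "hello world"

# Caveat: upper-casing "ß" yields "SS", which lengthens the text
length_changed = len(spongebob_case("ß")) != len("ß")
```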

## Parallel & distributed training with Ray {#parallel-training}

> #### Installation
>
> ```cli
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
> # Check that the CLI is registered
> $ python -m spacy ray --help
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process. Parallel
training won't always be faster though – it depends on your batch size, models,
and hardware.

<Infobox variant="warning">

To use Ray with spaCy, you need the
[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed.
Installing the package will automatically add the `ray` command to the spaCy
CLI.

</Infobox>

The [`spacy ray train`](/api/cli#ray-train) command follows the same API as
[`spacy train`](/api/cli#train), with a few extra options to configure the Ray
setup. You can optionally set the `--address` option to point to your Ray
cluster. If it's not set, Ray will run locally.

```cli
python -m spacy ray train config.cfg --n-workers 2
```

<Project id="integrations/ray">

Get started with parallel training using our project template. It trains a
simple model on a Universal Dependencies Treebank and lets you parallelize the
training with Ray.

</Project>

### How parallel training works {#parallel-training-details}

Each worker receives a shard of the **data** and builds a copy of the **model
and optimizer** from the [`config.cfg`](#config). It also has a communication
channel to **pass gradients and parameters** to the other workers. Additionally,
each worker is given ownership of a subset of the parameter arrays. Every
parameter array is owned by exactly one worker, and the workers are given a
mapping so they know which worker owns which parameter.

![Illustration of setup](../images/spacy-ray.svg)

As training proceeds, every worker will be computing gradients for **all** of
the model parameters. When they compute gradients for parameters they don't own,
they'll **send them to the worker** that does own that parameter, along with a
version identifier so that the owner can decide whether to discard the gradient.
Workers use the gradients they receive and the ones they compute locally to
update the parameters they own, and then broadcast the updated array and a new
version ID to the other workers.
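
To make the ownership and versioning idea more concrete, here is a toy,
single-process sketch. It is not the `spacy-ray` implementation: the round-robin
ownership map, the `OwnedParameter` class, the plain gradient descent step and
the rule of discarding any gradient computed against an older version are all
assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


def assign_owners(param_names: List[str], n_workers: int) -> Dict[str, int]:
    """Give every parameter exactly one owning worker (round-robin)."""
    return {name: i % n_workers for i, name in enumerate(sorted(param_names))}


@dataclass
class OwnedParameter:
    value: float
    version: int = 0

    def receive_gradient(self, grad: float, version: int, lr: float = 0.1) -> bool:
        """Apply a gradient if it is current; return whether it was used."""
        if version != self.version:
            # Computed against an outdated copy of the parameter: discard
            return False
        self.value -= lr * grad
        # The new version ID is what the owner would broadcast with the update
        self.version += 1
        return True


owners = assign_owners(["tok2vec.W", "tagger.W", "tagger.b"], n_workers=2)
param = OwnedParameter(value=1.0)
first = param.receive_gradient(grad=0.5, version=0)  # current, applied
stale = param.receive_gradient(grad=0.5, version=0)  # outdated, discarded
```

The second gradient is rejected because the first update already moved the
parameter on to version 1.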

This training procedure is **asynchronous** and **non-blocking**. Workers always
push their gradient increments and parameter updates; they do not have to pull
them and block on the result, so the transfers can happen in the background,
overlapped with the actual training work. The workers also do not have to stop
and wait for each other ("synchronize") at the start of each batch. This is very
useful for spaCy, because spaCy is often trained on long documents, which means
**batches can vary in size** significantly. Uneven workloads make synchronous
gradient descent inefficient, because if one batch is slow, all of the other
workers are stuck waiting for it to complete before they can continue.
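
A back-of-the-envelope calculation shows why uneven batches hurt synchronous
training. In the sketch below, the per-batch durations are made-up numbers and
the model ignores communication overhead entirely: with a barrier at every
step, each step costs as much as its slowest worker, whereas without a barrier
each worker simply runs through its own batches.

```python
from typing import List

# Hypothetical seconds per batch for two workers; each has one slow batch
batch_times: List[List[float]] = [
    [1.0, 5.0, 1.0],  # worker 0
    [5.0, 1.0, 1.0],  # worker 1
]


def synchronous_time(times: List[List[float]]) -> float:
    """Every step waits for the slowest worker before the next one starts."""
    return sum(max(step) for step in zip(*times))


def asynchronous_time(times: List[List[float]]) -> float:
    """Workers never wait, so the run ends when the busiest worker finishes."""
    return max(sum(worker) for worker in times)


sync_total = synchronous_time(batch_times)    # 5 + 5 + 1 = 11 seconds
async_total = asynchronous_time(batch_times)  # max(7, 7) = 7 seconds
```

Both workers do the same total amount of work, but the synchronous schedule
pays for the slow batch twice because the slow steps don't line up.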

## Internal training API {#api}

<Infobox variant="warning">

spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your pipelines via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
scripts from scratch. [Custom registered functions](#custom-code) should
typically give you everything you need to train fully custom pipelines with
[`spacy train`](/api/cli#train).

</Infobox>

The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. It also includes the **alignment** between those two
documents if they differ in tokenization. The `Example` class ensures that spaCy
can rely on one **standardized format** that's passed through the pipeline. For
instance, let's say we want to define gold-standard part-of-speech tags:

```python
import numpy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```

As this is quite verbose, there's an alternative way to create the reference
`Doc` with the gold-standard annotations. The function `Example.from_dict` takes
the predicted `Doc` and a dictionary whose keys specify the annotations, like
`tags` or `entities`. Using the resulting `Example` object and its gold-standard
annotations, the model can be updated to learn a sentence of three words with
their assigned part-of-speech tags.

```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```

Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
a token inside an entity and `L` the last token of an entity.

```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
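
All entities in this example are single tokens, so only `U-` and `O` tags
appear. To show the other prefixes, here is a small standalone sketch that
derives BILUO tags from token-level spans. The helper name `spans_to_biluo` and
the `(start, end, label)` span format with an exclusive end index are
assumptions for this illustration, not part of spaCy's API.

```python
from typing import List, Tuple


def spans_to_biluo(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans, end exclusive, to BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
            continue
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{label}"  # tokens in between
        tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


words = ["The", "New", "York", "Times", "covered", "React"]
tags = spans_to_biluo(len(words), [(1, 4, "ORG"), (5, 6, "TECHNOLOGY")])
# tags: ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-TECHNOLOGY"]
```

The sketch assumes the spans don't overlap. If your annotations are character
offsets rather than token indices, spaCy's own `offsets_to_biluo_tags` helper in
`spacy.training` is the better starting point.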
-												Update example and training docs

											
										
										
											2020-07-07 21:30:12 +03:00
+								<Infobox title="Migrating from v2.x" variant="warning">
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												Update example and training docs

											
										
										
											2020-07-07 21:30:12 +03:00
+								As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
-												Three small typos

Some little typos since v3.0 is out.

											
										
										
											2020-10-15 19:06:37 +03:00
+								It can be constructed in a very similar way – from a `Doc` and a dictionary of
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								annotations. For more details, see the
 								[migration guide](/usage/v3#migrating-training).
-												Update example and training docs

											
										
										
											2020-07-07 21:30:12 +03:00
```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

</Infobox>

Of course, it's not enough to show a model a single example only once.
Especially if you only have a few examples, you'll want to train for a
**number of iterations**. At each iteration, the training data is **shuffled**
to ensure the model doesn't make any generalizations based on the order of
examples. Another technique to improve the learning results is to set a
**dropout rate**, a rate at which to randomly "drop" individual features and
representations. This makes it harder for the model to memorize the training
data. For example, a `0.25` dropout means that each feature or internal
representation has a 1/4 likelihood of being dropped.

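To make the `0.25` figure concrete, here is a minimal sketch of dropout in plain Python. The `apply_dropout` helper is hypothetical and only for illustration; it is not how spaCy or Thinc implement dropout internally, and real implementations typically also rescale the values that are kept, which is omitted here.

```python
import random


def apply_dropout(features, rate, rng):
    """Zero out each feature independently with probability `rate`.

    Hypothetical helper for illustration only.
    """
    return [0.0 if rng.random() < rate else value for value in features]


# Fixed seed so the sketch is reproducible.
rng = random.Random(0)
features = [1.0] * 10_000
dropped = apply_dropout(features, 0.25, rng)

# With a 0.25 rate, roughly a quarter of the features end up dropped.
fraction_dropped = dropped.count(0.0) / len(features)
print(f"Fraction dropped: {fraction_dropped:.3f}")
```

Because a different random subset is dropped on every update, the model can't rely on any single feature being present, which is what makes memorizing the training data harder.
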
> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.
> - [`Example`](/api/example): Object holding predictions and gold-standard
>   annotations.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a
>   directory.

```python
### Example training loop
import random
from spacy.training import Example

# Assumes `nlp` is your pipeline and `train_data` is a list of
# (raw_text, entity_offsets) tuples.
optimizer = nlp.initialize()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/output")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name       | Description                                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                           |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                 |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

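Because `nlp.update` accepts a sequence of examples, a training loop would typically pass several examples per call rather than one at a time. The sketch below shows only the batching step. The `batched` helper is hypothetical and plain strings stand in for `Example` objects so that it runs without spaCy; spaCy ships its own `spacy.util.minibatch` utility for this purpose.

```python
import random


def batched(items, size):
    """Yield consecutive chunks of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Plain strings stand in for Example objects so the sketch runs without spaCy.
examples = [f"example-{i}" for i in range(10)]

# Shuffle before batching so each pass sees the data in a different order.
random.Random(0).shuffle(examples)
batches = list(batched(examples, 4))

for batch in batches:
    # In a real loop this would be something like:
    # nlp.update(batch, drop=0.25, sgd=optimizer)
    print(len(batch), batch)
```

The last batch may be smaller than the others, which is fine as long as every example is seen once per iteration.
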
<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class
and the "simple training style" of calling `nlp.update` with a text and a
dictionary of annotations. Updating your code to use the `Example` object should
be very straightforward: you can call
[`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the
dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

</Infobox>