mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-16 14:47:16 +03:00
554df9ef20
---
title: Training Pipelines & Models
teaser: Train and update components on your own data and integrate custom models
next: /usage/layers-architectures
menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
  - ['Data Utilities', 'data']
  - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']
---

## Introduction to training {id="basics",hidden="true"}

<Training101 />

<Infobox title="Tip: Try the Prodigy annotation tool">

<Image
  src="/images/prodigy.jpg"
  href="https://prodi.gy"
  alt="Prodigy: Radically efficient machine teaching"
/>

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy pipelines.

</Infobox>

## Quickstart {id="quickstart",tag="new"}

The recommended way to train your spaCy pipelines is via the
[`spacy train`](/api/cli#train) command on the command line. It only needs a
single [`config.cfg`](#config) **configuration file** that includes all settings
and hyperparameters. You can optionally [overwrite](#config-overrides) settings
on the command line, and load in a Python file to register
[custom functions](#custom-code) and architectures. This quickstart widget helps
you generate a starter config with the **recommended settings** for your
specific use case. It's also available in spaCy as the
[`init config`](/api/cli#init-config) command.

<Infobox variant="warning">

Upgrade to the [latest version of spaCy](/usage) to use the quickstart widget.
For earlier releases, follow the CLI instructions to generate a compatible
config.

</Infobox>

> #### Instructions: widget
>
> 1. Select your requirements and settings.
> 2. Use the buttons at the bottom to save the result to your clipboard or a
>    file `base_config.cfg`.
> 3. Run [`init fill-config`](/api/cli#init-fill-config) to create a full
>    config.
> 4. Run [`train`](/api/cli#train) with your config and data.
>
> #### Instructions: CLI
>
> 1. Run the [`init config`](/api/cli#init-config) command and specify your
>    requirements and settings as CLI arguments.
> 2. Run [`train`](/api/cli#train) with the exported config and data.

<QuickstartTraining />

After you've saved the starter config to a file `base_config.cfg`, you can use
the [`init fill-config`](/api/cli#init-fill-config) command to fill in the
remaining defaults. Training configs should always be **complete and without
hidden defaults**, to keep your experiments reproducible.

```bash
$ python -m spacy init fill-config base_config.cfg config.cfg
```

> #### Tip: Debug your data
>
> The [`debug data` command](/api/cli#debug-data) lets you analyze and validate
> your training and development data, get useful stats, and find problems like
> invalid entity annotations, cyclic dependencies, low data labels and more.
>
> ```bash
> $ python -m spacy debug data config.cfg
> ```

Instead of exporting your starter config from the quickstart widget and
auto-filling it, you can also use the [`init config`](/api/cli#init-config)
command and specify your requirements and settings as CLI arguments. You can now
add your data and run [`train`](/api/cli#train) with your config. See the
[`convert`](/api/cli#convert) command for details on how to convert your data to
spaCy's binary `.spacy` format. You can either include the data paths in the
`[paths]` section of your config, or pass them in via the command line.

```bash
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

> #### Tip: Enable your GPU
>
> Use the `--gpu-id` option to select the GPU:
>
> ```bash
> $ python -m spacy train config.cfg --gpu-id 0
> ```

<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>

The recommended config settings generated by the quickstart widget and the
[`init config`](/api/cli#init-config) command are based on some general **best
practices** and things we've found to work well in our experiments. The goal is
to provide you with the most **useful defaults**.

Under the hood, the
[`quickstart_training.jinja`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training.jinja)
template defines the different combinations – for example, which parameters to
change if the pipeline should optimize for efficiency vs. accuracy. The file
[`quickstart_training_recommendations.yml`](%%GITHUB_SPACY/spacy/cli/templates/quickstart_training_recommendations.yml)
collects the recommended settings and available resources for each language
including the different transformer weights. For some languages, we include
different transformer recommendations, depending on whether you want the model
to be more efficient or more accurate. The recommendations will be **evolving**
as we run more experiments.

</Accordion>

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started is to clone a [project template](/usage/projects)
and run it – for example, this end-to-end template that lets you train a
**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
treebank.

</Project>

## Training config system {id="config"}

Training config files include all **settings and hyperparameters** for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
Under the hood, the training config uses the
[configuration system](https://thinc.ai/docs/usage-config) provided by our
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
  functions like [model architectures](/api/architectures),
  [optimizers](https://thinc.ai/docs/api-optimizers) or
  [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
  passed into them. You can also
  [register your own functions](#custom-functions) to define custom
  architectures or methods, reference them in your config and tweak their
  parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
  multiple components, define them once and reference them as
  [variables](#config-interpolation).
- **Reproducibility with no hidden defaults.** The config file is the "single
  source of truth" and includes all settings.
- **Automated checks and validation.** When you load a config, spaCy checks if
  the settings are complete and if all values have the correct types. This lets
  you catch potential mistakes early. In your custom architectures, you can use
  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

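The idea behind references to registered functions can be illustrated with a
small standard-library sketch. It does not use spaCy's or Thinc's actual
registry: `REGISTRY`, `register`, `resolve_block` and `toy_adam` are
hypothetical names chosen for this example. A block with an `@`-prefixed key
names a registered function, and the remaining keys are passed to it as
arguments.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry, keyed by registry name
# (e.g. "optimizers") and then by function name (e.g. "ToyAdam.v1").
REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {}


def register(registry_name: str, func_name: str) -> Callable:
    """Decorator that stores a function under registry_name/func_name."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY.setdefault(registry_name, {})[func_name] = func
        return func

    return decorator


@register("optimizers", "ToyAdam.v1")
def toy_adam(learn_rate: float, beta1: float = 0.9) -> Dict[str, float]:
    # A real optimizer would return an object; a dict keeps the sketch short.
    return {"learn_rate": learn_rate, "beta1": beta1}


def resolve_block(block: Dict[str, Any]) -> Any:
    """Call the function referenced by the "@..." key with the other keys."""
    refs = [key for key in block if key.startswith("@")]
    if not refs:
        return block  # plain settings, nothing to construct
    if len(refs) > 1:
        raise ValueError(f"Expected one '@' reference, found: {refs}")
    registry_name = refs[0][1:]
    func = REGISTRY[registry_name][block[refs[0]]]
    kwargs = {key: value for key, value in block.items() if key != refs[0]}
    return func(**kwargs)


optimizer = resolve_block({"@optimizers": "ToyAdam.v1", "learn_rate": 0.001})
print(optimizer)
```

Because the function's arguments live in the config block, changing a
hyperparameter means editing the config rather than the code.
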
```ini
%%GITHUB_SPACY/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                                                                     |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names.                                           |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models.                                                                         |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI.          |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process.                                                                                                  |
| `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining).                                                |
| `initialize`  | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime).             |

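The way sections and dot notation become a nested dictionary can be sketched
with Python's standard library. This is not Thinc's implementation:
`parse_nested` and `CONFIG_TEXT` are hypothetical names, the values are
illustrative, and treating every value as a JSON literal is an assumption that
holds for the simple values shown here.

```python
import configparser
import json
from typing import Any, Dict

# Illustrative excerpt, not a complete training config.
CONFIG_TEXT = """
[training]
seed = 0
dropout = 0.1

[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
"""


def parse_nested(text: str) -> Dict[str, Any]:
    """Parse INI-style text and nest dotted section names into dicts."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(text)
    result: Dict[str, Any] = {}
    for section in parser.sections():
        node = result
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser.items(section):
            node[key] = json.loads(raw)  # assumption: values are JSON literals
    return result


config = parse_nested(CONFIG_TEXT)
print(config["training"]["batch_size"]["start"])
```

The subsection ends up as a dictionary inside its parent section, so
`[training.batch_size]` is reachable as `config["training"]["batch_size"]`.
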
<Infobox title="Config format and settings" emoji="📖">

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
[schedules](https://thinc.ai/docs/api-schedules).

</Infobox>

<YouTube id="BWhh3r6W-qE"></YouTube>

### Config lifecycle at runtime and training {id="config-lifecycle"}

A pipeline's `config.cfg` is considered the "single source of truth", both at
**training** and **runtime**. Under the hood,
[`Language.from_config`](/api/language#from_config) takes care of constructing
the `nlp` object using the settings defined in the config. An `nlp` object's
config is available as [`nlp.config`](/api/language#config) and it includes all
information about the pipeline, as well as the settings used to train and
initialize it.

![Illustration of pipeline lifecycle](/images/lifecycle.svg)

At runtime spaCy will only use the `[nlp]` and `[components]` blocks of the
config and load all data, including tokenization rules, model weights and other
resources from the pipeline directory. The `[training]` block contains the
settings for training the model and is only used during training. Similarly, the
`[initialize]` block defines how the initial `nlp` object should be set up
before training and whether it should be initialized with vectors or pretrained
tok2vec weights, or any other data needed by the components.

The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config – but without
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.

### Overwriting config settings on the command line {id="config-overrides"}

The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--paths.train ./corpus/train.spacy` sets the `train` value in the `[paths]`
block.

```bash
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128
```

Only existing sections and values in the config can be overwritten. At the end
of the training, the final filled `config.cfg` is exported with your pipeline,
so you'll always have a record of the settings that were used, including your
overrides. Overrides are added before [variables](#config-interpolation) are
resolved, by the way – so if you need to use a value in multiple places,
reference it across your config and override it on the CLI once.

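To make the override rules above concrete, here is a minimal sketch of how
dotted override names could map onto a nested config dictionary. It is an
illustration under assumptions, not spaCy's implementation: `apply_overrides`
is a hypothetical helper, and the starting values in `config` are made up. It
mirrors the rule that only existing sections and values can be overwritten.

```python
from typing import Any, Dict


def apply_overrides(
    config: Dict[str, Any], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Set dotted keys like "paths.train" on a nested dict, in place."""
    for dotted_key, value in overrides.items():
        *section_path, name = dotted_key.split(".")
        node = config
        for part in section_path:
            if not isinstance(node.get(part), dict):
                raise KeyError(f"Unknown config section for override: {dotted_key}")
            node = node[part]
        if name not in node:
            raise KeyError(f"Unknown config value for override: {dotted_key}")
        node[name] = value
    return config


# Illustrative starting config with placeholder values.
config = {
    "paths": {"train": None, "dev": None},
    "training": {"batch_size": 1000},
}
apply_overrides(
    config,
    {"paths.train": "./corpus/train.spacy", "training.batch_size": 128},
)
print(config["paths"]["train"], config["training"]["batch_size"])
```

An override such as `paths.test` would raise a `KeyError` in this sketch,
because `test` is not defined in the `paths` block.
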
> #### 💡 Tip: Verbose logging
>
> If you're using config overrides, you can set the `--verbose` flag on
> [`spacy train`](/api/cli#train) to make spaCy log more info, including which
> overrides were set via the CLI and environment variables.

#### Adding overrides via environment variables {id="config-overrides-env"}

Instead of defining the overrides as CLI arguments, you can also use the
`SPACY_CONFIG_OVERRIDES` environment variable using the same argument syntax.
This is especially useful if you're training models as part of an automated
process. Environment variables **take precedence** over CLI overrides and values
defined in the config file.

```bash
$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.batch_size 128" ./your_script.sh
```

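The precedence rule can be sketched as follows. This is a simplified
illustration, not spaCy's own parsing code: `parse_override_args` and
`collect_overrides` are hypothetical helpers, and values are left as strings
rather than converted to their config types.

```python
import os
import shlex
from typing import Dict, List


def parse_override_args(args: List[str]) -> Dict[str, str]:
    """Turn ["--a.b", "1", "--c.d", "x"] into {"a.b": "1", "c.d": "x"}."""
    if len(args) % 2 != 0:
        raise ValueError("Overrides must come in '--section.name value' pairs")
    overrides = {}
    for flag, value in zip(args[::2], args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"Override names must start with '--': {flag}")
        overrides[flag[2:]] = value
    return overrides


def collect_overrides(cli_args: List[str]) -> Dict[str, str]:
    """Merge CLI and environment overrides; the environment wins on conflicts."""
    env_value = os.environ.get("SPACY_CONFIG_OVERRIDES", "")
    env_overrides = parse_override_args(shlex.split(env_value))
    return {**parse_override_args(cli_args), **env_overrides}


os.environ["SPACY_CONFIG_OVERRIDES"] = "--training.batch_size 128"
merged = collect_overrides(
    ["--training.batch_size", "64", "--paths.train", "./train.spacy"]
)
print(merged)
```

Here the CLI asks for a batch size of 64, but the value from the environment
variable is the one that ends up in the merged overrides.
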
### Reading from standard input {id="config-stdin"}

Setting the config path to `-` on the command line lets you read the config from
standard input and pipe it forward from a different process, like
[`init config`](/api/cli#init-config) or your own custom script. This is
especially useful for quick experiments, as it lets you generate a config on the
fly without having to save to and load from disk.

> #### 💡 Tip: Writing to stdout
>
> When you run `init config`, you can set the output path to `-` to write to
> stdout. In a custom script, you can print the string config, e.g.
> `print(nlp.config.to_str())`.

```bash
$ python -m spacy init config - --lang en --pipeline ner,textcat --optimize accuracy | python -m spacy train - --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

### Using variable interpolation {id="config-interpolation"}

Another very useful feature of the config system is that it supports variable
interpolation for both **values and sections**. This means that you only need to
define a setting once and can reference it across your config using the
`${section.value}` syntax. In this example, the value of `seed` is reused within
the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`.

```ini {title="config.cfg (excerpt)",highlight="5,18"}
[system]
seed = 0

[training]
seed = ${system.seed}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8

[pretraining]
optimizer = ${training.optimizer}
```

You can also use variables inside strings. In that case, it works just like
f-strings in Python. If the value of a variable is not a string, it's converted
to a string.

```ini
[paths]
version = 5
root = "/Users/you/data"
train = "${paths.root}/train_${paths.version}.spacy"
# Result: /Users/you/data/train_5.spacy
```

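The two interpolation cases described above (a value that is only a reference,
and a reference embedded in a string) can be sketched in plain Python. This is
an illustration, not the config system's actual resolver: `lookup`,
`interpolate` and `VARIABLE` are hypothetical names, and the sketch does no
cycle detection.

```python
import re
from typing import Any, Dict

VARIABLE = re.compile(r"\$\{([\w.]+)\}")


def lookup(config: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "paths.root" through nested dicts."""
    node: Any = config
    for part in dotted.split("."):
        node = node[part]
    return node


def interpolate(config: Dict[str, Any], value: Any) -> Any:
    """Resolve ${section.value} references in a single config value."""
    if not isinstance(value, str):
        return value
    whole = VARIABLE.fullmatch(value)
    if whole:
        # The value is only a reference: keep the referenced type (or block).
        return interpolate(config, lookup(config, whole.group(1)))
    # The reference sits inside a larger string: substitute its string form.
    return VARIABLE.sub(
        lambda match: str(interpolate(config, lookup(config, match.group(1)))),
        value,
    )


config = {
    "paths": {
        "version": 5,
        "root": "/Users/you/data",
        "train": "${paths.root}/train_${paths.version}.spacy",
    },
    "system": {"seed": 0},
    "training": {"seed": "${system.seed}"},
}
print(interpolate(config, config["paths"]["train"]))
print(interpolate(config, config["training"]["seed"]))
```

The first call produces the path from the excerpt above, with the integer
`version` converted to a string. The second call returns the integer seed
unchanged, because the whole value is a reference.
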
<Infobox title="Tip: Override variables on the CLI" emoji="💡">

If you need to change certain values between training runs, you can define them
once, reference them as variables and then [override](#config-overrides) them on
the CLI. For example, `--paths.root /other/root` will change the value of `root`
in the block `[paths]` and the change will be reflected across all other values
that reference this variable.

</Infobox>

## Preparing Training Data {id="training-data"}

Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.

When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.

Here's an example of creating a `.spacy` file from some NER annotations.

```python {title="preprocess.py"}
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    # (text, [(start_char, end_char, label), ...])
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        # char_span returns None if the offsets don't match token boundaries
        if span is None:
            raise ValueError(f"Misaligned span ({start}, {end}) in: {text!r}")
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```

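Before converting annotations like the ones above, it can help to run a quick
sanity check on the character offsets. The following standard-library sketch is
one possible pre-flight check and is not part of spaCy: `check_annotations` is
a hypothetical helper, and it only looks for out-of-bounds offsets, overlapping
spans and stray whitespace. It does not know about spaCy's tokenization, so
spans that pass here can still fail to align with token boundaries.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start_char, end_char, label)


def check_annotations(text: str, annotations: List[Annotation]) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(annotations):
        if not (0 <= start < end <= len(text)):
            problems.append(f"({start}, {end}, {label!r}) is out of bounds")
            continue
        if start < previous_end:
            problems.append(f"({start}, {end}, {label!r}) overlaps an earlier span")
        if text[start:end] != text[start:end].strip():
            problems.append(f"({start}, {end}, {label!r}) has surrounding whitespace")
        previous_end = max(previous_end, end)
    return problems


text = "Tokyo Tower is 333m tall."
clean = check_annotations(text, [(0, 11, "BUILDING")])
overlapping = check_annotations(text, [(0, 11, "BUILDING"), (6, 11, "BUILDING")])
print(clean)
print(overlapping)
```
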
For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).

<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>

In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.

</Accordion>

## Customizing the pipeline and training {id="config-custom"}

### Defining pipeline components {id="config-components"}

You typically train a [pipeline](/usage/processing-pipelines) of **one or more
components**. The `[components]` block in the config defines the available
pipeline components and how they should be created – either by a built-in or
custom [factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
trained pipeline. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline. There are different ways you might want to treat
your components during training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **trained component** with more examples.
3. Include an existing trained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing trained pipeline, with its
existing weights. This lets you include an already trained component in your
pipeline, or update a trained component with more data specific to your use
case.

```ini {title="config.cfg (excerpt)"}
[components]

# "parser" and "ner" are sourced from a trained pipeline
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights. For sourced components, spaCy will keep the existing
weights and [resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained pipeline
as-is. They are also excluded when calling
[`nlp.initialize`](/api/language#initialize).

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during evaluation. This is very important, because they may
> still impact your model's performance – for instance, a sentence boundary
> detector can impact what the parser or entity recognizer considers a valid
> parse. So the evaluation results should always reflect what your pipeline will
> produce at runtime. If you want a frozen component to run (without updating)
> during training as well, so that downstream components can use its
> **predictions**, you should add it to the list of
> [`annotating_components`](/usage/training#annotating-components).

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

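As a rough illustration of how the `pipeline`, `[components]` and
`frozen_components` settings relate, the sketch below checks a parsed config
dictionary and splits the pipeline into frozen and unfrozen components. It is
not spaCy's validation logic: `plan_training` is a hypothetical helper, and the
rule that each block defines exactly one of `factory` or `source` is an
assumption made for this sketch. Whether an unfrozen component is actually
updated still depends on whether it is trainable.

```python
from typing import Any, Dict, List


def plan_training(config: Dict[str, Any]) -> Dict[str, List[str]]:
    """Split the pipeline into unfrozen and frozen component names."""
    pipeline = config["nlp"]["pipeline"]
    components = config["components"]
    frozen = config["training"].get("frozen_components", [])

    for name in pipeline:
        block = components.get(name)
        if block is None:
            raise ValueError(f"No [components.{name}] block for '{name}'")
        # Assumption for this sketch: exactly one of factory/source is set.
        if ("factory" in block) == ("source" in block):
            raise ValueError(f"'{name}' needs exactly one of factory/source")
    unknown = [name for name in frozen if name not in pipeline]
    if unknown:
        raise ValueError(f"Frozen components not in the pipeline: {unknown}")

    return {
        "unfrozen": [name for name in pipeline if name not in frozen],
        "frozen": [name for name in pipeline if name in frozen],
    }


config = {
    "nlp": {"lang": "en", "pipeline": ["parser", "ner", "textcat", "custom"]},
    "components": {
        "parser": {"source": "en_core_web_sm"},
        "ner": {"source": "en_core_web_sm"},
        "textcat": {"factory": "textcat"},
        "custom": {"factory": "your_custom_factory", "your_custom_setting": True},
    },
    "training": {"frozen_components": ["parser", "custom"]},
}
plan = plan_training(config)
print(plan)
```
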
<Infobox variant="warning" title="Shared Tok2Vec listener layer" id="config-components-listeners">

When the components in your pipeline
[share an embedding layer](/usage/embeddings-transformers#embedding-layers), the
**performance** of your frozen component will be **degraded** if you continue
training other layers with the same underlying `Tok2Vec` instance. As a rule of
thumb, ensure that your frozen components are truly **independent** in the
pipeline.

To automatically replace a shared token-to-vector listener with an independent
copy of the token-to-vector layer, you can use the `replace_listeners` setting
of a sourced component, pointing to the listener layer(s) in the config. For
more details on how this works under the hood, see
[`Language.replace_listeners`](/api/language#replace_listeners).

```ini
[training]
frozen_components = ["tagger"]

[components.tagger]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
```

</Infobox>

### Using predictions from preceding components {id="annotating-components",version="3.1"}
|
||
|
||
By default, components are updated in isolation during training, which means
|
||
that they don't see the predictions of any earlier components in the pipeline. A
|
||
component receives [`Example.predicted`](/api/example) as input and compares its
|
||
predictions to [`Example.reference`](/api/example) without saving its
|
||
annotations in the `predicted` doc.
|
||
|
||
Instead, if certain components should **set their annotations** during training,
|
||
use the setting `annotating_components` in the `[training]` block to specify a
|
||
list of components. For example, the feature `DEP` from the parser could be used
|
||
as a tagger feature by including `DEP` in the tok2vec `attrs` and including
|
||
`parser` in `annotating_components`:
|
||
|
||
```ini {title="config.cfg (excerpt)",highlight="7,12"}
|
||
[nlp]
|
||
pipeline = ["parser", "tagger"]
|
||
|
||
[components.tagger.model.tok2vec.embed]
|
||
@architectures = "spacy.MultiHashEmbed.v1"
|
||
width = ${components.tagger.model.tok2vec.encode.width}
|
||
attrs = ["NORM","DEP"]
|
||
rows = [5000,2500]
|
||
include_static_vectors = false
|
||
|
||
[training]
|
||
annotating_components = ["parser"]
|
||
```
|
||
|
||
Any component in the pipeline can be included as an annotating component,
|
||
including frozen components. Frozen components can set annotations during
|
||
training just as they would set annotations during evaluation or when the final
|
||
pipeline is run. The config excerpt below shows how a frozen `ner` component and
|
||
a `sentencizer` can provide the required `doc.sents` and `doc.ents` for the
|
||
entity linker during training:
|
||
|
||
```ini {title="config.cfg (excerpt)"}
|
||
[nlp]
|
||
pipeline = ["sentencizer", "ner", "entity_linker"]
|
||
|
||
[components.ner]
|
||
source = "en_core_web_sm"
|
||
|
||
[training]
|
||
frozen_components = ["ner"]
|
||
annotating_components = ["sentencizer", "ner"]
|
||
```
|
||
|
||
Similarly, a pretrained `tok2vec` layer can be frozen and specified in the list
|
||
of `annotating_components` to ensure that a downstream component can use the
|
||
embedding layer without updating it.
|
||
|
||
<Infobox variant="warning" title="Training speed with annotating components" id="annotating-components-speed">
|
||
|
||
Be aware that non-frozen annotating components with statistical models will
|
||
**run twice** on each batch, once to update the model and once to apply the
|
||
now-updated model to the predicted docs.
|
||
|
||
</Infobox>
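
The cost described in the warning above can be pictured with a toy training step. This is a simplified sketch and not spaCy's actual training loop: `ToyComponent`, `train_step` and the pass counter are hypothetical stand-ins that only count how often each component's model would run on a batch.

```python
class ToyComponent:
    """Hypothetical stand-in for a trainable pipeline component."""

    def __init__(self, name):
        self.name = name
        self.passes = 0  # how many times the model ran over a batch

    def update(self, batch):
        self.passes += 1  # forward pass used to update the weights

    def predict(self, batch):
        self.passes += 1  # forward pass used to set annotations


def train_step(pipeline, batch, frozen, annotating):
    """Update non-frozen components, then let annotating components predict."""
    for component in pipeline:
        if component.name not in frozen:
            component.update(batch)
        if component.name in annotating:
            component.predict(batch)


pipeline = [ToyComponent("parser"), ToyComponent("ner"), ToyComponent("tagger")]
train_step(pipeline, batch=["some text"], frozen={"ner"}, annotating={"parser", "ner"})
passes = {component.name: component.passes for component in pipeline}
print(passes)
```

In this sketch only the non-frozen annotating component (`parser`) runs twice. The frozen annotating component (`ner`) and the plain trainable component (`tagger`) each run once.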
|
||
|
||
### Using registered functions {id="config-functions"}
|
||
|
||
The training configuration defined in the config file doesn't have to consist
only of static values. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick
(see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).
|
||
|
||
```ini {title="With static value"}
|
||
[training]
|
||
batch_size = 128
|
||
```
|
||
|
||
To refer to a function instead, you can make `[training.batch_size]` its own
|
||
section and use the `@` syntax to specify the function and its arguments – in
|
||
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
|
||
defined in the [function registry](/api/top-level#registry). All other values
|
||
defined in the block are passed to the function as keyword arguments when it's
|
||
initialized. You can also use this mechanism to register
|
||
[custom implementations and architectures](#custom-functions) and reference them
|
||
from your configs.
|
||
|
||
> #### How the config is resolved
>
> The config file is parsed into a regular dictionary and is resolved and
> validated **bottom-up**. Arguments provided for registered functions are
> checked against the function's signature and type annotations. The return
> value of a registered function can also be passed into another function – for
> instance, a learning rate schedule can be provided as an argument of an
> optimizer.
|
||
|
||
```ini {title="With registered function"}
|
||
[training.batch_size]
|
||
@schedules = "compounding.v1"
|
||
start = 100
|
||
stop = 1000
|
||
compound = 1.001
|
||
```
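
To get a feel for the values this block produces, here is a minimal pure-Python sketch of a compounding schedule. It assumes the behavior described in the Thinc documentation (each value is the previous one multiplied by `compound`, capped at `stop`) and that `start` is smaller than `stop`. The name `compounding_sketch` is hypothetical and this is not Thinc's implementation.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ..., capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


# The first few batch sizes for the settings used in the config above.
first = list(islice(compounding_sketch(100, 1000, 1.001), 3))
print(first)
```

With a compound rate of `1.001` the batch size grows by 0.1% per step, so it takes a little over 2,300 steps to climb from `100` to the cap of `1000`.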
|
||
|
||
### Model architectures {id="model-architectures"}
|
||
|
||
> #### 💡 Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like ~~Model[List[Doc],
> List[Floats2d]]~~. This so-called generic type describes the layer and its
> input and output type – in this case, it takes a list of `Doc` objects as the
> input and a list of 2-dimensional arrays of floats as the output. You can read
> more about defining Thinc models [here](https://thinc.ai/docs/usage-models).
> Also see the [type checking](https://thinc.ai/docs/usage-type-checking) guide
> for how to enable linting in your editor to see live feedback if your inputs
> and outputs don't match.
|
||
|
||
A **model architecture** is a function that wires up a Thinc
|
||
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
|
||
component or as a layer of a larger network. You can use Thinc as a thin
|
||
[wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as
|
||
PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc
|
||
[directly](https://thinc.ai/docs/usage-models). For more details and examples,
|
||
see the usage guide on [layers and architectures](/usage/layers-architectures).
|
||
|
||
spaCy's built-in components will never construct their `Model` instances
|
||
themselves, so you won't have to subclass the component to change its model
|
||
architecture. You can just **update the config** so that it refers to a
|
||
different registered function. Once the component has been created, its `Model`
|
||
instance has already been assigned, so you cannot change its model architecture.
|
||
The architecture is like a recipe for the network, and you can't change the
|
||
recipe once the dish has already been prepared. You have to make a new one.
|
||
spaCy includes a variety of built-in [architectures](/api/architectures) for
|
||
different tasks. For example:
|
||
|
||
| Architecture                                                      | Description                                                                                                                                                                                                                                              |
| ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN)                   | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~                                                                                   |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble)             | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~                                                  |
|
||
|
||
### Metrics, training output and weighted scores {id="metrics"}
|
||
|
||
When you train a pipeline using the [`spacy train`](/api/cli#train) command,
you'll see a table showing the metrics after each pass over the data. The
available metrics **depend on the pipeline components**. Pipeline components
also define which scores are shown and how they should be **weighted in the
final score** that determines which model is considered the best.
|
||
|
||
The `training.score_weights` setting in your `config.cfg` lets you customize the
|
||
scores shown in the table and how they should be weighted. In this example, the
|
||
labeled dependency accuracy and NER F-score count towards the final score with
|
||
40% each and the tagging accuracy makes up the remaining 20%. The tokenization
|
||
accuracy and speed are both shown in the table, but not counted towards the
|
||
score.
|
||
|
||
> #### Why do I need score weights?
|
||
>
|
||
> At the end of your training process, you typically want to select the **best
|
||
> model** – but what "best" means depends on the available components and your
|
||
> specific use case. For instance, you may prefer a pipeline with higher NER and
|
||
> lower POS tagging accuracy over a pipeline with lower NER and higher POS
|
||
> accuracy. You can express this preference in the score weights, e.g. by
|
||
> assigning `ents_f` (NER F-score) a higher weight.
|
||
|
||
```ini
|
||
[training.score_weights]
|
||
dep_las = 0.4
|
||
dep_uas = null
|
||
ents_f = 0.4
|
||
tag_acc = 0.2
|
||
token_acc = 0.0
|
||
speed = 0.0
|
||
```
|
||
|
||
The `score_weights` don't _have to_ sum to `1.0` – but it's recommended. When
|
||
you generate a config for a given pipeline, the score weights are generated by
|
||
combining and normalizing the default score weights of the pipeline components.
|
||
The default score weights are defined by each pipeline component via the
|
||
`default_score_weights` setting on the
|
||
[`@Language.factory`](/api/language#factory) decorator. By default, all pipeline
|
||
components are weighted equally. If a score weight is set to `null`, it will be
|
||
excluded from the logs and the score won't be weighted.
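
The weighting can be sketched in a few lines of plain Python. The function name `weighted_final_score` and the score values are hypothetical, and the sketch assumes that the final score is the sum of each score multiplied by its weight, with `null` (here `None`) weights skipped.

```python
def weighted_final_score(scores, score_weights):
    """Sum score * weight over all entries whose weight is not None (null)."""
    total = 0.0
    for name, weight in score_weights.items():
        if weight is None:
            continue  # null weights are left out entirely
        total += scores.get(name, 0.0) * weight
    return total


# Weights from the config excerpt above.
score_weights = {
    "dep_las": 0.4,
    "dep_uas": None,
    "ents_f": 0.4,
    "tag_acc": 0.2,
    "token_acc": 0.0,
    "speed": 0.0,
}
# Made-up evaluation results, for illustration only.
scores = {
    "dep_las": 0.90,
    "dep_uas": 0.92,
    "ents_f": 0.80,
    "tag_acc": 0.95,
    "token_acc": 1.0,
    "speed": 20000.0,
}
final = weighted_final_score(scores, score_weights)
print(round(final, 4))
```

Scores with a weight of `0.0`, such as `speed`, contribute nothing to the total, which is why a large raw value like words per second doesn't distort the final score.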
|
||
|
||
<Accordion title="Understanding the training output and score types" spaced id="score-types">
|
||
|
||
| Name              | Description                                                                                                             |
| ----------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Loss**          | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`.      |
| **Precision** (P) | Percentage of predicted annotations that were correct. Should increase.                                                 |
| **Recall** (R)    | Percentage of reference annotations recovered. Should increase.                                                         |
| **F-Score** (F)   | Harmonic mean of precision and recall. Should increase.                                                                 |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Speed**         | Prediction speed in words per second (WPS). Should stay stable.                                                         |
|
||
|
||
Note that if the development data has raw text, some of the gold-standard
|
||
entities might not align to the predicted tokenization. These tokenization
|
||
errors are **excluded from the NER evaluation**. If your tokenization makes it
|
||
impossible for the model to predict 50% of your entities, your NER F-score might
|
||
still look good.
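
To make the precision, recall and F-score rows in the table concrete, here is a small sketch that computes them from raw counts. The function name and the counts are hypothetical, and the sketch returns fractions rather than percentages.

```python
def precision_recall_f(true_positives, predicted_total, reference_total):
    """Compute precision, recall and their harmonic mean from raw counts."""
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# 10 predicted entities, 8 of them correct, 16 entities in the reference data.
p, r, f = precision_recall_f(true_positives=8, predicted_total=10, reference_total=16)
print(p, r, round(f, 3))
```

Because recall is measured against the reference annotations that are counted, leaving misaligned entities out of `reference_total` raises recall and with it the F-score.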
|
||
|
||
</Accordion>
|
||
|
||
## Custom functions {id="custom-functions"}
|
||
|
||
Registered functions in the training config files can refer to built-in
|
||
implementations, but you can also plug in fully **custom implementations**. All
|
||
you need to do is register your function using the `@spacy.registry` decorator
|
||
with the name of the respective [registry](/api/top-level#registry), e.g.
|
||
`@spacy.registry.architectures`, and a string name to assign to your function.
|
||
Registering custom functions allows you to **plug in models** defined in PyTorch
|
||
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
|
||
optimizers or schedules, or **stream in data** and preprocess it on the fly
|
||
while training.
|
||
|
||
Each custom function can have any number of arguments that are passed in via the
[config](#config), just like the built-in functions. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.
|
||
|
||
### Training with custom code {id="custom-code"}
|
||
|
||
> ```bash
|
||
> ### Training
|
||
> $ python -m spacy train config.cfg --code functions.py
|
||
> ```
|
||
>
|
||
> ```bash
|
||
> ### Packaging
|
||
> $ python -m spacy package ./model-best ./packages --code functions.py
|
||
> ```
|
||
|
||
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
|
||
`--code` that points to a Python file. The file is imported before training and
|
||
allows you to add custom functions and architectures to the function registry
|
||
that can then be referenced from your `config.cfg`. This lets you train spaCy
|
||
pipelines with custom components, without having to re-implement the whole
|
||
training workflow. When you package your trained pipeline later using
|
||
[`spacy package`](/api/cli#package), you can provide one or more Python files to
|
||
be included in the package and imported in its `__init__.py`. This means that
|
||
any custom architectures, functions or
|
||
[components](/usage/processing-pipelines#custom-components) will be shipped with
|
||
your pipeline and registered when it's loaded. See the documentation on
|
||
[saving and loading pipelines](/usage/saving-loading#models-custom) for details.
|
||
|
||
#### Example: Modifying the nlp object {id="custom-code-nlp-callbacks"}
|
||
|
||
For many use cases, you don't necessarily want to implement the whole `Language`
|
||
subclass and language data from scratch – it's often enough to make a few small
|
||
modifications, like adjusting the
|
||
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
|
||
[language defaults](/api/language#defaults) like stop words. The config lets you
|
||
provide five optional **callback functions** that give you access to the
|
||
language class and `nlp` object at different points of the lifecycle:
|
||
|
||
| Callback                      | Description                                                                                                                                                                                                                  |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp.before_creation`         | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults) aside from the tokenizer settings. |
| `nlp.after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object.                                                                                 |
| `nlp.after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components.                                                                                   |
| `initialize.before_init`      | Called before the pipeline components are initialized and receives the `nlp` object for in-place modification. Useful for modifying the tokenizer settings, similar to the v2 base model option.                           |
| `initialize.after_init`       | Called after the pipeline components are initialized and receives the `nlp` object for in-place modification.                                                                                                               |
|
||
|
||
The `@spacy.registry.callbacks` decorator lets you register your custom function
|
||
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
|
||
can then reference the function in a config block using the `@callbacks` key. If
|
||
a block contains a key starting with an `@`, it's interpreted as a reference to
|
||
a function. Because you've registered the function, spaCy knows how to create it
|
||
when you reference `"customize_language_data"` in your config. Here's an example
|
||
of a callback that runs before the `nlp` object is created and adds a custom
|
||
stop word to the defaults:
|
||
|
||
> #### config.cfg
|
||
>
|
||
> ```ini
|
||
> [nlp.before_creation]
|
||
> @callbacks = "customize_language_data"
|
||
> ```
|
||
|
||
```python {title="functions.py",highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.add("good")
        return lang_cls

    return customize_language_data
```
|
||
|
||
<Infobox variant="warning">
|
||
|
||
Remember that a registered function should always be a function that spaCy
|
||
**calls to create something**. In this case, it **creates a callback** – it's
|
||
not the callback itself.
|
||
|
||
</Infobox>
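
The two-step pattern described in the note above can be sketched without spaCy. Everything here is a hypothetical stand-in: `REGISTRY` and `register` mimic a function registry, and `ToyLanguage` plays the role of a language class with a `Defaults.stop_words` set.

```python
REGISTRY = {}  # hypothetical stand-in for the callbacks registry


def register(name):
    """Store the decorated function in REGISTRY under the given name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


class ToyDefaults:
    stop_words = {"the", "a"}


class ToyLanguage:
    Defaults = ToyDefaults


@register("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.add("good")
        return lang_cls

    return customize_language_data


# Step 1: the registered function is looked up by name and called.
# It returns the callback and doesn't touch the language class yet.
callback = REGISTRY["customize_language_data"]()
# Step 2: the returned callback is called later with the language class.
lang_cls = callback(ToyLanguage)
print(sorted(lang_cls.Defaults.stop_words))
```

Keeping the two steps separate is what allows the outer function to accept settings from the config while the inner callback only receives the language class.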
|
||
|
||
Any registered function – in this case `create_callback` – can also take
|
||
**arguments** that can be **set by the config**. This lets you implement and
|
||
keep track of different configurations, without having to hack at your code. You
|
||
can choose any arguments that make sense for your use case. In this example,
|
||
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
|
||
(boolean) for printing additional info when the function runs.
|
||
|
||
> #### config.cfg
|
||
>
|
||
> ```ini
|
||
> [nlp.before_creation]
|
||
> @callbacks = "customize_language_data"
|
||
> extra_stop_words = ["ooh", "aah"]
|
||
> debug = true
|
||
> ```
|
||
|
||
```python {title="functions.py",highlight="5,7-9"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data
```
|
||
|
||
<Infobox title="Tip: Use Python type hints" emoji="💡">
|
||
|
||
spaCy's configs are powered by our machine learning library Thinc's
|
||
[configuration system](https://thinc.ai/docs/usage-config), which supports
|
||
[type hints](https://docs.python.org/3/library/typing.html) and even
|
||
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
|
||
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
|
||
function provides type hints, the values that are passed in will be checked
|
||
against the expected types. For example, `debug: bool` in the example above will
|
||
ensure that the value received as the argument `debug` is a boolean. If the
|
||
value can't be coerced into a boolean, spaCy will raise an error.
|
||
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
|
||
error if it's not – for instance, if your config defines `1` instead of `true`.
|
||
|
||
</Infobox>
|
||
|
||
With your `functions.py` defining additional code and the updated `config.cfg`,
|
||
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
|
||
to your Python file. Before loading the config, spaCy will import the
|
||
`functions.py` module and your custom functions will be registered.
|
||
|
||
```bash
|
||
$ python -m spacy train config.cfg --output ./output --code ./functions.py
|
||
```
|
||
|
||
#### Example: Modifying tokenizer settings {id="custom-tokenizer"}
|
||
|
||
Use the `initialize.before_init` callback to modify the tokenizer settings when
|
||
training a new pipeline. Write a registered callback that modifies the tokenizer
|
||
settings and specify this callback in your config:
|
||
|
||
> #### config.cfg
|
||
>
|
||
> ```ini
|
||
> [initialize]
|
||
>
|
||
> [initialize.before_init]
|
||
> @callbacks = "customize_tokenizer"
|
||
> ```
|
||
|
||
```python {title="functions.py"}
from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # remove a suffix
        suffixes = list(nlp.Defaults.suffixes)
        suffixes.remove("\\[")
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search

        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
    return customize_tokenizer
```
|
||
|
||
When training, provide the function above with the `--code` option:
|
||
|
||
```bash
|
||
$ python -m spacy train config.cfg --code ./functions.py
|
||
```
|
||
|
||
Because this callback is only called in the one-time initialization step before
|
||
training, the callback code does not need to be packaged with the final pipeline
|
||
package. However, to make it easier for others to replicate your training setup,
|
||
you can choose to package the initialization callbacks with the pipeline package
|
||
or to publish them separately.
|
||
|
||
<Infobox variant="warning" title="nlp.before_creation vs. initialize.before_init">
|
||
|
||
- `nlp.before_creation` is the best place to modify language defaults other than
|
||
the tokenizer settings.
|
||
- `initialize.before_init` is the best place to modify tokenizer settings when
|
||
training a new pipeline.
|
||
|
||
Unlike the other language defaults, the tokenizer settings are saved with the
|
||
pipeline with `nlp.to_disk()`, so modifications made in `nlp.before_creation`
|
||
will be clobbered by the saved settings when the trained pipeline is loaded from
|
||
disk.
|
||
|
||
</Infobox>
|
||
|
||
#### Example: Custom logging function {id="custom-logging"}
|
||
|
||
During training, the results of each step are passed to a logger function. By
|
||
default, these results are written to the console with the
|
||
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
|
||
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
|
||
[`WandbLogger`](https://github.com/explosion/spacy-loggers#wandblogger). On each
|
||
step, the logger function receives a **dictionary** with the following keys:
|
||
|
||
| Key            | Value                                                                                                 |
| -------------- | ----------------------------------------------------------------------------------------------------- |
| `epoch`        | How many passes over the data have been completed. ~~int~~                                            |
| `step`         | How many steps have been completed. ~~int~~                                                           |
| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                           |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~                |
| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                        |
| `checkpoints`  | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |
|
||
|
||
You can easily implement and plug in your own logger that records the training
|
||
results in a custom way, or sends them to an experiment management tracker of
|
||
your choice. In this example, the function `my_custom_logger.v1` writes the
|
||
tabular results to a file:
|
||
|
||
> ```ini
|
||
> ### config.cfg (excerpt)
|
||
> [training.logger]
|
||
> @loggers = "my_custom_logger.v1"
|
||
> log_path = "my_file.tab"
|
||
> ```
|
||
|
||
```python {title="functions.py"}
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path

@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO=sys.stdout,
        stderr: IO=sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        # Header row: step, score and one loss column per pipeline component
        log_file.write("step\t")
        log_file.write("score\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\t")
        log_file.write("\n")

        def log_step(info: Optional[Dict[str, Any]]):
            # Write one tab-separated row per reported step
            if info:
                log_file.write(f"{info['step']}\t")
                log_file.write(f"{info['score']}\t")
                for pipe in nlp.pipe_names:
                    log_file.write(f"{info['losses'][pipe]}\t")
                log_file.write("\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger
```
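
To see how the pieces fit together, the following sketch drives a condensed, spaCy-free copy of the logger by hand. It is an illustration under assumptions: `toy_nlp` is a hypothetical stand-in that only provides `pipe_names`, the `info` dictionaries contain made-up values, and the condensed version joins the columns with tabs instead of writing a trailing tab.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def custom_logger(log_path):
    """Condensed copy of the logger factory above, without the spaCy imports."""
    def setup_logger(nlp):
        log_file = Path(log_path).open("w", encoding="utf8")
        header = ["step", "score"] + [f"loss_{pipe}" for pipe in nlp.pipe_names]
        log_file.write("\t".join(header) + "\n")

        def log_step(info):
            if info:  # the signature above allows None, which is skipped
                row = [info["step"], info["score"]]
                row += [info["losses"][pipe] for pipe in nlp.pipe_names]
                log_file.write("\t".join(str(value) for value in row) + "\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger


# Hypothetical stand-in for the nlp object: only `pipe_names` is used here.
toy_nlp = SimpleNamespace(pipe_names=["tagger", "ner"])
log_path = Path(tempfile.mkdtemp()) / "my_file.tab"

# The factory returns setup_logger, which returns the two callables.
log_step, finalize = custom_logger(log_path)(toy_nlp)
log_step({"step": 100, "score": 0.5, "losses": {"tagger": 12.5, "ner": 30.0}})
log_step(None)
log_step({"step": 200, "score": 0.75, "losses": {"tagger": 8.0, "ner": 21.5}})
finalize()

lines = log_path.read_text(encoding="utf8").splitlines()
print(lines)
```

The file ends up with one header row and one row per non-empty `info` dictionary. Calling `finalize` closes the file handle.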
|
||
|
||
#### Example: Custom batch size schedule {id="custom-code-schedule"}
|
||
|
||
You can also implement your own batch size schedule to use during training. The
|
||
`@spacy.registry.schedules` decorator lets you register that function in the
|
||
`schedules` [registry](/api/top-level#registry) and assign it a string name:
|
||
|
||
> #### Why the version in the name?
|
||
>
|
||
> A big benefit of the config system is that it makes your experiments
|
||
> reproducible. We recommend versioning the functions you register, especially
|
||
> if you expect them to change (like a new model architecture). This way, you
|
||
> know that a config referencing `v1` means a different function than a config
|
||
> referencing `v2`.
|
||
|
||
```python {title="functions.py"}
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    while True:
        yield start
        start = start * factor
```
|
||
|
||
In your config, you can now reference the schedule in the
|
||
`[training.batch_size]` block via `@schedules`. If a block contains a key
|
||
starting with an `@`, it's interpreted as a reference to a function. All other
|
||
settings in the block will be passed to the function as keyword arguments. Keep
|
||
in mind that the config shouldn't have any hidden defaults and all arguments on
|
||
the functions need to be represented in the config.
|
||
|
||
```ini {title="config.cfg (excerpt)"}
|
||
[training.batch_size]
|
||
@schedules = "my_custom_schedule.v1"
|
||
start = 2
|
||
factor = 1.005
|
||
```
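
The way such a block turns into a function call can be sketched in plain Python. The registry dictionary `SCHEDULES`, the decorator `register_schedule` and the helper `resolve_block` are hypothetical stand-ins for illustration. The real resolution is handled by spaCy's config system and also validates the arguments.

```python
from itertools import islice

SCHEDULES = {}  # hypothetical stand-in for the `schedules` registry


def register_schedule(name):
    """Store the decorated function in SCHEDULES under the given name."""
    def decorator(func):
        SCHEDULES[name] = func
        return func
    return decorator


@register_schedule("my_custom_schedule.v1")
def my_custom_schedule(start=1, factor=1.001):
    while True:
        yield start
        start = start * factor


def resolve_block(block, registry):
    """Call the function named by "@schedules" with the other keys as kwargs."""
    kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
    return registry[block["@schedules"]](**kwargs)


# The [training.batch_size] block from the config, as a dictionary.
block = {"@schedules": "my_custom_schedule.v1", "start": 2, "factor": 1.005}
schedule = resolve_block(block, SCHEDULES)
first_values = list(islice(schedule, 3))
print(first_values)
```

Note that every value after the first is a float in this sketch, because `start` is multiplied by a float `factor` on each step.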
|
||
|
||
### Defining custom architectures {id="custom-architectures"}
|
||
|
||
Built-in pipeline components such as the tagger or named entity recognizer are
|
||
constructed with default neural network [models](/api/architectures). You can
|
||
change the model architecture entirely by implementing your own custom models
|
||
and providing those in the config when creating the pipeline component. See the
|
||
documentation on [layers and model architectures](/usage/layers-architectures)
|
||
for more details.
|
||
|
||
> ```ini
|
||
> ### config.cfg
|
||
> [components.tagger]
|
||
> factory = "tagger"
|
||
>
|
||
> [components.tagger.model]
|
||
> @architectures = "custom_neural_network.v1"
|
||
> output_width = 512
|
||
> ```
|
||
|
||
```python {title="functions.py"}
from typing import List
from thinc.types import Floats2d
from thinc.api import Model
import spacy
from spacy.tokens import Doc

@spacy.registry.architectures("custom_neural_network.v1")
def custom_neural_network(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    # `create_model` is a placeholder for your own code that builds and
    # returns the Thinc model
    return create_model(output_width)
```
|
||
|
||
## Customizing the initialization {id="initialization"}
|
||
|
||
When you start training a new model from scratch,
|
||
[`spacy train`](/api/cli#train) will call
|
||
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and load
|
||
the required data. All settings for this are defined in the
|
||
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
|
||
you can keep track of how the initial `nlp` object was created. The
|
||
initialization process typically includes the following:
|
||
|
||
> #### config.cfg (excerpt)
|
||
>
|
||
> ```ini
|
||
> [initialize]
|
||
> vectors = ${paths.vectors}
|
||
> init_tok2vec = ${paths.init_tok2vec}
|
||
>
|
||
> [initialize.components]
|
||
> # Settings for components
|
||
> ```
|
||
|
||
1. Load in **data resources** defined in the `[initialize]` config, including
   **word vectors** and
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
   weights**.
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
   access the training data, the current `nlp` object and any **custom
   arguments** defined in the `[initialize]` config.
3. In **pipeline components**: if needed, use the data to
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
   set up the label scheme if no labels are provided. Components may also load
   other data like lookup tables or dictionaries.
|
||
|
||
The initialization step allows the config to define **all settings** required
for the pipeline, while keeping a separation between settings and functions that
should only be used **before training** to set up the initial pipeline, and
logic and configuration that needs to be available **at runtime**. Without that
separation, it would be very difficult to use the same, reproducible config file
because the component settings required for training (load data from an external
file) wouldn't match the component settings required at runtime (load what's
included with the saved `nlp` object and don't depend on an external file).
|
||
|
||
![Illustration of pipeline lifecycle](/images/lifecycle.svg)
|
||
|
||
<Infobox title="How components save and load data" emoji="📖">
|
||
|
||
For details and examples of how pipeline components can **save and load data
|
||
assets** like model weights or lookup tables, and how the component
|
||
initialization is implemented under the hood, see the usage guide on
|
||
[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).
|
||
|
||
</Infobox>
|
||
|
||
#### Initializing labels {id="initialization-labels"}
|
||
|
||
Built-in pipeline components like the
|
||
[`EntityRecognizer`](/api/entityrecognizer) or
|
||
[`DependencyParser`](/api/dependencyparser) need to know their available labels
|
||
and associated internal meta information to initialize their model weights.
|
||
Using the `get_examples` callback provided on initialization, they're able to
|
||
**read the labels off the training data** automatically, which is very
|
||
convenient – but it can also slow down the training process to compute this
|
||
information on every run.
|
||
|
||
The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
|
||
files containing the label data for all supported components. You can then pass
|
||
in the labels in the `[initialize]` settings for the respective components to
|
||
allow them to initialize faster.
|
||
|
||
> #### config.cfg
>
> ```ini
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```
|
||
|
||
```bash
|
||
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
|
||
```
|
||
|
||
Under the hood, the command delegates to the `label_data` property of the
|
||
pipeline components, for instance
|
||
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
|
||
|
||
<Infobox variant="warning" title="Important note">
|
||
|
||
The JSON format differs for each component and some components need additional
|
||
meta information about their labels. The format exported by
|
||
[`init labels`](/api/cli#init-labels) matches what the components need, so you
|
||
should always let spaCy **auto-generate the labels** for you.
|
||
|
||
</Infobox>
|
||
|
||
## Data utilities {id="data"}

spaCy includes various features and utilities to make it easy to train models
using your own data, manage training and evaluation corpora, convert existing
annotations and configure data augmentation strategies for more robust models.

### Converting existing corpora and annotations {id="data-convert"}

If you have training data in a standard format like `.conll` or `.conllu`, the
easiest way to convert it for use with spaCy is to run
[`spacy convert`](/api/cli#convert) and pass it a file and an output directory.
By default, the command will pick the converter based on the file extension.

```bash
$ python -m spacy convert ./train.gold.conll ./corpus
```

> #### 💡 Tip: Converting from Prodigy
>
> If you're using the [Prodigy](https://prodi.gy) annotation tool to create
> training data, you can run the
> [`data-to-spacy` command](https://prodi.gy/docs/recipes#data-to-spacy) to
> merge and export multiple datasets for use with
> [`spacy train`](/api/cli#train). Different types of annotations on the same
> text will be combined, giving you one corpus to train multiple components.

<Infobox title="Tip: Manage multi-step workflows with projects" emoji="💡">

Training workflows often consist of multiple steps, from preprocessing the data
all the way to packaging and deploying the trained model.
[spaCy projects](/usage/projects) let you define all steps in one file, manage
data assets, track changes and share your end-to-end processes with your team.

</Infobox>

The binary `.spacy` format is a serialized [`DocBin`](/api/docbin) containing
one or more [`Doc`](/api/doc) objects. It's extremely **efficient in storage**,
especially when packing multiple documents together. You can also create `Doc`
objects manually, so you can write your own custom logic to convert and store
existing annotations for use in spaCy.

```python {title="Training data from Doc objects",highlight="6-9"}
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
docbin = DocBin()
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, True, False]
ents = ["B-ORG", "O", "O", "O", "O", "B-GPE", "O", "O"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)
docbin.add(doc)
docbin.to_disk("./train.spacy")
```

### Working with corpora {id="data-corpora"}

> #### Example
>
> ```ini
> [corpora]
>
> [corpora.train]
> @readers = "spacy.Corpus.v1"
> path = ${paths.train}
> gold_preproc = false
> max_length = 0
> limit = 0
> augmenter = null
>
> [training]
> train_corpus = "corpora.train"
> ```

The [`[corpora]`](/api/data-formats#config-corpora) block in your config lets
you define **data resources** to use for training, evaluation, pretraining or
any other custom workflows. `corpora.train` and `corpora.dev` are used as
conventions within spaCy's default configs, but you can also define any other
custom blocks. Each section in the corpora config should resolve to a
[`Corpus`](/api/corpus) – for example, using spaCy's built-in
[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
[`[training]`](/api/data-formats#config-training) block specify where to find
the corpus in your config. This makes it easy to **swap out** different corpora
by only changing a single config setting.

Instead of making `[corpora]` a block with multiple subsections for each portion
of the data, you can also use a single function that returns a dictionary of
corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
especially useful if you need to split a single file into corpora for training
and evaluation, without loading the same file twice.

By default, the training data is loaded into memory and shuffled before each
epoch. If the corpus is **too large to fit into memory** during training, stream
the corpus using a custom reader as described in the next section.

### Custom data reading and batching {id="custom-code-readers-batchers"}

Some use cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to disk. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects.

In the following example we assume a custom function `read_custom_data` which
loads or generates texts with relevant text classification annotations. Then,
small lexical variations of the input text are created before generating the
final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
lets you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.

> #### config.cfg
>
> ```ini
> [corpora.train]
> @readers = "corpus_variants.v1"
> source = "s3://your_bucket/path/data.csv"
> ```

```python {title="functions.py",highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.training import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream
```

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates the reader function**
– it's not the reader itself.

</Infobox>

If the corpus is **too large to load into memory** or the corpus reader is an
**infinite generator**, use the setting `max_epochs = -1` to indicate that the
train corpus should be streamed. With this setting the train corpus is merely
streamed and batched, not shuffled, so any shuffling needs to be implemented in
the corpus reader itself. In the example below, a corpus reader that generates
sentences containing even or odd numbers is used with an unlimited number of
examples for the train corpus and a limited number of examples for the dev
corpus. The dev corpus should always be finite and fit in memory during the
evaluation step. `max_steps` and/or `patience` are used to determine when the
training should stop.

> #### config.cfg
>
> ```ini
> [corpora.dev]
> @readers = "even_odd.v1"
> limit = 100
>
> [corpora.train]
> @readers = "even_odd.v1"
> limit = -1
>
> [training]
> max_epochs = -1
> patience = 500
> max_steps = 2000
> ```

```python {title="functions.py"}
from typing import Callable, Iterable, Iterator
from spacy import util
import random
from spacy.training import Example
from spacy import Language


@util.registry.readers("even_odd.v1")
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
    return EvenOddCorpus(limit)


class EvenOddCorpus:
    def __init__(self, limit):
        self.limit = limit

    def __call__(self, nlp: Language) -> Iterator[Example]:
        i = 0
        while i < self.limit or self.limit < 0:
            r = random.randint(0, 1000)
            cat = r % 2 == 0
            text = "This is sentence " + str(r)
            yield Example.from_dict(
                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
            )
            i += 1
```

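
Because a streamed corpus isn't shuffled by spaCy, a reader that needs a
randomized order has to provide it itself. One common approach is a bounded
shuffle buffer, sketched here in plain Python. The helper name
`shuffle_buffered`, its `buffer_size` default and the `seed` argument are
assumptions made for this illustration and are not part of spaCy's API. Inside a
reader, you would wrap the generator that yields your `Example` objects.

```python
import random
from itertools import count, islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffered(stream: Iterable[T], buffer_size: int = 1000, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a (possibly infinite) stream using a bounded buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen item to the end and emit it
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Finite streams only: flush the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


# Works on an infinite stream because only `buffer_size` items are held at once
first_items = list(islice(shuffle_buffered(count(), buffer_size=5), 10))
print(first_items)
```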
> #### config.cfg
>
> ```ini
> [initialize.components.textcat.labels]
> @readers = "spacy.read_labels.v1"
> path = "labels/textcat.json"
> require = true
> ```

If the train corpus is streamed, the initialize step peeks at the first 100
examples in the corpus to find the labels for each component. If this isn't
sufficient, you'll need to [provide the labels](#initialization-labels) for each
component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
be used to generate JSON files in the correct format, which you can extend with
the full label set.

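
To see why the first 100 examples may not be enough, consider a stream in which
one label only shows up later. The sketch below is plain Python with
hypothetical names (`stream_examples`, `peek_labels`) and a made-up `RARE`
category. It mimics the idea of peeking at a fixed number of examples and is not
spaCy's actual initialization code.

```python
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Set


def stream_examples() -> Iterator[Dict[str, bool]]:
    """Hypothetical infinite stream of category dicts; "RARE" first appears at example 150."""
    for i in count():
        cats = {"EVEN": i % 2 == 0, "ODD": i % 2 == 1}
        if i >= 150:
            cats["RARE"] = i % 50 == 0
        yield cats


def peek_labels(stream: Iterable[Dict[str, bool]], n: int = 100) -> Set[str]:
    """Collect the label names seen in the first n items of the stream."""
    labels: Set[str] = set()
    for cats in islice(stream, n):
        labels.update(cats)
    return labels


print(sorted(peek_labels(stream_examples())))         # ['EVEN', 'ODD']
print(sorted(peek_labels(stream_examples(), n=200)))  # ['EVEN', 'ODD', 'RARE']
```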
We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the same underlying raw text, to avoid duplicates within each batch. Note
that in a more realistic implementation, you'd also want to check whether the
annotations are the same.

> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```

```python {title="functions.py"}
from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.training import Example

@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
        # Yield the final, partially filled batch so no examples are dropped
        if batch:
            yield batch

    return create_filtered_batches
```

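
The two levels of the batcher can be tried out without a pipeline. The
standalone sketch below repeats the batching logic without the registry
decorator, in a variant that also emits a final, partially filled batch. It uses
`SimpleNamespace` objects with a `text` attribute as stand-ins for `Example`
objects, which is an assumption made purely for this demonstration. It also
shows that duplicates are only filtered within a batch, not across batches.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Iterator, List


def filter_batch(size: int) -> Callable[[Iterable], Iterator[List]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Skip items whose text is already in the current batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    return create_filtered_batches


texts = ["a", "a", "b", "a", "c", "d"]
stream = (SimpleNamespace(text=t) for t in texts)  # stand-ins for Example objects
batcher = filter_batch(size=2)  # step 1: built once from the config settings
batches = [[eg.text for eg in batch] for batch in batcher(stream)]  # step 2: applied to the stream
print(batches)  # [['a', 'b'], ['a', 'c'], ['d']]
```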
{/* TODO: Custom corpus class, Minibatching */}

### Data augmentation {id="data-augmentation"}

Data augmentation is the process of applying small **modifications** to the
training data. It can be especially useful for punctuation and case replacement
– for example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples.

The easiest way to use data augmentation during training is to provide an
`augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
augmenter creates a data augmentation callback that uses orth-variant
replacement.

```ini {title="config.cfg (excerpt)",highlight="8,14"}
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0

[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Percentage of texts that will be augmented / lowercased
level = 0.1
lower = 0.5

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
```

The `orth_variants` argument lets you pass in a dictionary of replacement rules,
typically loaded from a JSON file. There are two types of orth variant rules:
`"single"` for single tokens that should be replaced (e.g. hyphens) and
`"paired"` for pairs of tokens (e.g. quotes).

```json {title="orth_variants.json"}
{
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
  "paired": [
    {
      "tags": ["``", "''"],
      "variants": [
        ["'", "'"],
        ["‘", "’"]
      ]
    }
  ]
}
```

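
To make the difference between the two rule types more concrete, here is a
simplified plain-Python sketch of how such rules could be applied to a tokenized
text. The function `apply_orth_variants` and the representation of tokens as
`(text, tag)` tuples are assumptions made for this illustration. It is not
spaCy's implementation, which additionally handles the `level` and `lower`
settings.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (text, fine-grained tag), a hypothetical representation


def apply_orth_variants(tokens: List[Token], rules: Dict, rng: random.Random) -> List[str]:
    words = [text for text, _ in tokens]
    # "single" rules: every matching token is replaced independently
    for rule in rules.get("single", []):
        for i, (text, tag) in enumerate(tokens):
            if tag in rule["tags"] and text in rule["variants"]:
                words[i] = rng.choice(rule["variants"])
    # "paired" rules: opening and closing tokens are taken from the same variant pair
    for rule in rules.get("paired", []):
        open_tag, close_tag = rule["tags"]
        opening, closing = rng.choice(rule["variants"])
        openers = {pair[0] for pair in rule["variants"]}
        closers = {pair[1] for pair in rule["variants"]}
        for i, (text, tag) in enumerate(tokens):
            if tag == open_tag and text in openers:
                words[i] = opening
            elif tag == close_tag and text in closers:
                words[i] = closing
    return words


rules = {
    "single": [{"tags": ["NFP"], "variants": ["…", "..."]}],
    "paired": [{"tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]]}],
}
tokens = [("‘", "``"), ("Hello", "UH"), ("’", "''"), ("...", "NFP")]
print(apply_orth_variants(tokens, rules, random.Random(0)))
```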
<Accordion title="Full examples for English and German" spaced>

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
```

```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
```

</Accordion>

<Infobox title="Important note" variant="warning">

When adding data augmentation, keep in mind that it typically only makes sense
to apply it to the **training corpus**, not the development data.

</Infobox>

#### Writing custom data augmenters {id="data-augmentation-custom"}

Using the [`@spacy.augmenters`](/api/top-level#registry) registry, you can also
register your own data augmentation callbacks. The callback should be a function
that takes the current `nlp` object and a training [`Example`](/api/example) and
yields `Example` objects. Keep in mind that the augmenter should yield **all
examples** you want to use in your corpus, not only the augmented examples
(unless you want to augment all examples).

Here's an example of a custom augmentation callback that produces text variants
in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
registered function takes one argument, `randomize`, that can be set via the
config and decides whether the uppercase/lowercase transformation is applied
randomly or not. The augmenter yields two `Example` objects: the original
example and the augmented example.

> #### config.cfg
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spongebob_augmenter.v1"
> randomize = false
> ```

```python
import spacy
import random

@spacy.registry.augmenters("spongebob_augmenter.v1")
def create_augmenter(randomize: bool = False):
    def augment(nlp, example):
        text = example.text
        if randomize:
            # Randomly uppercase/lowercase characters
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
        else:
            # Uppercase followed by lowercase
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        # Create augmented training example
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(chars))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        # Original example followed by augmented example
        yield example
        yield example.from_dict(doc, example_dict)

    return augment
```

An easy way to create modified `Example` objects is to use the
[`Example.from_dict`](/api/example#from_dict) method with a new reference
[`Doc`](/api/doc) created from the modified text. In this case, only the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.

Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.

## Parallel & distributed training with Ray {id="parallel-training"}

> #### Installation
>
> ```bash
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
> # Check that the CLI is registered
> $ python -m spacy ray --help
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process. Parallel
training won't always be faster though – it depends on your batch size, models,
and hardware.

<Infobox variant="warning">

To use Ray with spaCy, you need the
[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed.
Installing the package will automatically add the `ray` command to the spaCy
CLI.

</Infobox>

The [`spacy ray train`](/api/cli#ray-train) command follows the same API as
[`spacy train`](/api/cli#train), with a few extra options to configure the Ray
setup. You can optionally set the `--address` option to point to your Ray
cluster. If it's not set, Ray will run locally.

```bash
$ python -m spacy ray train config.cfg --n-workers 2
```

<Project id="integrations/ray">

Get started with parallel training using our project template. It trains a
simple model on a Universal Dependencies Treebank and lets you parallelize the
training with Ray.

</Project>

### How parallel training works {id="parallel-training-details"}

Each worker receives a shard of the **data** and builds a copy of the **model
and optimizer** from the [`config.cfg`](#config). It also has a communication
channel to **pass gradients and parameters** to the other workers. Additionally,
each worker is given ownership of a subset of the parameter arrays. Every
parameter array is owned by exactly one worker, and the workers are given a
mapping so they know which worker owns which parameter.

![Illustration of setup](/images/spacy-ray.svg)

As training proceeds, every worker will be computing gradients for **all** of
the model parameters. When they compute gradients for parameters they don't own,
they'll **send them to the worker** that does own that parameter, along with a
version identifier so that the owner can decide whether to discard the gradient.
Workers use the gradients they receive and the ones they compute locally to
update the parameters they own, and then broadcast the updated array and a new
version ID to the other workers.

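
The ownership and versioning scheme can be illustrated with a small
single-process simulation. Everything in this sketch is an assumption made for
illustration: the `Worker` class, the use of scalar parameters, the rule that a
gradient is discarded unless its version matches the owner's current version,
and the averaging of queued gradients. It does not reproduce the actual
`spacy-ray` implementation or any Ray API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Worker:
    rank: int
    owners: Dict[str, int]  # parameter name -> rank of the owning worker
    params: Dict[str, float]
    versions: Dict[str, int]
    inbox: Dict[str, List[float]] = field(default_factory=dict)

    def receive_gradient(self, name: str, grad: float, version: int) -> bool:
        """Owner side: queue a gradient unless it is misrouted or stale."""
        if self.owners[name] != self.rank or version != self.versions[name]:
            return False
        self.inbox.setdefault(name, []).append(grad)
        return True

    def apply_updates(self, name: str, lr: float = 0.1) -> Tuple[str, float, int]:
        """Owner side: average queued gradients, update the value, bump the version."""
        grads = self.inbox.pop(name, [])
        if grads:
            self.params[name] -= lr * sum(grads) / len(grads)
            self.versions[name] += 1
        return name, self.params[name], self.versions[name]

    def receive_params(self, name: str, value: float, version: int) -> None:
        """Non-owner side: store the broadcast value and its version."""
        self.params[name] = value
        self.versions[name] = version


owners = {"W": 0, "b": 1}
workers = [Worker(rank, owners, {"W": 1.0, "b": 0.0}, {"W": 0, "b": 0}) for rank in range(2)]

# Both workers computed a gradient for "W" on version 0; worker 0 owns "W"
workers[0].receive_gradient("W", 0.2, version=0)  # computed locally
workers[0].receive_gradient("W", 0.4, version=0)  # sent by worker 1
payload = workers[0].apply_updates("W")
workers[1].receive_params(*payload)  # broadcast to the other worker

# A late gradient that still refers to version 0 is discarded by the owner
accepted = workers[0].receive_gradient("W", 0.3, version=0)
print(payload, accepted)
```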
This training procedure is **asynchronous** and **non-blocking**. Workers always
push their gradient increments and parameter updates; they do not have to pull
them and block on the result, so the transfers can happen in the background,
overlapped with the actual training work. The workers also do not have to stop
and wait for each other ("synchronize") at the start of each batch. This is very
useful for spaCy, because spaCy is often trained on long documents, which means
**batches can vary in size** significantly. Uneven workloads make synchronous
gradient descent inefficient, because if one batch is slow, all of the other
workers are stuck waiting for it to complete before they can continue.

## Internal training API {id="api"}

<Infobox variant="danger">

spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your pipelines via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
scripts from scratch. [Custom registered functions](#custom-code) should
typically give you everything you need to train fully custom pipelines with
[`spacy train`](/api/cli#train).

</Infobox>

### Training from a Python script {id="api-train",version="3.2"}

If you want to run the training from a Python script instead of using the
[`spacy train`](/api/cli#train) CLI command, you can call into the
[`train`](/api/cli#train-function) helper function directly. It takes the path
to the config file, an optional output directory and an optional dictionary of
[config overrides](#config-overrides).

```python
from spacy.cli.train import train

train("./config.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})
```

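
Override keys use dot notation to address a value inside a config section, so
`"paths.train"` refers to `train` in the `[paths]` block. The plain-Python
sketch below illustrates that mapping on a nested dictionary. The helper
`apply_overrides` and the sample config values are hypothetical and simplified;
spaCy's own config handling also validates and interpolates values.

```python
import copy
from typing import Any, Dict


def apply_overrides(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with dot-notation overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *sections, name = dotted_key.split(".")
        node = result
        for section in sections:
            # Walk down into the nested section, creating it if needed
            node = node.setdefault(section, {})
        node[name] = value
    return result


config = {"paths": {"train": None, "dev": None}, "training": {"max_steps": 20000}}
overrides = {"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"}
resolved = apply_overrides(config, overrides)
print(resolved["paths"])  # {'train': './train.spacy', 'dev': './dev.spacy'}
```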
### Internal training loop API {id="api-loop"}

<Infobox variant="warning">

This section documents how the training loop and updates to the `nlp` object
work internally. You typically shouldn't have to implement this in Python unless
you're writing your own trainable components. To train a pipeline, use
[`spacy train`](/api/cli#train) or the [`train`](/api/cli#train-function) helper
function instead.

</Infobox>

The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. It also includes the **alignment** between those two
documents if they differ in tokenization. The `Example` class ensures that spaCy
can rely on one **standardized format** that's passed through the pipeline. For
instance, let's say we want to define gold-standard part-of-speech tags:

```python
import numpy
import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
vocab = nlp.vocab
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```

As this is quite verbose, there's an alternative way to create the reference
`Doc` with the gold-standard annotations. The function `Example.from_dict` takes
a dictionary with keyword arguments specifying the annotations, like `tags` or
`entities`. Using the resulting `Example` object and its gold-standard
annotations, the model can be updated to learn a sentence of three words with
their assigned part-of-speech tags.

```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```

Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
a token inside an entity and `L` the last token of an entity.

```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

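
To illustrate how the BILUO prefixes are assigned, the following plain-Python
sketch converts token-level entity spans into tags. The helper `spans_to_biluo`
and its `(start, end, label)` span format with an exclusive end index are
assumptions made for this example. For character offsets, spaCy provides its own
`offsets_to_biluo_tags` utility in `spacy.training`.

```python
from typing import List, Tuple


def spans_to_biluo(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert token-level (start, end, label) spans, end exclusive, to BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
print(spans_to_biluo(len(words), [(0, 1, "ORG"), (2, 3, "TECHNOLOGY"), (4, 5, "DATE")]))
# ['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE']
print(spans_to_biluo(4, [(1, 4, "ORG")]))
# ['O', 'B-ORG', 'I-ORG', 'L-ORG']
```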
<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way – from a `Doc` and a dictionary of
annotations. For more details, see the
[migration guide](/usage/v3#migrating-training).

```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

</Infobox>

Of course, it's not enough to only show a model a single example once.
Especially if you only have a few examples, you'll want to train for a **number
of iterations**. At each iteration, the training data is **shuffled** to ensure
the model doesn't make any generalizations based on the order of examples.
Another technique to improve the learning results is to set a **dropout rate**,
a rate at which to randomly "drop" individual features and representations. This
makes it harder for the model to memorize the training data. For example, a
`0.25` dropout means that each feature or internal representation has a 1/4
likelihood of being dropped.

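
The effect of the dropout rate can be checked with a few lines of plain Python.
The sketch below zeroes each value with probability `rate` and rescales the
surviving values by `1 / (1 - rate)`, a common convention known as inverted
dropout. The function `apply_dropout` is a hypothetical helper for illustration,
not the code that runs inside spaCy's models.

```python
import random
from typing import List


def apply_dropout(features: List[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each feature with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [x / keep if rng.random() >= rate else 0.0 for x in features]


rng = random.Random(0)
out = apply_dropout([1.0] * 10_000, rate=0.25, rng=rng)
dropped = sum(1 for x in out if x == 0.0) / len(out)
# With a rate of 0.25, roughly a quarter of the values end up as zero
print(f"fraction dropped: {dropped:.3f}")
```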
> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.
> - [`Example`](/api/example): object holding predictions and gold-standard
>   annotations.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a
>   directory.

```python {title="Example training loop"}
import random
from spacy.training import Example

# Assumes an existing `nlp` object and `train_data`, a list of
# (raw_text, entity_offsets) tuples
optimizer = nlp.initialize()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/output")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name       | Description                                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                           |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                 |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class
and the "simple training style" of calling `nlp.update` with a text and a
dictionary of annotations. Updating your code to use the `Example` object should
be very straightforward: you can call
[`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the
dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

</Infobox>