mirror of https://github.com/explosion/spaCy.git, synced 2025-11-04 01:48:04 +03:00 (790 lines, 40 KiB, Markdown)

---
title: Training spaCy's Statistical Models
next: /usage/adding-languages
menu:
  - ['Basics', 'basics']
  - ['NER', 'ner']
  - ['Tagger & Parser', 'tagger-parser']
  - ['Text Classification', 'textcat']
  - ['Entity Linking', 'entity-linker']
  - ['Tips and Advice', 'tips']
---

This guide describes how to train new statistical models for spaCy's
part-of-speech tagger, named entity recognizer, dependency parser, text
classifier and entity linker. Once the model is trained, you can then
[save and load](/usage/saving-loading#models) it.

## Training basics {#basics}

import Training101 from 'usage/101/\_training.md'

<Training101 />

### Training via the command-line interface {#spacy-train-cli}

For most purposes, the best way to train spaCy is via the command-line
interface. The [`spacy train`](/api/cli#train) command takes care of many
details for you, including making sure that the data is minibatched and shuffled
correctly, progress is printed, and models are saved after each epoch. You can
prepare your data for use in [`spacy train`](/api/cli#train) using the
[`spacy convert`](/api/cli#convert) command, which accepts many common NLP data
formats, including `.iob` for named entities, and the CoNLL format for
dependencies:

```bash
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```

<Infobox title="Tip: Debug your data">

If you're running spaCy v2.2 or above, you can use the
[`debug-data` command](/api/cli#debug-data) to analyze and validate your
training and development data, get useful stats, and find problems like invalid
entity annotations, cyclic dependencies, low data labels and more.

```bash
$ python -m spacy debug-data en train.json dev.json --verbose
```

</Infobox>

You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper
to convert a list of `Doc` objects to spaCy's JSON training format.

#### Understanding the training output

When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing metrics after each pass over the data. Here's what those
metrics mean:

> #### Tokenization metrics
>
> Note that if the development data has raw text, some of the gold-standard
> entities might not align to the predicted tokenization. These tokenization
> errors are **excluded from the NER evaluation**. If your tokenization makes it
> impossible for the model to predict 50% of your entities, your NER F-score
> might still look good.

| Name       | Description                                                                                       |
| ---------- | ------------------------------------------------------------------------------------------------- |
| `Dep Loss` | Training loss for dependency parser. Should decrease, but usually not to 0.                       |
| `NER Loss` | Training loss for named entity recognizer. Should decrease, but usually not to 0.                 |
| `UAS`      | Unlabeled attachment score for parser. The percentage of unlabeled correct arcs. Should increase. |
| `NER P.`   | NER precision on development data. Should increase.                                               |
| `NER R.`   | NER recall on development data. Should increase.                                                  |
| `NER F.`   | NER F-score on development data. Should increase.                                                 |
| `Tag %`    | Fine-grained part-of-speech tag accuracy on development data. Should increase.                    |
| `Token %`  | Tokenization accuracy on development data.                                                        |
| `CPU WPS`  | Prediction speed on CPU in words per second, if available. Should stay stable.                    |
| `GPU WPS`  | Prediction speed on GPU in words per second, if available. Should stay stable.                    |

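To make the precision, recall, F-score and UAS columns more concrete, here is a minimal sketch of how such numbers can be computed from gold and predicted annotations. The helper names `prf` and `uas` are made up for this illustration, and spaCy's own scorer may differ in details (for example in how it treats misaligned tokens), so read this as an approximation of the idea rather than the actual implementation.

```python
def prf(gold_spans, predicted_spans):
    """Return (precision, recall, F-score) for two collections of entity spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def uas(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index matches the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted heads must have the same length")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Hypothetical annotations: one entity is found, one is missed, one is spurious
gold_entities = [(0, 4, "ORG"), (23, 29, "GPE")]
predicted_entities = [(0, 4, "ORG"), (0, 7, "PERSON")]
precision, recall, f_score = prf(gold_entities, predicted_entities)

# Hypothetical head indices for a three-token sentence
attachment = uas([1, 1, 1], [1, 1, 0])
```

An entity only counts as correct here if its start, end and label all match, which is why a single wrong boundary lowers precision and recall at the same time.
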
### Improving accuracy with transfer learning {#transfer-learning new="2.1"}

In most projects, you'll have a small amount of labelled data and access to a
much bigger sample of raw text. The raw text contains a lot of information about
the language in general. Learning this general information from the raw text can
help your model use the smaller labelled data more efficiently.

The two main ways to use raw text in your spaCy models are **word vectors** and
**language model pretraining**. Word vectors provide information about the
definitions of words. The vectors are a look-up table, so each word only has one
representation, regardless of its context. Language model pretraining lets you
learn contextualized word representations. Instead of initializing spaCy's
convolutional neural network layers with random weights, the `spacy pretrain`
command trains a language model to predict each word's word vector based on the
surrounding words. The information the network learns from this prediction task
is a good starting point for other tasks such as named entity recognition, text
classification or dependency parsing.

<Infobox title="📖 Vectors and pretraining">

For more details, see the documentation on
[vectors and similarity](/usage/vectors-similarity) and the
[`spacy pretrain`](/api/cli#pretrain) command.

</Infobox>

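The "look-up table" property of word vectors can be illustrated with a plain dictionary. The two-dimensional vectors and the `lookup` helper below are invented for this sketch and have nothing to do with spaCy's real vector storage; the point is only that a static table returns the same vector for a word no matter what surrounds it.

```python
# Hypothetical toy vectors; real tables have hundreds of dimensions
vectors = {"bank": [0.1, 0.9], "river": [0.7, 0.2], "money": [0.2, 0.8]}


def lookup(tokens, table, dim=2):
    """Return one static vector per token; unknown words get a zero vector."""
    return [table.get(token.lower(), [0.0] * dim) for token in tokens]


river_sentence = lookup(["river", "bank"], vectors)
money_sentence = lookup(["money", "bank"], vectors)

# "bank" gets the same vector in both sentences, whatever its neighbours are
same_vector = river_sentence[1] == money_sentence[1]
```

A contextualized representation would instead produce different values for the two occurrences of "bank", which is the kind of information pretraining is meant to capture.
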
### How do I get training data? {#training-data}

Collecting training data may sound incredibly painful – and it can be, if you're
planning a large-scale annotation project. However, if your main goal is to
update an existing model's predictions – for example, spaCy's named entity
recognition – the hard part is usually not creating the actual annotations. It's
finding representative examples and **extracting potential candidates**. The
good news is, if you've been noticing bad performance on your data, you likely
already have some relevant text, and you can use spaCy to **bootstrap a first
set of training examples**. For example, after processing a few sentences, you
may end up with the following entities, some correct, some incorrect.

> #### How many examples do I need?
>
> As a rule of thumb, you should allocate at least 10% of your project resources
> to creating training and evaluation data. If you're looking to improve an
> existing model, you might be able to start off with only a handful of
> examples. Keep in mind that you'll always want a lot more than that for
> **evaluation** – especially previous errors the model has made. Otherwise, you
> won't be able to sufficiently verify that the model has actually made the
> **correct generalizations** required for your use case.

| Text                                | Entity  | Start | End  | Label    | Correct |
| ----------------------------------- | ------- | ----- | ---- | -------- | ------- |
| Uber blew through $1 million a week | Uber    | `0`   | `4`  | `ORG`    | ✅      |
| Android Pay expands to Canada       | Android | `0`   | `7`  | `PERSON` | ❌      |
| Android Pay expands to Canada       | Canada  | `23`  | `29` | `GPE`    | ✅      |
| Spotify steps up Asia expansion     | Spotify | `0`   | `7`  | `ORG`    | ✅      |
| Spotify steps up Asia expansion     | Asia    | `17`  | `21` | `NORP`   | ❌      |

Alternatively, the [rule-based matcher](/usage/rule-based-matching) can be a
useful tool to extract tokens or combinations of tokens, as well as their start
and end index in a document. In this case, we'll extract mentions of Google and
assume they're an `ORG`.

| Text                                  | Entity | Start | End  | Label | Correct |
| ------------------------------------- | ------ | ----- | ---- | ----- | ------- |
| let me google this for you            | google | `7`   | `13` | `ORG` | ❌      |
| Google Maps launches location sharing | Google | `0`   | `6`  | `ORG` | ❌      |
| Google rebrands its business apps     | Google | `0`   | `6`  | `ORG` | ✅      |
| look what i found on google! 😂       | google | `21`  | `27` | `ORG` | ✅      |

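The candidate extraction behind a table like this can be sketched with nothing more than the standard library. The snippet below uses a regular expression as a stand-in for spaCy's rule-based matcher (it is not the `Matcher` API), and `find_candidates` is a hypothetical helper name. It returns character offsets in the same `(start, end, label)` shape used for training examples in this guide.

```python
import re


def find_candidates(texts, pattern, label):
    """Yield (text, [(start, end, label)]) for every text containing a match."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    for text in texts:
        spans = [(m.start(), m.end(), label) for m in regex.finditer(text)]
        if spans:
            yield text, spans


texts = [
    "let me google this for you",
    "Google Maps launches location sharing",
    "Google rebrands its business apps",
    "look what i found on google! 😂",
]
# Every match is only a candidate: it still needs to be accepted or rejected
candidates = list(find_candidates(texts, r"\bgoogle\b", "ORG"))
```

Each candidate still has to be reviewed by a person, since a pattern cannot tell the verb "google" or the product "Google Maps" apart from the company.
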
Based on the few examples above, you can already create six training sentences
with eight entities in total. Of course, what you consider a "correct
annotation" will always depend on **what you want the model to learn**. While
there are some entity annotations that are more or less universally correct –
like Canada being a geopolitical entity – your application may have its very own
definition of the [NER annotation scheme](/api/annotation#named-entities).

```python
# Each entity is (start_char, end_char, label); the end offset is exclusive
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")]),
]
```

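Character offsets are easy to get wrong by one, so it can be worth checking them before training. The following sketch is a hypothetical helper, not part of spaCy, that flags spans which run past the end of the text or include surrounding whitespace. The example data is deliberately broken to show both cases.

```python
def check_offsets(train_data):
    """Return a list of problems found in (text, [(start, end, label)]) examples."""
    problems = []
    for text, entities in train_data:
        for start, end, label in entities:
            if not 0 <= start < end <= len(text):
                problems.append((text, (start, end, label), "out of range"))
                continue
            span = text[start:end]
            if span != span.strip():
                problems.append((text, (start, end, label), "surrounding whitespace"))
    return problems


# Deliberately faulty offsets: (23, 30) overshoots, (0, 8) includes a space
examples = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 30, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 8, "ORG")]),
]
problems = check_offsets(examples)
for text, span, reason in problems:
    print(reason, span, repr(text[span[0]:span[1]]))
```

A check like this only catches mechanical errors. Whether a span carries the right label remains a judgment about your annotation scheme.
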
<Infobox title="Tip: Try the Prodigy annotation tool">

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy models.

</Infobox>

### Training with annotations {#annotations}

The [`GoldParse`](/api/goldparse) object collects the annotated training
examples, also called the **gold standard**. It's initialized with the
[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
annotations, like `tags` or `entities`. Its job is to encode the annotations,
keep them aligned and create the C-level data structures required for efficient
access. Here's an example of a simple `GoldParse` for part-of-speech tags:

```python
from spacy.gold import GoldParse
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
doc = Doc(vocab, words=["I", "like", "stuff"])
gold = GoldParse(doc, tags=["N", "V", "N"])
```

Using the `Doc` and its gold-standard annotations, the model can be updated to
learn a sentence of three words with their assigned part-of-speech tags. The
[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
the annotation scheme. If you're training a new language model, this will let
you map the tags present in the treebank you train on to spaCy's tag scheme.

```python
doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
```

The same goes for named entities. The letters added before the labels refer to
the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O`
is a token outside an entity, `U` a single-token entity unit, `B` the beginning
of an entity, `I` a token inside an entity and `L` the last token of an entity.

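To see how character offsets relate to BILUO tags, here is a small standard-library sketch. It assumes plain whitespace tokenization, which is much cruder than spaCy's tokenizer, and `biluo_from_offsets` is a name invented for this example. spaCy has its own conversion helpers in `spacy.gold`, which are the ones to use in real code.

```python
def biluo_from_offsets(text, entities):
    """Whitespace-tokenize `text` and return (tokens, BILUO tags)."""
    tokens, starts, position = [], [], 0
    for token in text.split():
        start = text.index(token, position)
        tokens.append(token)
        starts.append(start)
        position = start + len(token)

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie completely inside the entity span
        inside = [
            i for i, (token, start) in enumerate(zip(tokens, starts))
            if start >= ent_start and start + len(token) <= ent_end
        ]
        if not inside:
            continue  # the entity does not line up with any whole token
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            tags[inside[-1]] = "L-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
    return tokens, tags


tokens, tags = biluo_from_offsets(
    "Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]
)
```

The two-token entity "Android Pay" becomes a `B-`/`L-` pair, while the single-token entity "Canada" becomes a `U-` tag and everything else stays `O`.
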
> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.

Of course, it's not enough to only show a model a single example once.
Especially if you only have a few examples, you'll want to train for a **number
of iterations**. At each iteration, the training data is **shuffled** to ensure
the model doesn't make any generalizations based on the order of examples.
Another technique to improve the learning results is to set a **dropout rate**,
a rate at which to randomly "drop" individual features and representations. This
makes it harder for the model to memorize the training data. For example, a
`0.25` dropout means that each feature or internal representation has a 1/4
likelihood of being dropped.

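The effect of a dropout rate can be sketched in a few lines of plain Python. This is an illustration of the general technique, not of spaCy's internals: `apply_dropout` is a hypothetical helper, and rescaling the surviving values by `1 / (1 - rate)` is a common convention that is assumed here.

```python
import random


def apply_dropout(features, rate, rng):
    """Zero each feature with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() >= rate else 0.0 for x in features]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
features = [1.0] * 10000
dropped = apply_dropout(features, 0.25, rng)

# With a rate of 0.25, roughly a quarter of the features end up as zero
fraction_dropped = dropped.count(0.0) / len(dropped)
```

Because a different random subset is dropped on every update, the model cannot rely on any single feature to recognize a training example.
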
> - [`begin_training()`](/api/language#begin_training): Start the training and
>   return an optimizer function to update the model's weights. Can take an
>   optional function converting the training data to spaCy's training format.
> - [`update()`](/api/language#update): Update the model with the training
>   example and gold data.
> - [`to_disk()`](/api/language#to_disk): Save the updated model to a directory.

```python
### Example training loop
import random
from spacy.gold import GoldParse

# nlp, train_data and get_data are assumed to be defined elsewhere
optimizer = nlp.begin_training(get_data)
for itn in range(100):
    random.shuffle(train_data)  # new order on every iteration
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk("/model")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name    | Description                                                                                                                                                                                                   |
| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`  | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts.                             |
| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
| `drop`  | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                                                        |
| `sgd`   | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use.                                                                             |

Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's
[JSON format](/api/annotation#json-input). On each epoch, a model will be saved
out to the directory. After training, you can use the
[`package`](/api/cli#package) command to generate an installable Python package
from your model.

```bash
python -m spacy convert /tmp/train.conllu /tmp/data
python -m spacy train en /tmp/model /tmp/data/train.json -n 5
```

### Simple training style {#training-simple-style new="2"}

Instead of sequences of `Doc` and `GoldParse` objects, you can also use the
"simple training style" and pass **raw texts** and **dictionaries of
annotations** to [`nlp.update`](/api/language#update). The dictionaries can have
the keys `entities`, `heads`, `deps`, `tags` and `cats`. This is generally
recommended, as it removes one layer of abstraction, and avoids unnecessary
imports. It also makes it easier to structure and load your training data.

> #### Example Annotations
>
> ```python
> {
>    "entities": [(0, 4, "ORG")],
>    "heads": [1, 1, 1, 5, 5, 2, 7, 5],
>    "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
>    "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
>    "cats": {"BUSINESS": 1.0},
> }
> ```

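Since the token-level keys have to line up with the tokens of the text, a quick consistency check on an annotation dictionary can save debugging time. The sketch below is a hypothetical helper, not a spaCy function. It assumes that `heads` holds absolute token indices (the `ROOT` token points at itself in the example annotations) and that the sentence is split into eight tokens.

```python
def check_annotations(n_tokens, annotations):
    """Return a list of error strings for one annotation dictionary."""
    errors = []
    for key in ("heads", "deps", "tags"):
        if key in annotations and len(annotations[key]) != n_tokens:
            errors.append(
                "%s has %d values, expected %d" % (key, len(annotations[key]), n_tokens)
            )
    for i, head in enumerate(annotations.get("heads", [])):
        if not 0 <= head < n_tokens:
            errors.append("head %d of token %d is out of range" % (head, i))
    return errors


annotations = {
    "entities": [(0, 4, "ORG")],
    "heads": [1, 1, 1, 5, 5, 2, 7, 5],
    "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
    "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
    "cats": {"BUSINESS": 1.0},
}
# Assumed tokenization: Uber / blew / through / $ / 1 / million / a / week
errors = check_annotations(8, annotations)
```

An empty list means the lengths and head indices are consistent. It says nothing about whether the labels themselves are right.
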
```python
### Simple training loop
import random
import spacy

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

nlp = spacy.blank("en")
# A blank model has no components, so add an entity recognizer and its label
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ORG")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
```

The above training loop leaves out a few details that can really improve
accuracy – but the principle really is _that_ simple. Once you've got your
pipeline together and you want to tune the accuracy, you usually want to process
your training examples in batches, and experiment with
[`minibatch`](/api/top-level#util.minibatch) sizes and dropout rates, set via
the `drop` keyword argument. See the [`Language`](/api/language) and
[`Pipe`](/api/pipe) API docs for available options.

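Batching itself is a small piece of logic. The generator below is a plain-Python sketch of the idea, with the made-up name `simple_minibatch` to make clear that it is not spaCy's own `minibatch` utility. The placeholder training data is invented, and the `nlp.update` call is left as a comment so the snippet runs without a model.

```python
import random


def simple_minibatch(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller batch


# Placeholder examples in the simple training style
TRAIN_DATA = [("text %d" % i, {"entities": []}) for i in range(10)]
random.Random(0).shuffle(TRAIN_DATA)

batches = list(simple_minibatch(TRAIN_DATA, size=4))
for batch in batches:
    texts, annotations = zip(*batch)
    # With a spaCy v2 pipeline, the update for one batch would go here, e.g.:
    # nlp.update(texts, annotations, drop=0.2, sgd=optimizer)
```

The batch size and the dropout rate are the two knobs the paragraph above suggests experimenting with. The values `4` and `0.2` are arbitrary.
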
## Training the named entity recognizer {#ner}

All [spaCy models](/models) support online learning, so you can update a
pretrained model with new examples. You'll usually need to provide many
**examples** to meaningfully improve the system – a few hundred is a good start,
although more is better.

You should avoid iterating over the same few examples multiple times, or the
model is likely to "forget" how to annotate other examples. If you iterate over
the same few examples, you're effectively changing the loss function. The
optimizer will find a way to minimize the loss on your examples, without regard
for the consequences on the examples it's no longer paying attention to. One way
to avoid this
["catastrophic forgetting" problem](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting)
is to "remind" the model of other examples by augmenting your annotations with
sentences annotated with entities automatically recognized by the original
model. Ultimately, this is an empirical process: you'll need to **experiment on
your data** to find a solution that works best for you.

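The "reminding" strategy can be sketched as a data-preparation step. In the snippet below, `mix_with_revision` and its `ratio` parameter are hypothetical, and `original_model` is a stub standing in for the predictions of the model you start from. With a loaded spaCy pipeline, the annotations would more likely come from something like `[(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`.

```python
import random


def mix_with_revision(new_examples, raw_texts, annotate, ratio=1.0, seed=0):
    """Combine new examples with texts labelled by the original model.

    `annotate` is any callable mapping a text to a list of (start, end, label).
    `ratio` is the number of revision examples to add per new example.
    """
    rng = random.Random(seed)
    n_revision = min(len(raw_texts), int(len(new_examples) * ratio))
    revision = [
        (text, {"entities": annotate(text)})
        for text in rng.sample(raw_texts, n_revision)
    ]
    mixed = list(new_examples) + revision
    rng.shuffle(mixed)
    return mixed


def original_model(text):
    """Stub for the original model's entity predictions."""
    return [(0, 6, "ORG")] if text.startswith("Google") else []


new_examples = [
    ("Horses are too tall", {"entities": [(0, 6, "ANIMAL")]}),
    ("Do they bite?", {"entities": []}),
]
raw_texts = ["Google rebrands its business apps", "Nothing to see here", "Google it"]
mixed = mix_with_revision(new_examples, raw_texts, original_model)
```

How many revision examples to mix in is one of the things to experiment with. A ratio of one revision example per new example is just a starting point chosen for this sketch.
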
> #### Tip: Converting entity annotations
>
> You can train the entity recognizer with entity offsets or annotations in the
> [BILUO scheme](/api/annotation#biluo). The `spacy.gold` module also exposes
> [two helper functions](/api/goldparse#util) to convert offsets to BILUO tags,
> and BILUO tags to entity offsets.

### Updating the Named Entity Recognizer {#example-train-ner}

This example shows how to update spaCy's entity recognizer with your own
examples, starting off with an existing, pretrained model, or from scratch using
a blank `Language` class. To do this, you'll need **example texts** and the
**character offsets** and **labels** of each entity contained in the texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py
```

#### Step by step guide {#step-by-step-ner}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the entity recognizer to the
   pipeline. If you're using an existing model, make sure to disable all other
   pipeline components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the entity recognizer.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.

### Training an additional entity type {#example-new-entity-type}

This script shows how to add a new entity type `ANIMAL` to an existing
pretrained NER model, or an empty `Language` class. To keep the example short
and simple, only a few sentences are provided as examples. In practice, you'll
need many more – a few hundred would be a good start. You will also likely need
to mix in examples of other entity types, which might be obtained by running the
entity recognizer over unlabelled sentences, and adding their annotations to the
training set.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_new_entity_type.py
```

<Infobox title="Important note" variant="warning">

If you're using an existing model, make sure to mix in examples of **other
entity types** that spaCy correctly recognized before. Otherwise, your model
might learn the new type, but "forget" what it previously knew. This is also
referred to as the "catastrophic forgetting" problem.

</Infobox>

#### Step by step guide {#step-by-step-ner-new}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the entity recognizer to the
   pipeline. If you're using an existing model, make sure to disable all other
   pipeline components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the entity recognizer.
2. **Add the new entity label** to the entity recognizer using the
   [`add_label`](/api/entityrecognizer#add_label) method. You can access the
   entity recognizer in the pipeline via `nlp.get_pipe('ner')`.
3. **Loop over** the examples and call [`nlp.update`](/api/language#update),
   which steps through the words of the input. At each word, it makes a
   **prediction**. It then consults the annotations, to see whether it was
   right. If it was wrong, it adjusts its weights so that the correct action
   will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the new entity is recognized correctly.

## Training the tagger and parser {#tagger-parser}

### Updating the Dependency Parser {#example-train-parser}

This example shows how to train spaCy's dependency parser, starting off with an
existing model or a blank model. You'll need a set of **training examples** and
the respective **heads** and **dependency label** for each token of the example
texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py
```

#### Step by step guide {#step-by-step-parser}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the parser to the pipeline.
   If you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the parser.
2. **Add the dependency labels** to the parser using the
   [`add_label`](/api/dependencyparser#add_label) method. If you're starting off
   with a pretrained spaCy model, this is usually not necessary – but it doesn't
   hurt either, just to be safe.
3. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the parser works as expected.

### Updating the Part-of-speech Tagger {#example-train-tagger}

In this example, we're training spaCy's part-of-speech tagger with a custom tag
map. We start off with a blank `Language` class, update its defaults with our
custom tags and then train the tagger. You'll need a set of **training
examples** and the respective **custom tags**, as well as a dictionary mapping
those tags to the
[Universal Dependencies scheme](http://universaldependencies.github.io/docs/u/pos/index.html).

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_tagger.py
```

#### Step by step guide {#step-by-step-tagger}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the tagger to the pipeline.
   If you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the tagger.
2. **Add the tag map** to the tagger using the
   [`add_label`](/api/tagger#add_label) method. The first argument is the new
   tag name, the second the mapping to spaCy's coarse-grained tags, e.g.
   `{'pos': 'NOUN'}`.
3. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the tagger works as expected.

### Training a parser for custom semantics {#intent-parser}

spaCy's parser component can be trained to predict any type of tree structure
over your input text – including **semantic relations** that are not syntactic
dependencies. This can be useful for **conversational applications**, which need
to predict trees over whole documents or chat logs, with connections between the
sentence roots used to annotate discourse structure. For example, you can train
spaCy's parser to label intents and their targets, like attributes, quality,
time and locations. The result could look like this:

```python
 | 
						||
doc = nlp("find a hotel with good wifi")
 | 
						||
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
 | 
						||
# [('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'),
 | 
						||
#  ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
 | 
						||
```
 | 
						||
 | 
						||
The above tree attaches "wifi" to "hotel" and assigns the dependency label
 | 
						||
`ATTRIBUTE`. This may not be a correct syntactic dependency – but in this case,
 | 
						||
it expresses exactly what we need: the user is looking for a hotel with the
 | 
						||
attribute "wifi" of the quality "good". This query can then be processed by your
 | 
						||
application and used to trigger the respective action – e.g. search the database
 | 
						||
for hotels with high ratings for their wifi offerings.
 | 
						||
 | 
						||
> #### Tip: merge phrases and entities
>
> To achieve even better accuracy, try merging multi-word tokens and entities
> specific to your domain into one token before parsing your text. You can do
> this by running the entity recognizer or
> [rule-based matcher](/usage/rule-based-matching) to find relevant spans, and
> merging them using [`Doc.retokenize`](/api/doc#retokenize). You could even add
> your own custom
> [pipeline component](/usage/processing-pipelines#custom-components) to do this
> automatically – just make sure to add it `before='parser'`.

The following example shows a full implementation of a training loop for a
custom message parser for a common "chat intent": finding local businesses. Our
message semantics will have the following types of relations: `ROOT`, `PLACE`,
`QUALITY`, `ATTRIBUTE`, `TIME` and `LOCATION`.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_intent_parser.py
```
 | 
						||
 | 
						||
#### Step by step guide {#step-by-step-parser-custom}

1. **Create the training data** consisting of words, their heads and their
   dependency labels in order. A token's head is the index of the token it is
   attached to. The heads don't need to be syntactically correct – they should
   express the **semantic relations** you want the parser to learn. For words
   that shouldn't receive a label, you can choose an arbitrary placeholder, for
   example `-`.
2. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the custom parser to the
   pipeline. If you're using an existing model, make sure to **remove the old
   parser** from the pipeline, and disable all other pipeline components during
   training using [`nlp.disable_pipes`](/api/language#disable_pipes). This way,
   you'll only be training the parser.
3. **Add the dependency labels** to the parser using the
   [`add_label`](/api/dependencyparser#add_label) method.
4. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
5. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
6. **Test** the model to make sure the parser works as expected.
 | 
						||
 | 
						||
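
To make step 1 concrete, here is one way the annotation for the "find a hotel with good wifi" message could be written down and sanity-checked in plain Python. The heads chosen for the unlabeled words "a" and "with" are assumptions made for this sketch, and `relations` is a hypothetical helper, not part of spaCy.

```python
# One message as parallel lists. Heads are token indices; "-" marks words
# that should not receive a label (the placeholder is arbitrary).
words = ["find", "a", "hotel", "with", "good", "wifi"]
heads = [0, 2, 0, 5, 5, 2]
deps = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]


def relations(words, heads, deps, placeholder="-"):
    """Return (word, label, head word) triples, skipping placeholder labels."""
    if not (len(words) == len(heads) == len(deps)):
        raise ValueError("words, heads and deps must have the same length")
    return [
        (word, dep, words[head])
        for word, head, dep in zip(words, heads, deps)
        if dep != placeholder
    ]


print(relations(words, heads, deps))
```

Reading the annotation back as triples like this makes it easier to spot a head index that points at the wrong word before you start training.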
## Training a text classification model {#textcat}

### Adding a text classifier to a spaCy model {#example-textcat new="2"}

This example shows how to train a convolutional neural network text classifier
on IMDB movie reviews, using spaCy's new
[`TextCategorizer`](/api/textcategorizer) component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. Predictions are available via
[`Doc.cats`](/api/doc#attributes).

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py
```
 | 
						||
 | 
						||
#### Step by step guide {#step-by-step-textcat}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the text classifier.
2. **Add the text classifier** to the pipeline, and add the labels you want to
   train – for example, `POSITIVE`.
3. **Load and pre-process the dataset**, shuffle the data and split off a part
   of it to hold back for evaluation. This way, you'll be able to see results on
   each training iteration.
4. **Loop over** the training examples and partition them into batches using
   spaCy's [`minibatch`](/api/top-level#util.minibatch) and
   [`compounding`](/api/top-level#util.compounding) helpers.
5. **Update the model** by calling [`nlp.update`](/api/language#update), which
   steps through the examples and makes a **prediction**. It then consults the
   annotations to see whether it was right. If it was wrong, it adjusts its
   weights so that the correct prediction will score higher next time.
6. Optionally, you can also **evaluate the text classifier** on each iteration,
   by checking how it performs on the development data held back from the
   dataset. This lets you print the **precision**, **recall** and **F-score**.
7. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
8. **Test** the model to make sure the text classifier works as expected.
 | 
						||
 | 
						||
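
The evaluation in step 6 boils down to counting true positives, false positives and false negatives for a label. The following is a minimal, library-free sketch of that calculation for a single label such as `POSITIVE`; the `evaluate` function, the `0.5` threshold and the toy scores are assumptions for illustration, not the code used by the example script.

```python
def evaluate(gold, scores, threshold=0.5):
    """Precision, recall and F-score for one label.

    gold:   booleans, True if the document really has the label
    scores: predicted scores for the label, one per document
    """
    tp = fp = fn = 0
    for is_positive, score in zip(gold, scores):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted and not is_positive:
            fp += 1
        elif not predicted and is_positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


# Toy held-back data: two positive and two negative documents.
gold = [True, True, False, False]
scores = [0.9, 0.4, 0.7, 0.1]
precision, recall, f_score = evaluate(gold, scores)
print(precision, recall, f_score)
```

With real data, `scores` would come from the label's entry in `doc.cats` for each held-back text.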
## Entity linking {#entity-linker}

To train an entity linking model, you first need to define a knowledge base
(KB).

### Creating a knowledge base {#kb}

A KB consists of a list of entities with unique identifiers. Each such entity
has an entity vector that will be used to measure similarity with the context in
which an entity is used. These vectors have a fixed length and are stored in the
KB.

The following example shows how to build a knowledge base from scratch, given a
list of entities and potential aliases. The script requires an `nlp` model with
pretrained word vectors to obtain an encoding of an entity's description as its
vector.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/create_kb.py
```
 | 
						||
 | 
						||
#### Step by step guide {#step-by-step-kb}

1. **Load the model** you want to start with. It should contain pretrained word
   vectors.
2. **Obtain the entity embeddings** by running the descriptions of the entities
   through the `nlp` model and taking the average of all word vectors with
   `nlp(desc).vector`. At this point, a custom encoding step can also be used.
3. **Construct the KB** by defining all entities with their embeddings, and all
   aliases with their prior probabilities.
4. **Save** the KB using [`kb.dump`](/api/kb#dump).
5. **Print** the contents of the KB to make sure the entities were added
   correctly.
 | 
						||
 | 
						||
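
The averaging in step 2 is an element-wise mean over the word vectors of a description. The sketch below shows that operation on plain Python lists; `average_vector` is a hypothetical helper and the three-dimensional vectors are made up, whereas in practice `nlp(desc).vector` gives you the result directly.

```python
def average_vector(word_vectors):
    """Element-wise mean of equally sized word vectors."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    length = len(word_vectors[0])
    if any(len(vector) != length for vector in word_vectors):
        raise ValueError("all word vectors must have the same length")
    # zip(*...) groups the values of each dimension together.
    return [sum(column) / len(word_vectors) for column in zip(*word_vectors)]


# Toy vectors for a two-word description (assumed values).
description_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
entity_vector = average_vector(description_vectors)
print(entity_vector)
```

Because every entity vector in the KB has the same fixed length, a custom encoder used in place of the average has to produce vectors of that length as well.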
### Training an entity linking model {#entity-linker-model}

This example shows how to create an entity linker pipe using a previously
created knowledge base. The entity linker is then trained with a set of custom
examples. To do so, you need to provide **example texts**, and the **character
offsets** and **knowledge base identifiers** of each entity contained in the
texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
```
 | 
						||
 | 
						||
#### Step by step guide {#step-by-step-entity-linker}

1. **Load the KB** you want to start with, and specify the path to the `Vocab`
   object that was used to create this KB. Then, create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. Add
   a component for recognizing sentences and one for identifying relevant
   entities. In practical applications, you will want a more advanced pipeline
   that also includes a component for
   [named entity recognition](/usage/training#ner). Then, create a new entity
   linker component, add the KB to it, and then add the entity linker to the
   pipeline. If you're using a model with additional components, make sure to
   disable all other pipeline components during training using
   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
   training the entity linker.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the annotated examples of the input. For each combination of a mention in
   text and a potential KB identifier, the model makes a **prediction** whether
   or not this is the correct match. It then consults the annotations to see
   whether it was right. If it was wrong, it adjusts its weights so that the
   correct combination will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.
 | 
						||
 | 
						||
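
As a rough illustration of the inputs described above (example texts, character offsets and knowledge base identifiers), one plausible shape for a training example is sketched here. The exact annotation layout, the sentence and the identifiers are assumptions for this sketch, and `gold_mentions` is a hypothetical helper; check the example script for the format your spaCy version expects.

```python
# Each example: a text plus, per (start, end) character span, a score for
# every candidate KB identifier (1.0 for the correct one, 0.0 otherwise).
TRAIN_DATA = [
    (
        "Russ Cochran his reprints include EC Comics.",
        {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}},
    ),
]


def gold_mentions(train_data):
    """Hypothetical helper: list (mention text, correct KB identifier)."""
    result = []
    for text, annotations in train_data:
        for (start, end), candidates in annotations["links"].items():
            gold_ids = [kb_id for kb_id, score in candidates.items() if score == 1.0]
            if len(gold_ids) != 1:
                raise ValueError(f"expected one correct ID for {text[start:end]!r}")
            result.append((text[start:end], gold_ids[0]))
    return result


mentions = gold_mentions(TRAIN_DATA)
print(mentions)
```

Slicing the text with the offsets, as the helper does, is a quick way to confirm that each span covers exactly the mention you intended.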
## Optimization tips and advice {#tips}

There are lots of conflicting "recipes" for training deep neural networks at the
moment. The cutting-edge models take a very long time to train, so most
researchers can't run enough experiments to figure out what's _really_ going on.
For what it's worth, here's a recipe that seems to work well on a lot of NLP
problems:

1. Initialize with batch size 1, and compound to a maximum determined by your
   data size and problem type.
2. Use the Adam solver with a fixed learning rate.
3. Use averaged parameters.
4. Use L2 regularization.
5. Clip gradients by L2 norm to 1.
6. On small data sizes, start at a high dropout rate, with linear decay.

This recipe has been cobbled together experimentally. Here's why the various
elements of the recipe made enough sense to try initially, and what you might
try changing, depending on your problem.
 | 
						||
 | 
						||
### Compounding batch size {#tips-batch-size}

The trick of increasing the batch size is starting to become quite popular (see
[Smith et al., 2017](https://arxiv.org/abs/1711.00489)). Their recipe is quite
different from how spaCy's models are being trained, but there are some
similarities. In training the various spaCy models, we haven't found much
advantage from decaying the learning rate – but starting with a low batch size
has definitely helped. You should try it out on your data, and see how you go.
Here's our current strategy:
 | 
						||
 | 
						||
```python
### Batch heuristic
from spacy.util import compounding, minibatch

def get_batches(train_data, model_type):
    # Maximum batch size per component, halved (twice) for small datasets
    max_batch_sizes = {"tagger": 32, "parser": 16, "ner": 16, "textcat": 64}
    max_batch_size = max_batch_sizes[model_type]
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    # Start at 1 and multiply by 1.001 per batch until the maximum is reached
    batch_size = compounding(1, max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches
```
 | 
						||
 | 
						||
This will set the batch size to start at `1`, and increase each batch until it
reaches a maximum size. The tagger, parser and entity recognizer all take whole
sentences as input, so they're learning a lot of labels in a single example. You
therefore need smaller batches for them. The batch size for the text categorizer
should be somewhat larger, especially if your documents are long.
 | 
						||
 | 
						||
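
To see how a compounding schedule turns into actual batches, the following stand-alone sketch re-implements the two ideas in plain Python. The functions `compounding_sizes` and `batches_of` are simplified stand-ins written for this illustration (growth only, no other options), not spaCy's own helpers, and the factor `2.0` is chosen so the growth is visible in a short run.

```python
import itertools


def compounding_sizes(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def batches_of(items, sizes):
    """Split items into batches whose sizes are drawn from the sizes iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


# 20 dummy examples, batch size doubling from 1 up to a maximum of 8.
examples = list(range(20))
batch_lengths = [len(b) for b in batches_of(examples, compounding_sizes(1, 8, 2.0))]
print(batch_lengths)  # [1, 2, 4, 8, 5]
```

With a factor as small as `1.001`, the batch size stays close to 1 for hundreds of updates, which is the point of the heuristic: many small, noisy updates early on, larger batches later.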
### Learning rate, regularization and gradient clipping {#tips-hyperparams}

By default spaCy uses the Adam solver, with default settings
(`learn_rate=0.001`, `beta1=0.9`, `beta2=0.999`). Some researchers have said
they found these settings terrible on their problems – but they've always
performed very well in training spaCy's models, in combination with the rest of
our recipe. You can change these settings directly, by modifying the
corresponding attributes on the `optimizer` object. You can also set environment
variables, to adjust the defaults.

There are two other key hyper-parameters of the solver: `L2` **regularization**,
and **gradient clipping** (`max_grad_norm`). Gradient clipping is a hack that's
not often discussed, but that everybody seems to be using. It's quite important
in helping to ensure the network doesn't diverge, which is a fancy way of saying
"fall over during training". The effect is sort of similar to setting the
learning rate low. It can also compensate for a large batch size (this is a good
example of how the choices of all these hyper-parameters intersect).
 | 
						||
 | 
						||
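
Clipping by L2 norm means rescaling the whole gradient when its length exceeds a threshold, so its direction is preserved. Here is a minimal sketch of that operation on a plain list of floats; `clip_by_l2_norm` is a hypothetical function written for illustration and is not how the optimizer is implemented internally.

```python
import math


def clip_by_l2_norm(gradient, max_norm=1.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]


# A gradient of length 5 is scaled down to length 1; its direction is kept.
clipped = clip_by_l2_norm([3.0, 4.0], max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

This also shows why clipping behaves a bit like a lower learning rate: a large gradient produces a smaller step than it otherwise would, while small gradients are left untouched.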
### Dropout rate {#tips-dropout}

For small datasets, it's useful to set a **high dropout rate at first**, and
**decay** it down towards a more reasonable value. This helps avoid the network
immediately overfitting, while still encouraging it to learn some of the more
interesting things in your data. spaCy comes with a
[`decaying`](/api/top-level#util.decaying) utility function to facilitate this.
You might try setting:

```python
from spacy.util import decaying
dropout = decaying(0.6, 0.2, 1e-4)
```

You can then draw values from the iterator with `next(dropout)`, which you would
pass to the `drop` keyword argument of [`nlp.update`](/api/language#update).
It's pretty much always a good idea to use at least **some dropout**. All of the
models currently use Bernoulli dropout, for no particularly principled reason –
we just haven't experimented with another scheme like Gaussian dropout yet.
 | 
						||
 | 
						||
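
If you want a feel for how such a schedule behaves, the sketch below implements a simple linear decay in plain Python. The `linear_decay` generator is an assumption made for illustration: it subtracts a fixed amount per draw and never goes below the floor, which may differ in detail from the built-in helper.

```python
def linear_decay(start, stop, decay):
    """Yield values that fall from start by `decay` per draw, floored at stop."""
    curr = float(start)
    while True:
        yield max(curr, stop)
        curr -= decay


dropout = linear_decay(0.6, 0.2, 1e-4)
# Draw one rate per update, as you would before each call to nlp.update.
rates = [next(dropout) for _ in range(5000)]
print(rates[0], rates[-1])
```

With these settings the rate takes roughly 4000 updates to fall from `0.6` to the floor of `0.2`, so on a very small dataset you may want a larger decay step.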
### Parameter averaging {#tips-param-avg}

The last part of our optimization recipe is **parameter averaging**, an old
trick introduced by
[Freund and Schapire (1999)](https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf),
popularized in the NLP community by
[Collins (2002)](http://www.aclweb.org/anthology/P04-1015), and explained in
more detail by [Leon Bottou](http://leon.bottou.org/projects/sgd). Just about
the only other people who seem to be using this for neural network training are
the SyntaxNet team (one of whom is Michael Collins) – but it really seems to
work great on every problem.

The trick is to store the moving average of the weights during training. We
don't optimize this average – we just track it. Then when we want to actually
use the model, we use the averages, not the most recent value. In spaCy (and
[Thinc](https://github.com/explosion/thinc)) this is done by using a context
manager, [`use_params`](/api/language#use_params), to temporarily replace the
weights:

```python
with nlp.use_params(optimizer.averages):
    nlp.to_disk("/model")
```

The context manager is handy because you naturally want to evaluate and save the
model at various points during training (e.g. after each epoch). After
evaluating and saving, the context manager will exit and the weights will be
restored, so you resume training from the most recent value, rather than the
average. By evaluating the model after each epoch, you can remove one
hyper-parameter from consideration (the number of epochs). Having one less magic
number to guess is extremely nice – so having the averaging under a context
manager is very convenient.
 |
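
The mechanics of "track the average, swap it in temporarily, then restore" can be sketched in a few lines of plain Python. Everything here is a toy written for illustration: the `AveragedWeights` class and its `use_averages` method are hypothetical, and a plain running mean stands in for whatever moving average the optimizer actually keeps.

```python
from contextlib import contextmanager


class AveragedWeights:
    """Toy model that tracks a running mean of its weights during training."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.averages = list(weights)
        self.updates = 0

    def step(self, gradient, learn_rate=0.1):
        """Plain gradient step, followed by an update of the running mean."""
        self.weights = [w - learn_rate * g for w, g in zip(self.weights, gradient)]
        self.updates += 1
        self.averages = [
            a + (w - a) / self.updates for a, w in zip(self.averages, self.weights)
        ]

    @contextmanager
    def use_averages(self):
        """Temporarily replace the weights with their averages."""
        current = self.weights
        self.weights = list(self.averages)
        try:
            yield self
        finally:
            self.weights = current  # resume training from the latest weights


model = AveragedWeights([1.0])
model.step([1.0], learn_rate=0.5)  # weights -> [0.5]
model.step([1.0], learn_rate=0.5)  # weights -> [0.0]
with model.use_averages():
    inside = list(model.weights)   # averaged weights, used for evaluating/saving
after = list(model.weights)        # most recent weights, used to keep training
print(inside, after)
```

The `try`/`finally` in the context manager is what guarantees that training resumes from the most recent weights even if evaluation or saving raises an error.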