Document debug-data [ci skip]

Ines Montani 2019-09-12 15:26:20 +02:00
parent 05a2df6616
commit b544dcb3c5
3 changed files with 252 additions and 75 deletions

@@ -47,6 +47,11 @@ def debug_data(
    verbose=False,
    no_format=False,
):
    """
    Analyze, debug and validate your training and development data, get useful
    stats, and find problems like invalid entity annotations, cyclic
    dependencies, low data labels and more.
    """
    msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
    # Make sure all files and paths exist if they are needed

@@ -8,6 +8,7 @@ menu:
  - ['Info', 'info']
  - ['Validate', 'validate']
  - ['Convert', 'convert']
  - ['Debug data', 'debug-data']
  - ['Train', 'train']
  - ['Pretrain', 'pretrain']
  - ['Init Model', 'init-model']
@@ -175,12 +176,172 @@ All output files generated by this command are compatible with
<!-- TODO: document jsonl option maybe update it? -->
| ID | Description |
| ------------------------------ | --------------------------------------------------------------- |
| `auto` | Automatically pick converter based on file extension and file content (default). |
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
## Debug data {#debug-data new="2.2"}
Analyze, debug and validate your training and development data, get useful
stats, and find problems like invalid entity annotations, cyclic dependencies,
low data labels and more.
```bash
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
```
| Argument | Type | Description |
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
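For example, to analyze hypothetical `train.json` and `dev.json` files for an
English pipeline and redirect the unformatted report to a file (the paths here
are placeholders):
```bash
$ python -m spacy debug-data en train.json dev.json --pipeline tagger,parser,ner --no-format > debug-report.txt
```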
<Accordion title="Example output">
```
=========================== Data format validation ===========================
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'en'
18127 training docs
2939 evaluation docs
⚠ 34 training examples also in evaluation data
============================== Vocab & Vectors ==============================
2083156 total words in the data (56962 unique)
⚠ 13020 misaligned tokens in the training data
⚠ 2423 misaligned tokens in the dev data
10 most common words: 'the' (98429), ',' (91756), '.' (87073), 'to' (50058),
'of' (49559), 'and' (44416), 'a' (34010), 'in' (31424), 'that' (22792), 'is'
(18952)
No word vectors present in the model
========================== Named Entity Recognition ==========================
18 new labels, 0 existing labels
528978 missing values (tokens with '-' label)
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
49 labels in data (57 labels in tag map)
'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830),
'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB'
(74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN'
(42193), 'CD' (40326), 'VBG' (34764), 'TO' (31085), 'MD' (25863), 'PRP$'
(23335), 'HYPH' (13833), 'POS' (13427), 'UH' (13322), 'WP' (10423), 'WDT'
(9850), 'RP' (8230), 'WRB' (8201), ':' (8168), '''' (7392), '``' (6984), 'NNPS'
(5817), 'JJR' (5689), '$' (3710), 'EX' (3465), 'JJS' (3118), 'RBR' (2872),
'-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW'
(794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX'
(24)
✔ All labels present in tag map for language 'en'
============================= Dependency Parsing =============================
Found 111703 sentences with an average length of 18.6 words.
Found 2251 nonprojective train sentences
Found 303 nonprojective dev sentences
47 labels in train data
211 labels in projectivized train data
'punct' (236796), 'prep' (188853), 'pobj' (182533), 'det' (172674), 'nsubj'
(169481), 'compound' (116142), 'ROOT' (111697), 'amod' (107945), 'dobj' (93540),
'aux' (86802), 'advmod' (86197), 'cc' (62679), 'conj' (59575), 'poss' (36449),
'ccomp' (36343), 'advcl' (29017), 'mark' (27990), 'nummod' (24582), 'relcl'
(21359), 'xcomp' (21081), 'attr' (18347), 'npadvmod' (17740), 'acomp' (17204),
'auxpass' (15639), 'appos' (15368), 'neg' (15266), 'nsubjpass' (13922), 'case'
(13408), 'acl' (12574), 'pcomp' (10340), 'nmod' (9736), 'intj' (9285), 'prt'
(8196), 'quantmod' (7403), 'dep' (4300), 'dative' (4091), 'agent' (3908), 'expl'
(3456), 'parataxis' (3099), 'oprd' (2326), 'predet' (1946), 'csubj' (1494),
'subtok' (1147), 'preconj' (692), 'meta' (469), 'csubjpass' (64), 'iobj' (1)
⚠ Low number of examples for label 'iobj' (1)
⚠ Low number of examples for 130 labels in the projectivized dependency
trees used for training. You may want to projectivize labels such as punct
before training in order to improve parser performance.
⚠ Projectivized labels with low numbers of examples: appos||attr: 12
advmod||dobj: 13 prep||ccomp: 12 nsubjpass||ccomp: 15 pcomp||prep: 14
amod||dobj: 9 attr||xcomp: 14 nmod||nsubj: 17 prep||advcl: 2 prep||prep: 5
nsubj||conj: 12 advcl||advmod: 18 ccomp||advmod: 11 ccomp||pcomp: 5 acl||pobj:
10 npadvmod||acomp: 7 dobj||pcomp: 14 nsubjpass||pcomp: 1 nmod||pobj: 8
amod||attr: 6 nmod||dobj: 12 aux||conj: 1 neg||conj: 1 dative||xcomp: 11
pobj||dative: 3 xcomp||acomp: 19 advcl||pobj: 2 nsubj||advcl: 2 csubj||ccomp: 1
advcl||acl: 1 relcl||nmod: 2 dobj||advcl: 10 advmod||advcl: 3 nmod||nsubjpass: 6
amod||pobj: 5 cc||neg: 1 attr||ccomp: 16 advcl||xcomp: 3 nmod||attr: 4
advcl||nsubjpass: 5 advcl||ccomp: 4 ccomp||conj: 1 punct||acl: 1 meta||acl: 1
parataxis||acl: 1 prep||acl: 1 amod||nsubj: 7 ccomp||ccomp: 3 acomp||xcomp: 5
dobj||acl: 5 prep||oprd: 6 advmod||acl: 2 dative||advcl: 1 pobj||agent: 5
xcomp||amod: 1 dep||advcl: 1 prep||amod: 8 relcl||compound: 1 advcl||csubj: 3
npadvmod||conj: 2 npadvmod||xcomp: 4 advmod||nsubj: 3 ccomp||amod: 7
advcl||conj: 1 nmod||conj: 2 advmod||nsubjpass: 2 dep||xcomp: 2 appos||ccomp: 1
advmod||dep: 1 advmod||advmod: 5 aux||xcomp: 8 dep||advmod: 1 dative||ccomp: 2
prep||dep: 1 conj||conj: 1 dep||ccomp: 4 cc||ROOT: 1 prep||ROOT: 1 nsubj||pcomp:
3 advmod||prep: 2 relcl||dative: 1 acl||conj: 1 advcl||attr: 4 prep||npadvmod: 1
nsubjpass||xcomp: 1 neg||advmod: 1 xcomp||oprd: 1 advcl||advcl: 1 dobj||dep: 3
nsubjpass||parataxis: 1 attr||pcomp: 1 ccomp||parataxis: 1 advmod||attr: 1
nmod||oprd: 1 appos||nmod: 2 advmod||relcl: 1 appos||npadvmod: 1 appos||conj: 1
prep||expl: 1 nsubjpass||conj: 1 punct||pobj: 1 cc||pobj: 1 conj||pobj: 1
punct||conj: 1 ccomp||dep: 1 oprd||xcomp: 3 ccomp||xcomp: 1 ccomp||nsubj: 1
nmod||dep: 1 xcomp||ccomp: 1 acomp||advcl: 1 intj||advmod: 1 advmod||acomp: 2
relcl||oprd: 1 advmod||prt: 1 advmod||pobj: 1 appos||nummod: 1 relcl||npadvmod:
3 mark||advcl: 1 aux||ccomp: 1 amod||nsubjpass: 1 npadvmod||advmod: 1 conj||dep:
1 nummod||pobj: 1 amod||npadvmod: 1 intj||pobj: 1 nummod||npadvmod: 1
xcomp||xcomp: 1 aux||dep: 1 advcl||relcl: 1
⚠ The following labels were found only in the train data: xcomp||amod,
advcl||relcl, prep||nsubjpass, acl||nsubj, nsubjpass||conj, xcomp||oprd,
advmod||conj, advmod||advmod, iobj, advmod||nsubjpass, dobj||conj, ccomp||amod,
meta||acl, xcomp||xcomp, prep||attr, prep||ccomp, advcl||acomp, acl||dobj,
advcl||advcl, pobj||agent, prep||advcl, nsubjpass||xcomp, prep||dep,
acomp||xcomp, aux||ccomp, ccomp||dep, conj||dep, relcl||compound,
nsubjpass||ccomp, nmod||dobj, advmod||advcl, advmod||acl, dobj||advcl,
dative||xcomp, prep||nsubj, ccomp||ccomp, nsubj||ccomp, xcomp||acomp,
prep||acomp, dep||advmod, acl||pobj, appos||dobj, npadvmod||acomp, cc||ROOT,
relcl||nsubj, nmod||pobj, acl||nsubjpass, ccomp||advmod, pcomp||prep,
amod||dobj, advmod||attr, advcl||csubj, appos||attr, dobj||pcomp, prep||ROOT,
relcl||pobj, advmod||pobj, amod||nsubj, ccomp||xcomp, prep||oprd,
npadvmod||advmod, appos||nummod, advcl||pobj, neg||advmod, acl||attr,
appos||nsubjpass, csubj||ccomp, amod||nsubjpass, intj||pobj, dep||advcl,
cc||neg, xcomp||ccomp, dative||ccomp, nmod||oprd, pobj||dative, prep||dobj,
dep||ccomp, relcl||attr, ccomp||nsubj, advcl||xcomp, nmod||dep, advcl||advmod,
ccomp||conj, pobj||prep, advmod||acomp, advmod||relcl, attr||pcomp,
ccomp||parataxis, oprd||xcomp, intj||advmod, nmod||nsubjpass, prep||npadvmod,
parataxis||acl, prep||pobj, advcl||dobj, amod||pobj, prep||acl, conj||pobj,
advmod||dep, punct||pobj, ccomp||acomp, acomp||advcl, nummod||npadvmod,
dobj||dep, npadvmod||xcomp, advcl||conj, relcl||npadvmod, punct||acl,
relcl||dobj, dobj||xcomp, nsubjpass||parataxis, dative||advcl, relcl||nmod,
advcl||ccomp, appos||npadvmod, ccomp||pcomp, prep||amod, mark||advcl,
prep||advmod, prep||xcomp, appos||nsubj, attr||ccomp, advmod||prt, dobj||ccomp,
aux||conj, advcl||nsubj, conj||conj, advmod||ccomp, advcl||nsubjpass,
attr||xcomp, nmod||conj, npadvmod||conj, relcl||dative, prep||expl,
nsubjpass||pcomp, advmod||xcomp, advmod||dobj, appos||pobj, nsubj||conj,
relcl||nsubjpass, advcl||attr, appos||ccomp, advmod||prep, prep||conj,
nmod||attr, punct||conj, neg||conj, dep||xcomp, aux||xcomp, dobj||acl,
nummod||pobj, amod||npadvmod, nsubj||pcomp, advcl||acl, appos||nmod,
relcl||oprd, prep||prep, cc||pobj, nmod||nsubj, amod||attr, aux||dep,
appos||conj, advmod||nsubj, nsubj||advcl, acl||conj
To train a parser, your data should include at least 20 instances of each label.
⚠ Multiple root labels (ROOT, nsubj, aux, npadvmod, prep) found in
training data. spaCy's parser uses a single root label ROOT so this distinction
will not be available.
================================== Summary ==================================
✔ 5 checks passed
⚠ 8 warnings
```
</Accordion>
## Train {#train}
Train a model. Expects data in spaCy's
@@ -292,7 +453,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
```
| Argument | Type | Description |
| --------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. |
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
| `output_dir` | positional | Directory to write models to on each epoch. |
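As an illustration, a hypothetical invocation might pretrain on a JSONL corpus
using the vectors from `en_vectors_web_lg` (the paths are placeholders):
```bash
$ python -m spacy pretrain ./raw_texts.jsonl en_vectors_web_lg ./pretrained-model
```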
@@ -331,7 +492,7 @@ tokenization can be provided.
> ```
| Key | Type | Description |
| -------- | ------- | ----------------------------------------------------------- |
| `text` | unicode | The raw input text. Not required if `tokens` is available. |
| `tokens` | list | Optional tokenization, one string per token. |
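For illustration, a JSONL file mixing both styles could contain lines like
these (contents invented for this example):
```json
{"text": "Can I ask where you work now and what you do?"}
{"tokens": ["Rats", "make", "good", "pets"]}
```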
@@ -424,7 +585,7 @@ pip install dist/en_model-0.0.0.tar.gz
| `input_dir` | positional | Path to directory containing model data. |
| `output_dir` | positional | Directory to create package folder in. |
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. |
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | directory | A Python package containing the spaCy model. |
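For example, a hypothetical run that builds a package folder and regenerates
its `meta.json` on the command line (the paths are placeholders):
```bash
$ python -m spacy package ./model-best ./packages --create-meta --force
```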

@@ -10,9 +10,9 @@ menu:
---
This guide describes how to train new statistical models for spaCy's
part-of-speech tagger, named entity recognizer, dependency parser, text
classifier and entity linker. Once the model is trained, you can then
[save and load](/usage/saving-loading#models) it.
## Training basics {#basics}
@@ -40,6 +40,19 @@ mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```
<Infobox title="Tip: Debug your data">
If you're running spaCy v2.2 or above, you can use the
[`debug-data` command](/api/cli#debug-data) to analyze and validate your
training and development data, get useful stats, and find problems like invalid
entity annotations, cyclic dependencies, low data labels and more.
```bash
$ python -m spacy debug-data en train.json dev.json --verbose
```
</Infobox>
You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper
to convert a list of `Doc` objects to spaCy's JSON training format.
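A minimal sketch of that conversion (the model name and output path are
placeholders; `srsly` is the serialization library that ships with spaCy):
```python
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
docs = [nlp("Berlin is a city in Germany.")]

# Convert the Doc objects to spaCy's JSON training format and write the file
json_data = docs_to_json(docs)
srsly.write_json("./train.json", [json_data])
```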
@@ -223,10 +236,9 @@ of being dropped.
> - [`begin_training()`](/api/language#begin_training): Start the training and
>   return an optimizer function to update the model's weights. Can take an
>   optional function converting the training data to spaCy's training format.
> - [`update()`](/api/language#update): Update the model with the training
>   example and gold data.
> - [`to_disk()`](/api/language#to_disk): Save the updated model to a directory.
```python
### Example training loop
@@ -405,19 +417,20 @@ referred to as the "catastrophic forgetting" problem.
## Entity linking {#entity-linker}
To train an entity linking model, you first need to define a knowledge base
(KB).
### Creating a knowledge base {#kb}
A KB consists of a list of entities with unique identifiers. Each such entity
has an entity vector that will be used to measure similarity with the context
in which an entity is used. These vectors are pretrained and stored in the KB
before the entity linking model is trained.
The following example shows how to build a knowledge base from scratch, given a
list of entities and potential aliases. The script further demonstrates how to
pretrain and store the entity vectors. To run this example, the script needs
access to a `vocab` instance or an `nlp` model with pretrained word embeddings.
```python
https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
@@ -428,10 +441,10 @@ https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
   a pre-defined [`vocab`](/api/vocab) object.
2. **Pretrain the entity embeddings** by running the descriptions of the
   entities through a simple encoder-decoder network. The current implementation
   requires the `nlp` model to have access to pre-trained word embeddings, but a
   custom implementation of this encoding step can also be used.
3. **Construct the KB** by defining all entities with their pretrained vectors,
   and all aliases with their prior probabilities.
4. **Save** the KB using [`kb.dump`](/api/kb#dump), as sketched below.
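A condensed sketch of these four steps, assuming a vectors model such as
`en_core_web_lg` and reusing the raw description vector in place of the
encoder-decoder pretraining step (the entity ID, frequency, alias and paths are
placeholders):
```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_lg")  # any model with pretrained word vectors

# The entity vector length must match the vectors stored below (300 for lg)
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

# Add an entity with a "pretrained" vector; here we simply encode its description
desc_vector = nlp("Douglas Adams, English author and humorist").vector
kb.add_entity(entity="Q42", freq=12, entity_vector=desc_vector)

# Register an alias with prior probabilities over its candidate entities
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

# Save the KB, plus the vocab it was created with
kb.dump("./kb")
nlp.vocab.to_disk("./vocab")
```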
@@ -439,11 +452,11 @@ https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
### Training an entity linking model {#entity-linker-model}
This example shows how to create an entity linker pipe using a previously
created knowledge base. The entity linker pipe is then trained with your own
examples. To do so, you'll need to provide **example texts**, and the
**character offsets** and **knowledge base identifiers** of each entity
contained in the texts.
```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
@@ -451,25 +464,23 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_li
#### Step by step guide {#step-by-step-entity-linker}
1. **Load the KB** you want to start with, and specify the path to the `Vocab`
   object that was used to create this KB. Then, create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
   Don't forget to add the KB to the entity linker, and to add the entity linker
   to the pipeline. In practical applications, you will want a more advanced
   pipeline that also includes a component for
   [named entity recognition](/usage/training#ner). If you're using a model with
   additional components, make sure to disable all other pipeline components
   during training using [`nlp.disable_pipes`](/api/language#disable_pipes).
   This way, you'll only be training the entity linker.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the annotated examples of the input. For each combination of a mention in
   text and a potential KB identifier, the model makes a **prediction** whether
   or not this is the correct match. It then consults the annotations to see
   whether it was right. If it was wrong, it adjusts its weights so that the
   correct combination will score higher next time (see the sketch after this
   list).
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.
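The following is a minimal sketch of those steps, condensed from the
`train_entity_linker.py` example above; the paths, the single training example
and the entity ID are placeholders, and a real setup would train on many more
examples:
```python
import random
import spacy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

# Load the vocab the KB was created with and build a blank model around it
vocab = Vocab().from_disk("./vocab")
nlp = spacy.blank("en", vocab=vocab)
nlp.vocab.vectors.name = "spacy_pretrained_vectors"  # vectors must be named for training

# Create the entity linker, give it the KB and add it to the pipeline
entity_linker = nlp.create_pipe("entity_linker")
kb = KnowledgeBase(vocab=nlp.vocab)
kb.load_bulk("./kb")
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

# Example texts with character offsets and KB identifiers, 1.0 = correct match
TRAIN_DATA = [
    ("Douglas Adams wrote trilogies.", {"links": {(0, 13): {"Q42": 1.0}}}),
]

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], drop=0.2, sgd=optimizer, losses=losses)
    print(itn, losses)
```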