mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Document debug-data [ci skip]
parent
05a2df6616
commit
b544dcb3c5
|
@ -47,6 +47,11 @@ def debug_data(
|
||||||
verbose=False,
|
verbose=False,
|
||||||
no_format=False,
|
no_format=False,
|
||||||
):
|
):
|
||||||
|
"""
|
||||||
|
Analyze, debug and validate your training and development data, get useful
|
||||||
|
stats, and find problems like invalid entity annotations, cyclic
|
||||||
|
dependencies, low data labels and more.
|
||||||
|
"""
|
||||||
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
|
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
|
||||||
|
|
||||||
# Make sure all files and paths exist if they are needed
|
# Make sure all files and paths exist if they are needed
|
||||||
|
|
|
@ -8,6 +8,7 @@ menu:
|
||||||
- ['Info', 'info']
|
- ['Info', 'info']
|
||||||
- ['Validate', 'validate']
|
- ['Validate', 'validate']
|
||||||
- ['Convert', 'convert']
|
- ['Convert', 'convert']
|
||||||
|
- ['Debug data', 'debug-data']
|
||||||
- ['Train', 'train']
|
- ['Train', 'train']
|
||||||
- ['Pretrain', 'pretrain']
|
- ['Pretrain', 'pretrain']
|
||||||
- ['Init Model', 'init-model']
|
- ['Init Model', 'init-model']
|
||||||
|
@ -175,12 +176,172 @@ All output files generated by this command are compatible with
|
||||||
<!-- TODO: document jsonl option – maybe update it? -->
|
<!-- TODO: document jsonl option – maybe update it? -->
|
||||||
|
|
||||||
| ID | Description |
|
| ID | Description |
|
||||||
| ------------------------------ | --------------------------------------------------------------- |
|
| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `auto` | Automatically pick converter based on file extension and file content (default). |
|
| `auto` | Automatically pick converter based on file extension and file content (default). |
|
||||||
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
|
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||||
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||||
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||||
|
|
||||||
|
## Debug data {#debug-data new="2.2"}

Analyze, debug and validate your training and development data, get useful
stats, and find problems like invalid entity annotations, cyclic dependencies,
low data labels and more.

```bash
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
```

| Argument | Type | Description |
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
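
The arguments in the table above can also be assembled from a script, for example to save an unformatted report to a file. The following is only a sketch: the helper names `build_debug_data_command` and `write_report` are hypothetical and not part of spaCy, and actually running the command assumes spaCy v2.2 or above is installed.

```python
import subprocess
import sys


def build_debug_data_command(lang, train_path, dev_path, base_model=None,
                             pipeline=None, ignore_warnings=False,
                             verbose=False, no_format=False):
    """Assemble the argument list for `python -m spacy debug-data`.

    Hypothetical helper: it only mirrors the options documented in the table.
    """
    cmd = [sys.executable, "-m", "spacy", "debug-data",
           lang, str(train_path), str(dev_path)]
    if base_model:
        cmd += ["--base-model", base_model]
    if pipeline:
        # The CLI expects comma-separated component names
        cmd += ["--pipeline", ",".join(pipeline)]
    if ignore_warnings:
        cmd.append("--ignore-warnings")
    if verbose:
        cmd.append("--verbose")
    if no_format:
        cmd.append("--no-format")
    return cmd


def write_report(cmd, report_path):
    """Run the command and write its stdout to a file (needs spaCy installed)."""
    with open(report_path, "w", encoding="utf8") as report:
        return subprocess.run(cmd, stdout=report, check=False).returncode


# Only build and show the command here; nothing is executed.
example_cmd = build_debug_data_command(
    "en", "train.json", "dev.json", verbose=True, no_format=True
)
print(" ".join(example_cmd))
```

Passing `no_format=True` follows the table's advice to disable pretty-printing when the output goes to a file.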
|
||||||
|
|
||||||
|
<Accordion title="Example output">
|
||||||
|
|
||||||
|
```
|
||||||
|
=========================== Data format validation ===========================
|
||||||
|
✔ Corpus is loadable
|
||||||
|
|
||||||
|
=============================== Training stats ===============================
|
||||||
|
Training pipeline: tagger, parser, ner
|
||||||
|
Starting with blank model 'en'
|
||||||
|
18127 training docs
|
||||||
|
2939 evaluation docs
|
||||||
|
⚠ 34 training examples also in evaluation data
|
||||||
|
|
||||||
|
============================== Vocab & Vectors ==============================
|
||||||
|
ℹ 2083156 total words in the data (56962 unique)
|
||||||
|
⚠ 13020 misaligned tokens in the training data
|
||||||
|
⚠ 2423 misaligned tokens in the dev data
|
||||||
|
10 most common words: 'the' (98429), ',' (91756), '.' (87073), 'to' (50058),
|
||||||
|
'of' (49559), 'and' (44416), 'a' (34010), 'in' (31424), 'that' (22792), 'is'
|
||||||
|
(18952)
|
||||||
|
ℹ No word vectors present in the model
|
||||||
|
|
||||||
|
========================== Named Entity Recognition ==========================
|
||||||
|
ℹ 18 new labels, 0 existing labels
|
||||||
|
528978 missing values (tokens with '-' label)
|
||||||
|
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
|
||||||
|
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
|
||||||
|
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
|
||||||
|
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
|
||||||
|
✔ Good amount of examples for all labels
|
||||||
|
✔ Examples without occurences available for all labels
|
||||||
|
✔ No entities consisting of or starting/ending with whitespace
|
||||||
|
|
||||||
|
=========================== Part-of-speech Tagging ===========================
|
||||||
|
ℹ 49 labels in data (57 labels in tag map)
|
||||||
|
'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830),
|
||||||
|
'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB'
|
||||||
|
(74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN'
|
||||||
|
(42193), 'CD' (40326), 'VBG' (34764), 'TO' (31085), 'MD' (25863), 'PRP$'
|
||||||
|
(23335), 'HYPH' (13833), 'POS' (13427), 'UH' (13322), 'WP' (10423), 'WDT'
|
||||||
|
(9850), 'RP' (8230), 'WRB' (8201), ':' (8168), '''' (7392), '``' (6984), 'NNPS'
|
||||||
|
(5817), 'JJR' (5689), '$' (3710), 'EX' (3465), 'JJS' (3118), 'RBR' (2872),
|
||||||
|
'-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW'
|
||||||
|
(794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX'
|
||||||
|
(24)
|
||||||
|
✔ All labels present in tag map for language 'en'
|
||||||
|
|
||||||
|
============================= Dependency Parsing =============================
|
||||||
|
ℹ Found 111703 sentences with an average length of 18.6 words.
|
||||||
|
ℹ Found 2251 nonprojective train sentences
|
||||||
|
ℹ Found 303 nonprojective dev sentences
|
||||||
|
ℹ 47 labels in train data
|
||||||
|
ℹ 211 labels in projectivized train data
|
||||||
|
'punct' (236796), 'prep' (188853), 'pobj' (182533), 'det' (172674), 'nsubj'
|
||||||
|
(169481), 'compound' (116142), 'ROOT' (111697), 'amod' (107945), 'dobj' (93540),
|
||||||
|
'aux' (86802), 'advmod' (86197), 'cc' (62679), 'conj' (59575), 'poss' (36449),
|
||||||
|
'ccomp' (36343), 'advcl' (29017), 'mark' (27990), 'nummod' (24582), 'relcl'
|
||||||
|
(21359), 'xcomp' (21081), 'attr' (18347), 'npadvmod' (17740), 'acomp' (17204),
|
||||||
|
'auxpass' (15639), 'appos' (15368), 'neg' (15266), 'nsubjpass' (13922), 'case'
|
||||||
|
(13408), 'acl' (12574), 'pcomp' (10340), 'nmod' (9736), 'intj' (9285), 'prt'
|
||||||
|
(8196), 'quantmod' (7403), 'dep' (4300), 'dative' (4091), 'agent' (3908), 'expl'
|
||||||
|
(3456), 'parataxis' (3099), 'oprd' (2326), 'predet' (1946), 'csubj' (1494),
|
||||||
|
'subtok' (1147), 'preconj' (692), 'meta' (469), 'csubjpass' (64), 'iobj' (1)
|
||||||
|
⚠ Low number of examples for label 'iobj' (1)
|
||||||
|
⚠ Low number of examples for 130 labels in the projectivized dependency
|
||||||
|
trees used for training. You may want to projectivize labels such as punct
|
||||||
|
before training in order to improve parser performance.
|
||||||
|
⚠ Projectivized labels with low numbers of examples: appos||attr: 12
|
||||||
|
advmod||dobj: 13 prep||ccomp: 12 nsubjpass||ccomp: 15 pcomp||prep: 14
|
||||||
|
amod||dobj: 9 attr||xcomp: 14 nmod||nsubj: 17 prep||advcl: 2 prep||prep: 5
|
||||||
|
nsubj||conj: 12 advcl||advmod: 18 ccomp||advmod: 11 ccomp||pcomp: 5 acl||pobj:
|
||||||
|
10 npadvmod||acomp: 7 dobj||pcomp: 14 nsubjpass||pcomp: 1 nmod||pobj: 8
|
||||||
|
amod||attr: 6 nmod||dobj: 12 aux||conj: 1 neg||conj: 1 dative||xcomp: 11
|
||||||
|
pobj||dative: 3 xcomp||acomp: 19 advcl||pobj: 2 nsubj||advcl: 2 csubj||ccomp: 1
|
||||||
|
advcl||acl: 1 relcl||nmod: 2 dobj||advcl: 10 advmod||advcl: 3 nmod||nsubjpass: 6
|
||||||
|
amod||pobj: 5 cc||neg: 1 attr||ccomp: 16 advcl||xcomp: 3 nmod||attr: 4
|
||||||
|
advcl||nsubjpass: 5 advcl||ccomp: 4 ccomp||conj: 1 punct||acl: 1 meta||acl: 1
|
||||||
|
parataxis||acl: 1 prep||acl: 1 amod||nsubj: 7 ccomp||ccomp: 3 acomp||xcomp: 5
|
||||||
|
dobj||acl: 5 prep||oprd: 6 advmod||acl: 2 dative||advcl: 1 pobj||agent: 5
|
||||||
|
xcomp||amod: 1 dep||advcl: 1 prep||amod: 8 relcl||compound: 1 advcl||csubj: 3
|
||||||
|
npadvmod||conj: 2 npadvmod||xcomp: 4 advmod||nsubj: 3 ccomp||amod: 7
|
||||||
|
advcl||conj: 1 nmod||conj: 2 advmod||nsubjpass: 2 dep||xcomp: 2 appos||ccomp: 1
|
||||||
|
advmod||dep: 1 advmod||advmod: 5 aux||xcomp: 8 dep||advmod: 1 dative||ccomp: 2
|
||||||
|
prep||dep: 1 conj||conj: 1 dep||ccomp: 4 cc||ROOT: 1 prep||ROOT: 1 nsubj||pcomp:
|
||||||
|
3 advmod||prep: 2 relcl||dative: 1 acl||conj: 1 advcl||attr: 4 prep||npadvmod: 1
|
||||||
|
nsubjpass||xcomp: 1 neg||advmod: 1 xcomp||oprd: 1 advcl||advcl: 1 dobj||dep: 3
|
||||||
|
nsubjpass||parataxis: 1 attr||pcomp: 1 ccomp||parataxis: 1 advmod||attr: 1
|
||||||
|
nmod||oprd: 1 appos||nmod: 2 advmod||relcl: 1 appos||npadvmod: 1 appos||conj: 1
|
||||||
|
prep||expl: 1 nsubjpass||conj: 1 punct||pobj: 1 cc||pobj: 1 conj||pobj: 1
|
||||||
|
punct||conj: 1 ccomp||dep: 1 oprd||xcomp: 3 ccomp||xcomp: 1 ccomp||nsubj: 1
|
||||||
|
nmod||dep: 1 xcomp||ccomp: 1 acomp||advcl: 1 intj||advmod: 1 advmod||acomp: 2
|
||||||
|
relcl||oprd: 1 advmod||prt: 1 advmod||pobj: 1 appos||nummod: 1 relcl||npadvmod:
|
||||||
|
3 mark||advcl: 1 aux||ccomp: 1 amod||nsubjpass: 1 npadvmod||advmod: 1 conj||dep:
|
||||||
|
1 nummod||pobj: 1 amod||npadvmod: 1 intj||pobj: 1 nummod||npadvmod: 1
|
||||||
|
xcomp||xcomp: 1 aux||dep: 1 advcl||relcl: 1
|
||||||
|
⚠ The following labels were found only in the train data: xcomp||amod,
|
||||||
|
advcl||relcl, prep||nsubjpass, acl||nsubj, nsubjpass||conj, xcomp||oprd,
|
||||||
|
advmod||conj, advmod||advmod, iobj, advmod||nsubjpass, dobj||conj, ccomp||amod,
|
||||||
|
meta||acl, xcomp||xcomp, prep||attr, prep||ccomp, advcl||acomp, acl||dobj,
|
||||||
|
advcl||advcl, pobj||agent, prep||advcl, nsubjpass||xcomp, prep||dep,
|
||||||
|
acomp||xcomp, aux||ccomp, ccomp||dep, conj||dep, relcl||compound,
|
||||||
|
nsubjpass||ccomp, nmod||dobj, advmod||advcl, advmod||acl, dobj||advcl,
|
||||||
|
dative||xcomp, prep||nsubj, ccomp||ccomp, nsubj||ccomp, xcomp||acomp,
|
||||||
|
prep||acomp, dep||advmod, acl||pobj, appos||dobj, npadvmod||acomp, cc||ROOT,
|
||||||
|
relcl||nsubj, nmod||pobj, acl||nsubjpass, ccomp||advmod, pcomp||prep,
|
||||||
|
amod||dobj, advmod||attr, advcl||csubj, appos||attr, dobj||pcomp, prep||ROOT,
|
||||||
|
relcl||pobj, advmod||pobj, amod||nsubj, ccomp||xcomp, prep||oprd,
|
||||||
|
npadvmod||advmod, appos||nummod, advcl||pobj, neg||advmod, acl||attr,
|
||||||
|
appos||nsubjpass, csubj||ccomp, amod||nsubjpass, intj||pobj, dep||advcl,
|
||||||
|
cc||neg, xcomp||ccomp, dative||ccomp, nmod||oprd, pobj||dative, prep||dobj,
|
||||||
|
dep||ccomp, relcl||attr, ccomp||nsubj, advcl||xcomp, nmod||dep, advcl||advmod,
|
||||||
|
ccomp||conj, pobj||prep, advmod||acomp, advmod||relcl, attr||pcomp,
|
||||||
|
ccomp||parataxis, oprd||xcomp, intj||advmod, nmod||nsubjpass, prep||npadvmod,
|
||||||
|
parataxis||acl, prep||pobj, advcl||dobj, amod||pobj, prep||acl, conj||pobj,
|
||||||
|
advmod||dep, punct||pobj, ccomp||acomp, acomp||advcl, nummod||npadvmod,
|
||||||
|
dobj||dep, npadvmod||xcomp, advcl||conj, relcl||npadvmod, punct||acl,
|
||||||
|
relcl||dobj, dobj||xcomp, nsubjpass||parataxis, dative||advcl, relcl||nmod,
|
||||||
|
advcl||ccomp, appos||npadvmod, ccomp||pcomp, prep||amod, mark||advcl,
|
||||||
|
prep||advmod, prep||xcomp, appos||nsubj, attr||ccomp, advmod||prt, dobj||ccomp,
|
||||||
|
aux||conj, advcl||nsubj, conj||conj, advmod||ccomp, advcl||nsubjpass,
|
||||||
|
attr||xcomp, nmod||conj, npadvmod||conj, relcl||dative, prep||expl,
|
||||||
|
nsubjpass||pcomp, advmod||xcomp, advmod||dobj, appos||pobj, nsubj||conj,
|
||||||
|
relcl||nsubjpass, advcl||attr, appos||ccomp, advmod||prep, prep||conj,
|
||||||
|
nmod||attr, punct||conj, neg||conj, dep||xcomp, aux||xcomp, dobj||acl,
|
||||||
|
nummod||pobj, amod||npadvmod, nsubj||pcomp, advcl||acl, appos||nmod,
|
||||||
|
relcl||oprd, prep||prep, cc||pobj, nmod||nsubj, amod||attr, aux||dep,
|
||||||
|
appos||conj, advmod||nsubj, nsubj||advcl, acl||conj
|
||||||
|
To train a parser, your data should include at least 20 instances of each label.
|
||||||
|
⚠ Multiple root labels (ROOT, nsubj, aux, npadvmod, prep) found in
|
||||||
|
training data. spaCy's parser uses a single root label ROOT so this distinction
|
||||||
|
will not be available.
|
||||||
|
|
||||||
|
================================== Summary ==================================
|
||||||
|
✔ 5 checks passed
|
||||||
|
⚠ 8 warnings
|
||||||
|
```
|
||||||
|
|
||||||
|
</Accordion>
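
The low-count warnings in the example output can be approximated with a simple label count. This is a minimal sketch, not spaCy's implementation: the function name `low_count_labels` is hypothetical, and the threshold of 20 is taken from the message in the example output ("at least 20 instances of each label").

```python
from collections import Counter

MIN_EXAMPLES = 20  # threshold quoted in the example output


def low_count_labels(labels, min_examples=MIN_EXAMPLES):
    """Return (label, count) pairs for labels seen fewer than min_examples times.

    Results are sorted by count, then by label, so the rarest labels come first.
    """
    counts = Counter(labels)
    return sorted(
        ((label, n) for label, n in counts.items() if n < min_examples),
        key=lambda item: (item[1], item[0]),
    )


# Toy data: one dependency label per token
dep_labels = ["punct"] * 25 + ["nsubj"] * 21 + ["iobj"]
for label, n in low_count_labels(dep_labels):
    print("Low number of examples for label '{}' ({})".format(label, n))
# prints: Low number of examples for label 'iobj' (1)
```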
|
||||||
|
|
||||||
## Train {#train}
|
## Train {#train}
|
||||||
|
|
||||||
Train a model. Expects data in spaCy's
|
Train a model. Expects data in spaCy's
|
||||||
|
@ -292,7 +453,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| ----------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. |
|
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. |
|
||||||
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
|
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
|
||||||
| `output_dir` | positional | Directory to write models to on each epoch. |
|
| `output_dir` | positional | Directory to write models to on each epoch. |
|
||||||
|
@ -308,8 +469,8 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
|
||||||
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
|
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
|
||||||
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
|
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
|
||||||
| `--n-save-every`, `-se` | option | Save model every X batches. |
|
| `--n-save-every`, `-se` | option | Save model every X batches. |
|
||||||
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.|
|
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
||||||
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files.|
|
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
|
||||||
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
|
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
|
||||||
|
|
||||||
### JSONL format for raw text {#pretrain-jsonl}
|
### JSONL format for raw text {#pretrain-jsonl}
|
||||||
|
@ -331,7 +492,7 @@ tokenization can be provided.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Key | Type | Description |
|
| Key | Type | Description |
|
||||||
| -------- | ------- | -------------------------------------------- |
|
| -------- | ------- | ---------------------------------------------------------- |
|
||||||
| `text` | unicode | The raw input text. Is not required if `tokens` available. |
|
| `text` | unicode | The raw input text. Is not required if `tokens` available. |
|
||||||
| `tokens` | list | Optional tokenization, one string per token. |
|
| `tokens` | list | Optional tokenization, one string per token. |
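
A file in this format can be produced with the standard library alone. The sketch below assumes only what the table states: each line is a JSON object with a `text` key, a `tokens` key, or both. The helper names `to_jsonl_line` and `write_jsonl` are hypothetical.

```python
import json


def to_jsonl_line(text=None, tokens=None):
    """Serialize one record; `text` may be omitted if `tokens` is given."""
    if text is None and tokens is None:
        raise ValueError("Provide `text`, `tokens`, or both")
    record = {}
    if text is not None:
        record["text"] = text
    if tokens is not None:
        record["tokens"] = list(tokens)  # one string per token
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(path, records):
    """Write an iterable of {'text': ..., 'tokens': ...} dicts, one per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(to_jsonl_line(**record) + "\n")


print(to_jsonl_line(text="Can I ask where you work now?"))
print(to_jsonl_line(tokens=["If", "tokens", "are", "provided"]))
```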
|
||||||
|
|
||||||
|
@ -424,7 +585,7 @@ pip install dist/en_model-0.0.0.tar.gz
|
||||||
| `input_dir` | positional | Path to directory containing model data. |
|
| `input_dir` | positional | Path to directory containing model data. |
|
||||||
| `output_dir` | positional | Directory to create package folder in. |
|
| `output_dir` | positional | Directory to create package folder in. |
|
||||||
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
|
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
|
||||||
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt.
|
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. |
|
||||||
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
|
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
| **CREATES** | directory | A Python package containing the spaCy model. |
|
| **CREATES** | directory | A Python package containing the spaCy model. |
|
||||||
|
|
|
@ -10,9 +10,9 @@ menu:
|
||||||
---
|
---
|
||||||
|
|
||||||
This guide describes how to train new statistical models for spaCy's
|
This guide describes how to train new statistical models for spaCy's
|
||||||
part-of-speech tagger, named entity recognizer, dependency parser,
|
part-of-speech tagger, named entity recognizer, dependency parser, text
|
||||||
text classifier and entity linker. Once the model is trained,
|
classifier and entity linker. Once the model is trained, you can then
|
||||||
you can then [save and load](/usage/saving-loading#models) it.
|
[save and load](/usage/saving-loading#models) it.
|
||||||
|
|
||||||
## Training basics {#basics}
|
## Training basics {#basics}
|
||||||
|
|
||||||
|
@ -40,6 +40,19 @@ mkdir models
|
||||||
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
|
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
|
||||||
```
|
```
|
||||||
|
|
||||||
|
<Infobox title="Tip: Debug your data">
|
||||||
|
|
||||||
|
If you're running spaCy v2.2 or above, you can use the
|
||||||
|
[`debug-data` command](/api/cli#debug-data) to analyze and validate your
|
||||||
|
training and development data, get useful stats, and find problems like invalid
|
||||||
|
entity annotations, cyclic dependencies, low data labels and more.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python -m spacy debug-data en train.json dev.json --verbose
|
||||||
|
```
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper
|
You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper
|
||||||
to convert a list of `Doc` objects to spaCy's JSON training format.
|
to convert a list of `Doc` objects to spaCy's JSON training format.
|
||||||
|
|
||||||
|
@ -223,10 +236,9 @@ of being dropped.
|
||||||
> - [`begin_training()`](/api/language#begin_training): Start the training and
|
> - [`begin_training()`](/api/language#begin_training): Start the training and
|
||||||
> return an optimizer function to update the model's weights. Can take an
|
> return an optimizer function to update the model's weights. Can take an
|
||||||
> optional function converting the training data to spaCy's training format.
|
> optional function converting the training data to spaCy's training format.
|
||||||
> - [`update()`](/api/language#update): Update the model with the
|
> - [`update()`](/api/language#update): Update the model with the training
|
||||||
> training example and gold data.
|
> example and gold data.
|
||||||
> - [`to_disk()`](/api/language#to_disk): Save
|
> - [`to_disk()`](/api/language#to_disk): Save the updated model to a directory.
|
||||||
> the updated model to a directory.
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### Example training loop
|
### Example training loop
|
||||||
|
@ -405,19 +417,20 @@ referred to as the "catastrophic forgetting" problem.
|
||||||
|
|
||||||
## Entity linking {#entity-linker}
|
## Entity linking {#entity-linker}
|
||||||
|
|
||||||
To train an entity linking model, you first need to define a knowledge base (KB).
|
To train an entity linking model, you first need to define a knowledge base
|
||||||
|
(KB).
|
||||||
|
|
||||||
### Creating a knowledge base {#kb}
|
### Creating a knowledge base {#kb}
|
||||||
|
|
||||||
A KB consists of a list of entities with unique identifiers. Each such entity
|
A KB consists of a list of entities with unique identifiers. Each such entity
|
||||||
has an entity vector that will be used to measure similarity with the context in
|
has an entity vector that will be used to measure similarity with the context in
|
||||||
which an entity is used. These vectors are pretrained and stored in the KB before
|
which an entity is used. These vectors are pretrained and stored in the KB
|
||||||
the entity linking model will be trained.
|
before the entity linking model is trained.
|
||||||
|
|
||||||
The following example shows how to build a knowledge base from scratch,
|
The following example shows how to build a knowledge base from scratch, given a
|
||||||
given a list of entities and potential aliases. The script further demonstrates
|
list of entities and potential aliases. The script further demonstrates how to
|
||||||
how to pretrain and store the entity vectors. To run this example, the script
|
pretrain and store the entity vectors. To run this example, the script needs
|
||||||
needs access to a `vocab` instance or an `nlp` model with pretrained word embeddings.
|
access to a `vocab` instance or an `nlp` model with pretrained word embeddings.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
|
https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
|
||||||
|
@ -428,10 +441,10 @@ https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
|
||||||
1. **Load the model** you want to start with, or create an **empty model** using
|
1. **Load the model** you want to start with, or create an **empty model** using
|
||||||
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
|
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
|
||||||
a pre-defined [`vocab`](/api/vocab) object.
|
a pre-defined [`vocab`](/api/vocab) object.
|
||||||
2. **Pretrain the entity embeddings** by running the descriptions of the entities
|
2. **Pretrain the entity embeddings** by running the descriptions of the
|
||||||
through a simple encoder-decoder network. The current implementation requires
|
entities through a simple encoder-decoder network. The current implementation
|
||||||
the `nlp` model to have access to pre-trained word embeddings, but a custom
|
requires the `nlp` model to have access to pre-trained word embeddings, but a
|
||||||
implementation of this encoding step can also be used.
|
custom implementation of this encoding step can also be used.
|
||||||
3. **Construct the KB** by defining all entities with their pretrained vectors,
|
3. **Construct the KB** by defining all entities with their pretrained vectors,
|
||||||
and all aliases with their prior probabilities.
|
and all aliases with their prior probabilities.
|
||||||
4. **Save** the KB using [`kb.dump`](/api/kb#dump).
|
4. **Save** the KB using [`kb.dump`](/api/kb#dump).
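
Step 3 needs a prior probability for every alias and candidate entity. The text does not say where these numbers come from. One common choice, assumed here, is the relative frequency of each entity among annotated mentions of the alias. The function name `alias_priors` and the identifiers in the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def alias_priors(mentions):
    """Estimate P(entity | alias) from (alias, entity_id) pairs.

    Assumption: priors are relative frequencies, so they sum to 1 per alias.
    """
    per_alias = defaultdict(Counter)
    for alias, entity_id in mentions:
        per_alias[alias][entity_id] += 1
    priors = {}
    for alias, counts in per_alias.items():
        total = sum(counts.values())
        priors[alias] = {entity_id: n / total for entity_id, n in counts.items()}
    return priors


# Hypothetical annotated mentions: (alias text, entity identifier)
sample_mentions = [
    ("Douglas", "ENTITY_A"),
    ("Douglas", "ENTITY_A"),
    ("Douglas", "ENTITY_B"),
    ("Adams", "ENTITY_A"),
]
for alias, candidates in sorted(alias_priors(sample_mentions).items()):
    print(alias, candidates)
```

The resulting per-alias dictionaries can then be split into parallel lists of entities and probabilities when the aliases are added to the KB.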
|
||||||
|
@ -439,11 +452,11 @@ https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
|
||||||
|
|
||||||
### Training an entity linking model {#entity-linker-model}
|
### Training an entity linking model {#entity-linker-model}
|
||||||
|
|
||||||
This example shows how to create an entity linker pipe using a previously created
|
This example shows how to create an entity linker pipe using a previously
|
||||||
knowledge base. The entity linker pipe is then trained with your own
|
created knowledge base. The entity linker pipe is then trained with your own
|
||||||
examples. To do so, you'll need to provide
|
examples. To do so, you'll need to provide **example texts**, and the
|
||||||
**example texts**, and the **character offsets** and **knowledge base identifiers**
|
**character offsets** and **knowledge base identifiers** of each entity
|
||||||
of each entity contained in the texts.
|
contained in the texts.
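
The three inputs named above (example text, character offsets, knowledge base identifiers) can be combined as in the sketch below. The layout, a `"links"` dictionary keyed by `(start, end)` offsets, is an assumption modeled on the linked example script, so check that script for the authoritative format. The helper `make_link_example` and the `KB_ID_*` identifiers are hypothetical.

```python
def make_link_example(text, mention, gold_kb_id, candidate_kb_ids):
    """Build one (text, annotations) pair for a single entity mention.

    Assumed layout: {"links": {(start, end): {kb_id: probability}}}, where the
    gold identifier gets 1.0 and every other candidate gets 0.0.
    """
    if gold_kb_id not in candidate_kb_ids:
        raise ValueError("gold_kb_id must be one of candidate_kb_ids")
    start = text.find(mention)  # character offset of the first occurrence
    if start == -1:
        raise ValueError("Mention {!r} not found in text".format(mention))
    end = start + len(mention)
    links = {kb_id: 1.0 if kb_id == gold_kb_id else 0.0
             for kb_id in candidate_kb_ids}
    return (text, {"links": {(start, end): links}})


example = make_link_example(
    "Russ Cochran his reprints include EC Comics.",
    "Russ Cochran",
    "KB_ID_1",               # placeholder identifier, not a real KB entry
    ["KB_ID_1", "KB_ID_2"],
)
print(example)
```

Computing the offsets from the mention string avoids off-by-one errors that are easy to make when counting characters by hand.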
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
|
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
|
||||||
|
@ -451,25 +464,23 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_li
|
||||||
|
|
||||||
#### Step by step guide {#step-by-step-entity-linker}
|
#### Step by step guide {#step-by-step-entity-linker}
|
||||||
|
|
||||||
1. **Load the KB** you want to start with, and specify the path
|
1. **Load the KB** you want to start with, and specify the path to the `Vocab`
|
||||||
to the `Vocab` object that was used to create this KB.
|
object that was used to create this KB. Then, create an **empty model** using
|
||||||
Then, create an **empty model** using
|
|
||||||
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
|
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
|
||||||
Don't forget to add the KB to the entity linker,
|
Don't forget to add the KB to the entity linker, and to add the entity linker
|
||||||
and to add the entity linker to the pipeline.
|
to the pipeline. In practical applications, you will want a more advanced
|
||||||
In practical applications, you will want a more advanced pipeline including
|
pipeline that also includes a component for
|
||||||
also a component for [named entity recognition](/usage/training#ner).
|
[named entity recognition](/usage/training#ner). If you're using a model with
|
||||||
If you're using a model with additional components, make sure to disable all other
|
additional components, make sure to disable all other pipeline components
|
||||||
pipeline components during training using
|
during training using [`nlp.disable_pipes`](/api/language#disable_pipes).
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
This way, you'll only be training the entity linker.
|
||||||
training the entity linker.
|
|
||||||
2. **Shuffle and loop over** the examples. For each example, **update the
|
2. **Shuffle and loop over** the examples. For each example, **update the
|
||||||
model** by calling [`nlp.update`](/api/language#update), which steps through
|
model** by calling [`nlp.update`](/api/language#update), which steps through
|
||||||
the annotated examples of the input. For each combination of a mention in text and
|
the annotated examples of the input. For each combination of a mention in
|
||||||
a potential KB identifier, the model makes a **prediction** whether or not
|
text and a potential KB identifier, the model makes a **prediction** whether
|
||||||
this is the correct match. It then
|
or not this is the correct match. It then consults the annotations to see
|
||||||
consults the annotations to see whether it was right. If it was wrong, it
|
whether it was right. If it was wrong, it adjusts its weights so that the
|
||||||
adjusts its weights so that the correct combination will score higher next time.
|
correct combination will score higher next time.
|
||||||
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
|
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
|
||||||
4. **Test** the model to make sure the entities in the training data are
|
4. **Test** the model to make sure the entities in the training data are
|
||||||
recognized correctly.
|
recognized correctly.
|
||||||
|
|