Add notes on preparing training data to docs (#8964)

* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Paul O'Leary McCann 2021-08-17 00:37:21 +09:00 committed by svlandeg
parent d65e03adae
commit 4ed5d9ad5a
2 changed files with 58 additions and 0 deletions

@@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).
Note that while this is the format used to save training data, you do not have
to understand the internal details to use it or create training data. See the
section on [preparing training data](/usage/training#training-data).
### JSON training format {#json-input tag="deprecated"}
<Infobox variant="warning" title="Changed in v3.0">

@@ -6,6 +6,7 @@ menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
@@ -355,6 +356,59 @@ that reference this variable.
</Infobox>
## Preparing Training Data {#training-data}

Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.

When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.

Here's an example of creating a `.spacy` file from some NER annotations.
```python
### preprocess.py
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```
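One thing the example above does not cover is annotations whose character offsets do not line up with token boundaries. In that case `doc.char_span` returns `None` by default, and assigning `None` to `doc.ents` fails. The following is a minimal sketch of one way to handle this, not part of the original commit; the second data row (with its off-by-one end offset) and the `skipped` list are hypothetical and only there for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# Hypothetical data: the second row has an off-by-one end offset ("Tokyo Towe").
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
    ("Tokyo Tower is 333m tall.", [(0, 10, "BUILDING")]),
]

db = DocBin()
skipped = []  # annotations that did not align with token boundaries
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Record the problem instead of passing None to doc.ents.
            skipped.append((text[start:end], label))
            continue
        ents.append(span)
    doc.ents = ents
    db.add(doc)

# Alternatively, alignment_mode="expand" snaps the offsets outward
# to the nearest token boundaries instead of returning None.
doc = nlp("Tokyo Tower is 333m tall.")
expanded = doc.char_span(0, 10, label="BUILDING", alignment_mode="expand")

print(skipped)
print(expanded.text)
```

Whether to skip, expand, or fix the source annotations depends on your data; logging the skipped spans at least makes silent data loss visible.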
For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
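Whichever format you start from, it can be worth loading the resulting `.spacy` file back to confirm that the annotations survived the conversion. The sketch below is an illustration rather than part of the original commit: it writes a single `Doc` to a temporary file (the temporary directory and the `found` variable are assumptions made for the example) and reads the entities back with `DocBin.from_disk` and `DocBin.get_docs`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Tokyo Tower is 333m tall.")
doc.ents = [doc.char_span(0, 11, label="BUILDING")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[doc]).to_disk(path)
    # Load the file back; a vocab is needed to rebuild the Doc objects.
    loaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))

# Collect (text, label) pairs for every entity in the reloaded docs.
found = [(ent.text, ent.label_) for d in loaded for ent in d.ents]
print(found)
```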
<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>
In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.
</Accordion>
## Customizing the pipeline and training {#config-custom}
### Defining pipeline components {#config-components}