mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-28 19:06:33 +03:00
Add notes on preparing training data to docs (#8964)
* Add training data section

  Not entirely sure this is in the right location on the page - maybe it
  should be after quickstart?

* Add pointer from binary format to training data section
* Minor cleanup
* Add to ToC, fix filename
* Update website/docs/usage/training.md

  Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page
* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
parent d65e03adae
commit 4ed5d9ad5a

@@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).

Note that while this is the format used to save training data, you do not have
to understand the internal details to use it or create training data. See the
section on [preparing training data](/usage/training#training-data).

### JSON training format {#json-input tag="deprecated"}

<Infobox variant="warning" title="Changed in v3.0">

@@ -6,6 +6,7 @@ menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']

@@ -355,6 +356,59 @@ that reference this variable.

</Infobox>

## Preparing Training Data {#training-data}

Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.

When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.

Here's an example of creating a `.spacy` file from some NER annotations.

```python
### preprocess.py
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# The DocBin will store the example documents.
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None if the offsets don't align with token boundaries.
        span = doc.char_span(start, end, label=label)
        if span is None:
            print(f"Skipping misaligned entity: {text[start:end]!r}")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```
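
To check that the annotations were stored as intended, the file can be read back and inspected. The following is a minimal sketch and not part of the original example: it writes to a temporary directory instead of `./train.spacy`, and the variable names are illustrative.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Build a one-document DocBin, mirroring the example above.
doc = nlp("Tokyo Tower is 333m tall.")
doc.ents = [doc.char_span(0, 11, label="BUILDING")]
db = DocBin()
db.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    db.to_disk(path)
    # Docs are restored against a vocab, so pass in nlp.vocab.
    restored = DocBin().from_disk(path)
    docs = list(restored.get_docs(nlp.vocab))

# Collect the entities of the first restored document for inspection.
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in docs[0].ents]
print(entities)
```

If the printed entities differ from the source annotations, the usual culprit is a span that did not line up with token boundaries when the `Doc` was created.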

For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
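
Many corpora store entity annotations as token-level BIO tags rather than character offsets. As a rough sketch, assuming tokens that contain no spaces and tags of the form `B-LABEL`, `I-LABEL` or `O`, the hypothetical helper `bio_to_offsets` below (not part of spaCy) produces `(text, [(start, end, label)])` tuples in the same shape as `training_data` above. Joining tokens with single spaces is a simplification; when the original text is available, offsets should be computed against it instead.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]


def bio_to_offsets(tokens: List[str], tags: List[str]) -> Tuple[str, List[Entity]]:
    """Join tokens with single spaces and convert BIO tags to character offsets.

    Hypothetical helper for illustration; assumes tokens contain no spaces.
    """
    text = " ".join(tokens)
    entities: List[Entity] = []
    current = None  # [start, end, label] of the entity being built
    position = 0  # character offset of the current token in `text`
    for token, tag in zip(tokens, tags):
        start, end = position, position + len(token)
        if tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1] = end  # extend the open entity
        else:
            if current is not None:
                # Close the entity that was open before this token.
                entities.append((current[0], current[1], current[2]))
                current = None
            if tag != "O":
                # B- tag, or an I- tag with no matching open entity
                current = [start, end, tag[2:]]
        position = end + 1  # skip the joining space
    if current is not None:
        entities.append((current[0], current[1], current[2]))
    return text, entities


tokens = ["Tokyo", "Tower", "is", "333m", "tall", "."]
tags = ["B-BUILDING", "I-BUILDING", "O", "O", "O", "O"]
example = bio_to_offsets(tokens, tags)
print(example)
```

The resulting tuples can go through the same loop as `training_data`. Because `Doc.char_span` returns `None` for offsets that do not match spaCy's token boundaries, spans from a differently tokenized corpus may need checking before they are assigned to `doc.ents`.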

<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>

In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.

</Accordion>

## Customizing the pipeline and training {#config-custom}

### Defining pipeline components {#config-components}