diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 1bdeb509a..001455f33 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -283,5 +283,9 @@ CLI [`train`](/api/cli#train) command. The built-in
 of the `.conllu` format used by the
 [Universal Dependencies corpora](https://github.com/UniversalDependencies).
 
+Note that while this is the format used to save training data, you do not need
+to understand the internal details to use it or to create training data. See
+the section on [preparing training data](/usage/training#training-data).
+
 ### JSON training format {#json-input tag="deprecated"}
 
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 6deba3761..0fe34f2a2 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -6,6 +6,7 @@ menu:
   - ['Introduction', 'basics']
   - ['Quickstart', 'quickstart']
   - ['Config System', 'config']
+  - ['Training Data', 'training-data']
   - ['Custom Training', 'config-custom']
   - ['Custom Functions', 'custom-functions']
   - ['Initialization', 'initialization']
@@ -355,4 +356,58 @@ that reference this variable.
 
+## Preparing Training Data {#training-data}
+
+Training data for NLP projects comes in many different formats. For some
+common formats such as CoNLL, spaCy provides [converters](/api/cli#convert)
+you can use from the command line. In other cases you'll have to prepare the
+training data yourself.
+
+When converting training data for use in spaCy, the main thing is to create
+[`Doc`](/api/doc) objects that look just like the results you want as output
+from the pipeline. For example, if you're creating an NER pipeline, loading
+your annotations and setting them as the `.ents` property on a `Doc` is all
+you need to worry about. On disk the annotations will be saved as a
+[`DocBin`](/api/docbin) in the
+[`.spacy` format](/api/data-formats#binary-training), but the details of that
+are handled automatically.
+
+Here's an example of creating a `.spacy` file from some NER annotations:
+
+```python
+### preprocess.py
+import spacy
+from spacy.tokens import DocBin
+
+nlp = spacy.blank("en")
+training_data = [
+    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
+]
+# the DocBin will store the example documents
+db = DocBin()
+for text, annotations in training_data:
+    doc = nlp(text)
+    ents = []
+    for start, end, label in annotations:
+        span = doc.char_span(start, end, label=label)
+        if span is None:
+            # char_span returns None if the offsets don't align with token
+            # boundaries, so skip the annotation rather than storing None
+            print(f"Skipping misaligned entity: ({start}, {end}, {label})")
+        else:
+            ents.append(span)
+    doc.ents = ents
+    db.add(doc)
+db.to_disk("./train.spacy")
+```
+
+For more examples of how to convert training data from a wide variety of
+formats for use with spaCy, look at the preprocessing steps in the
+[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
+
+In spaCy v2, the recommended way to store training data was in
+[a particular JSON format](/api/data-formats#json-input), but in v3 this
+format is deprecated. It's fine as a readable storage format, but there's no
+need to convert your data to JSON before creating a `.spacy` file.
+
 ## Customizing the pipeline and training {#config-custom}
 
 ### Defining pipeline components {#config-components}
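To sanity-check a generated `.spacy` file, it can be loaded back with
[`DocBin.from_disk`](/api/docbin#from_disk) and the stored annotations
inspected. This is a minimal sketch, assuming the `preprocess.py` script above
has already written `./train.spacy`:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# deserialize the DocBin from disk
db = DocBin().from_disk("./train.spacy")
# reconstruct the Doc objects; the vocab isn't stored in the file,
# so get_docs needs one to rebuild the documents
for doc in db.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])
```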
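For data that's already in a supported format, the
[converters](/api/cli#convert) mentioned in the section above go straight from
the command line to `.spacy` files. A sketch, assuming a hypothetical CoNLL-U
file at `./train.conllu` and an existing `./corpus` output directory; the
converter is usually detected from the file extension:

```cli
$ python -m spacy convert ./train.conllu ./corpus
```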
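Once training and development data are available as `.spacy` files, they slot
into the `paths.train` and `paths.dev` variables of the training config. For
example, assuming a `config.cfg` from the quickstart and a `dev.spacy` created
the same way as `train.spacy` above:

```cli
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```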