Add notes on preparing training data to docs (#8964)

* Add training data section

  Not entirely sure this is in the right location on the page - maybe it should be after quickstart?

* Add pointer from binary format to training data section
* Minor cleanup
* Add to ToC, fix filename
* Update website/docs/usage/training.md

  Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page
* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-11-05 18:37:27 +03:00 · 2021-08-17 00:37:21 +09:00
commit 4ed5d9ad5a
parent d65e03adae
2 changed files with 58 additions and 0 deletions
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
 of the `.conllu` format used by the
 [Universal Dependencies corpora](https://github.com/UniversalDependencies).
 Note that while this is the format used to save training data, you do not have
 to understand the internal details to use it or create training data. See the
 section on [preparing training data](/usage/training#training-data).
 ### JSON training format {#json-input tag="deprecated"}
 <Infobox variant="warning" title="Changed in v3.0">
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -6,6 +6,7 @@ menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
@@ -355,6 +356,59 @@ that reference this variable.
 </Infobox>
 ## Preparing Training Data {#training-data}
 Training data for NLP projects comes in many different formats. For some common
 formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
 from the command line. In other cases you'll have to prepare the training data
 yourself.
 When converting training data for use in spaCy, the main thing is to create
 [`Doc`](/api/doc) objects just like the results you want as output from the
 pipeline. For example, if you're creating an NER pipeline, loading your
 annotations and setting them as the `.ents` property on a `Doc` is all you need
 to worry about. On disk the annotations will be saved as a
 [`DocBin`](/api/docbin) in the
 [`.spacy` format](/api/data-formats#binary-training), but the details of that
 are handled automatically.
 Here's an example of creating a `.spacy` file from some NER annotations.
 ```python
 ### preprocess.py
 import spacy
 from spacy.tokens import DocBin
 nlp = spacy.blank("en")
 training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
 ]
 # the DocBin will store the example documents
 db = DocBin()
 for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
 db.to_disk("./train.spacy")
 ```
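
The example above assumes every `(start, end, label)` offset is clean. In practice `doc.char_span` returns `None` when the offsets don't line up with token boundaries, and `doc.ents` rejects overlapping spans, so it can help to sanity-check annotations before building any `Doc` objects. The following is a minimal sketch in plain Python; `check_annotations` is a hypothetical helper, not part of spaCy, and it only catches problems visible without a tokenizer (out-of-range offsets, stray whitespace, overlaps).

```python
# Hypothetical pre-flight check for (text, [(start, end, label), ...]) pairs.
# This helper is illustrative only and is not part of the spaCy API.
def check_annotations(text, annotations):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    previous_end = 0
    # Sort by start offset so overlaps can be detected in a single pass.
    for start, end, label in sorted(annotations):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) fall outside the text")
            continue
        snippet = text[start:end]
        if snippet != snippet.strip():
            problems.append(f"{label}: {snippet!r} has leading or trailing whitespace")
        if start < previous_end:
            problems.append(f"{label}: ({start}, {end}) overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems


training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
for text, annotations in training_data:
    for problem in check_annotations(text, annotations):
        print(problem)
```

Offsets that pass this check can still fall inside a token, so the conversion loop may also want to skip or report any span for which `doc.char_span` returns `None`.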
 For more examples of how to convert training data from a wide variety of formats
 for use with spaCy, look at the preprocessing steps in the
 [tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
 <Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>
 In spaCy v2, the recommended way to store training data was in
 [a particular JSON format](/api/data-formats#json-input), but in v3 this format
 is deprecated. It's fine as a readable storage format, but there's no need to
 convert your data to JSON before creating a `.spacy` file.
 </Accordion>
 ## Customizing the pipeline and training {#config-custom}
 ### Defining pipeline components {#config-components}