Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-30 23:47:31 +03:00)

	Add notes on preparing training data to docs (#8964)
* Add training data section
  Not entirely sure this is in the right location on the page - maybe it should be after quickstart?
* Add pointer from binary format to training data section
* Minor cleanup
* Add to ToC, fix filename
* Update website/docs/usage/training.md
  Co-authored-by: Ines Montani <ines@ines.io>
* Update website/docs/usage/training.md
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/usage/training.md
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move the training data section further down the page
* Update website/docs/usage/training.md
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/usage/training.md
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Parent: d65e03adae
Commit: 4ed5d9ad5a

@@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).

Note that while this is the format used to save training data, you do not have
to understand the internal details to use it or create training data. See the
section on [preparing training data](/usage/training#training-data).

### JSON training format {#json-input tag="deprecated"}

<Infobox variant="warning" title="Changed in v3.0">

@@ -6,6 +6,7 @@ menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Training Data', 'training-data']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']

@@ -355,6 +356,59 @@ that reference this variable.

</Infobox>

## Preparing Training Data {#training-data}

Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.

When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.

Here's an example of creating a `.spacy` file from some NER annotations.

```python
### preprocess.py
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```
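
One thing to watch for in a loop like the one above: `Doc.char_span` returns
`None` when the character offsets don't line up with token boundaries, and a
`None` in the list assigned to `doc.ents` will typically raise an error. The
sketch below is one possible way to handle that and is not part of the original
example. It skips misaligned annotations and then reads the file back to check
what was stored. The second training example and the `skipped` counter are made
up for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
    # Hypothetical bad annotation: offset 9 falls inside the token "Tower".
    ("Tokyo Tower is 333m tall.", [(0, 9, "BUILDING")]),
]

db = DocBin()
skipped = 0  # number of annotations that did not align with token boundaries
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets don't match token boundaries, so there is no span to add.
            skipped += 1
            continue
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")

# Read the file back to confirm which entities were actually saved.
loaded = DocBin().from_disk("./train.spacy")
docs = list(loaded.get_docs(nlp.vocab))
for doc in docs:
    print([(ent.text, ent.label_) for ent in doc.ents])
print(f"Skipped {skipped} misaligned annotation(s)")
```

Whether to skip, log, or fix misaligned annotations depends on your data. If
many spans are skipped, it's usually worth checking the offsets or the
tokenization before training.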

For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
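
As a rough illustration of what such a preprocessing step can look like, here's
a minimal sketch that turns token-level BIO tags into the
`(start, end, label)` character offsets used in the NER example. The input
layout (parallel lists of tokens and tags) and the helper name
`bio_to_char_spans` are assumptions made for this sketch, not a spaCy API, and
the text is rebuilt by joining tokens with single spaces, so the offsets refer
to that rebuilt text rather than to any original raw string.

```python
def bio_to_char_spans(tokens, tags):
    """Convert token-level BIO tags to (start, end, label) character offsets.

    Hypothetical helper: assumes the text is rebuilt by joining the tokens
    with single spaces.
    """
    spans = []
    current = None  # [start, end, label] of the entity being built
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[2] == label:
            current[1] = end  # extend the open entity
        else:
            if current is not None:
                spans.append(tuple(current))  # close the open entity
                current = None
            if prefix in ("B", "I"):
                current = [start, end, label]  # start a new entity
        offset = end + 1  # account for the joining space
    if current is not None:
        spans.append(tuple(current))
    return " ".join(tokens), spans


tokens = ["Tokyo", "Tower", "is", "333m", "tall", "."]
tags = ["B-BUILDING", "I-BUILDING", "O", "O", "O", "O"]
text, annotations = bio_to_char_spans(tokens, tags)
training_data = [(text, annotations)]
print(training_data)
```

The resulting `training_data` list has the same shape as the one in the NER
example, so it could be passed through the same kind of loop to build a
`DocBin`.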

<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>

In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.

</Accordion>

## Customizing the pipeline and training {#config-custom}

### Defining pipeline components {#config-components}