Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-31 16:07:41 +03:00)
| title | teaser | tag | source | new | 
|---|---|---|---|---|
| Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 | 
This class manages annotated corpora and can read training and development
datasets in the DocBin (.spacy) format.
Corpus.__init__
Create a Corpus. The input data can be a file or a directory of files.
Example

    from spacy.gold import Corpus

    corpus = Corpus("./train.spacy", "./dev.spacy")
| Name | Type | Description | 
|---|---|---|
| train | str / Path | Training data (.spacy file or directory of .spacy files). | 
| dev | str / Path | Development data (.spacy file or directory of .spacy files). | 
| limit | int | Maximum number of examples returned. 0 for no limit (default). | 
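
The "file or directory" input and the `limit` argument can be illustrated with a small standalone sketch. The helpers `collect_spacy_files` and `apply_limit` below are hypothetical names invented for this illustration and are not part of spaCy. The recursive directory search is also an assumption, since the table above only says that a directory of .spacy files is accepted.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List, TypeVar, Union

T = TypeVar("T")


def collect_spacy_files(path: Union[str, Path]) -> List[Path]:
    """Return the .spacy files behind a path (hypothetical helper, not spaCy API)."""
    path = Path(path)
    if path.is_dir():
        # Assumption: search the directory recursively; sort for a stable order.
        return sorted(p for p in path.rglob("*.spacy") if p.is_file())
    # A single file is returned as a one-element list.
    return [path]


def apply_limit(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` items; 0 means no limit, as in the table above."""
    if limit > 0:
        return islice(items, limit)
    return iter(items)
```

With these helpers, `list(apply_limit(collect_spacy_files("./train.spacy"), limit=0))` would return every matching file, while a positive `limit` would stop early.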
Corpus.train_dataset
Yield examples from the training data.
Example
    from spacy.gold import Corpus
    import spacy

    corpus = Corpus("./train.spacy", "./dev.spacy")
    nlp = spacy.blank("en")
    train_data = corpus.train_dataset(nlp)
| Name | Type | Description | 
|---|---|---|
| nlp | Language | The current nlp object. | 
| keyword-only | | | 
| shuffle | bool | Whether to shuffle the examples. Defaults to True. | 
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. | 
| max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. 0 for no limit (default). | 
| YIELDS | Example | The examples. | 
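
The interaction of `shuffle` and `max_length` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's implementation: documents are modelled as lists of tokenized sentences, "sentence boundaries are available" is modelled as a document having more than one sentence, and the function names and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional

Sentence = List[str]
Document = List[Sentence]  # a document as a list of tokenized sentences


def split_long_docs(docs: Iterable[Document], max_length: int = 0) -> Iterator[Document]:
    """Split documents longer than max_length tokens into one-sentence documents."""
    for doc in docs:
        n_tokens = sum(len(sent) for sent in doc)
        if max_length > 0 and n_tokens > max_length and len(doc) > 1:
            # Too long and sentence boundaries exist: emit each sentence separately.
            for sent in doc:
                yield [sent]
        else:
            # Short enough, no limit set, or nothing to split on: keep as-is.
            yield doc


def prepare_training_docs(
    docs: Iterable[Document],
    shuffle: bool = True,
    max_length: int = 0,
    seed: Optional[int] = None,  # hypothetical parameter, for reproducible demos
) -> List[Document]:
    """Optionally shuffle a copy of the documents, then split the long ones."""
    docs = list(docs)  # copy, so the caller's list is not reordered
    if shuffle:
        random.Random(seed).shuffle(docs)
    return list(split_long_docs(docs, max_length))
```

How sentences that are themselves longer than `max_length` should be handled is not specified in the table above; this sketch keeps them unchanged.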
Corpus.dev_dataset
Yield examples from the development data.
Example
    from spacy.gold import Corpus
    import spacy

    corpus = Corpus("./train.spacy", "./dev.spacy")
    nlp = spacy.blank("en")
    dev_data = corpus.dev_dataset(nlp)
| Name | Type | Description | 
|---|---|---|
| nlp | Language | The current nlp object. | 
| keyword-only | | | 
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. | 
| YIELDS | Example | The examples. | 
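
One way to read `gold_preproc` is as a switch between two sources of input tokens: the gold-standard tokenization stored with the annotations, or a fresh tokenization of the raw text. The sketch below illustrates that reading only; it is an assumption about the option's intent, not a description of spaCy's internals. `whitespace_tokenize` is a stand-in tokenizer and `input_tokens` is a hypothetical helper.

```python
from typing import Callable, List, Sequence


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for this sketch; it is not spaCy's tokenizer."""
    return text.split()


def input_tokens(
    text: str,
    gold_tokens: Sequence[str],
    gold_preproc: bool = False,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Pick the tokens the pipeline would see for one example (hypothetical)."""
    if gold_preproc:
        # Use the annotated tokenization directly.
        return list(gold_tokens)
    # Otherwise tokenize the raw text, which may disagree with the annotation.
    return tokenize(text)
```

For the text `"Hello, world!"` with gold tokens `["Hello", ",", "world", "!"]`, the default path of this sketch produces two tokens while the gold path produces four, which is the kind of mismatch the option controls.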
Corpus.count_train
Get the word count of all training examples.
Example
    from spacy.gold import Corpus
    import spacy

    corpus = Corpus("./train.spacy", "./dev.spacy")
    nlp = spacy.blank("en")
    word_count = corpus.count_train(nlp)
| Name | Type | Description | 
|---|---|---|
| nlp | Language | The current nlp object. | 
| RETURNS | int | The word count. |
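
Conceptually, a word count over a corpus is the sum of the token counts of its documents. The sketch below shows that arithmetic on plain lists of tokens; `count_words` is a hypothetical helper for illustration and does not reproduce how `count_train` is implemented.

```python
from typing import Iterable, Sequence


def count_words(docs: Iterable[Sequence[str]]) -> int:
    """Sum the number of tokens over all documents (hypothetical helper)."""
    # Works with lists and generators alike, since each doc is visited once.
    return sum(len(doc) for doc in docs)
```

Because the function only iterates once, it can also consume a generator of documents without loading them all into memory.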