Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 01:48:04 +03:00)

| title | teaser | tag | source | new | 
|---|---|---|---|---|
| Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 | 
This class manages annotated corpora and can read training and development
datasets in the DocBin (.spacy) format.
Corpus.__init__
Create a Corpus. The input data can be a file or a directory of files.
Example
```python
from spacy.gold import Corpus

corpus = Corpus("./train.spacy", "./dev.spacy")
```
| Name | Type | Description |
|---|---|---|
| train | str / Path | Training data (.spacy file or directory of .spacy files). |
| dev | str / Path | Development data (.spacy file or directory of .spacy files). |
| limit | int | Maximum number of examples returned. 0 for no limit (default). |
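
The two conventions in the table above (an input path may be a single file or a directory, and a `limit` of 0 means no limit) can be illustrated without spaCy. The following is a minimal sketch and not the library's implementation: `find_spacy_files` and `take_limited` are hypothetical helper names, and the recursive directory search is an assumption that the table does not confirm.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List, TypeVar, Union

T = TypeVar("T")


def find_spacy_files(path: Union[str, Path]) -> List[Path]:
    """Hypothetical helper: resolve a file or directory to a list of .spacy files."""
    path = Path(path)
    if path.is_dir():
        # Assumption: directories are searched recursively for *.spacy files.
        return sorted(p for p in path.rglob("*.spacy") if p.is_file())
    # A single file is returned as-is.
    return [path]


def take_limited(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Hypothetical helper: yield at most `limit` items; 0 means no limit."""
    return iter(items) if limit == 0 else islice(items, limit)


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "part1.spacy").touch()
    (root / "notes.txt").touch()
    found_names = [p.name for p in find_spacy_files(root)]

first_two = list(take_limited(range(5), limit=2))
everything = list(take_limited(range(5)))
```

Here `found_names` contains only the `.spacy` file, `first_two` holds two items, and `everything` holds all five, mirroring the "0 for no limit" default.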
Corpus.train_dataset
Yield examples from the training data.
Example
```python
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
train_data = corpus.train_dataset(nlp)
```
| Name | Type | Description |
|---|---|---|
| nlp | Language | The current nlp object. |
| _keyword-only_ | | |
| shuffle | bool | Whether to shuffle the examples. Defaults to True. |
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
| max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. 0 for no limit (default). |
| **YIELDS** | Example | The examples. |
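
The `max_length` rule is easier to see on plain token lists. The sketch below is an illustration under stated assumptions, not spaCy's code: `split_long_doc` is a hypothetical helper, documents are modelled as lists of strings, and the handling of over-length documents that have no sentence boundaries (kept whole here) is an assumption, since the description above does not cover that case.

```python
from typing import Iterator, List, Optional

Tokens = List[str]


def split_long_doc(
    tokens: Tokens,
    sentences: Optional[List[Tokens]],
    max_length: int = 0,
) -> Iterator[Tokens]:
    """Hypothetical helper: yield the document whole, or sentence by sentence."""
    # 0 disables the length check entirely.
    too_long = max_length > 0 and len(tokens) > max_length
    if too_long and sentences:
        # Sentence boundaries are available, so split the long document.
        yield from sentences
    else:
        # Assumption: without sentence boundaries the document is kept whole.
        yield tokens


doc = ["I", "like", "tea", ".", "She", "likes", "coffee", "."]
sents = [doc[:4], doc[4:]]

unlimited = list(split_long_doc(doc, sents))                 # no limit
split = list(split_long_doc(doc, sents, max_length=5))       # split into sentences
unsplit = list(split_long_doc(doc, None, max_length=5))      # no boundaries known
```

With eight tokens and `max_length=5`, the document is yielded as its two four-token sentences; with the default of 0 it is yielded unchanged.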
Corpus.dev_dataset
Yield examples from the development data.
Example
```python
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
dev_data = corpus.dev_dataset(nlp)
```
| Name | Type | Description |
|---|---|---|
| nlp | Language | The current nlp object. |
| _keyword-only_ | | |
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
| **YIELDS** | Example | The examples. |
Corpus.count_train
Get the word count of all training examples.
Example
```python
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
word_count = corpus.count_train(nlp)
```
| Name | Type | Description |
|---|---|---|
| nlp | Language | The current nlp object. |
| **RETURNS** | int | The word count. |
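
As a rough illustration of what a word count over a set of examples amounts to, the sketch below sums token counts over documents modelled as lists of strings. `count_words` is a hypothetical helper, and treating every token (including punctuation) as a word is an assumption; the description above does not say how words are counted.

```python
from typing import Iterable, List


def count_words(docs: Iterable[List[str]]) -> int:
    """Hypothetical helper: total number of tokens across all documents."""
    # Assumption: every token counts as one word, punctuation included.
    return sum(len(tokens) for tokens in docs)


docs = [["I", "like", "tea", "."], ["Hello", "world"]]
total = count_words(docs)
```

For the two toy documents above, `total` is the sum of their lengths, 4 + 2.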