spaCy/website/docs/api/corpus.md
2020-07-29 11:36:42 +02:00

| title | teaser | tag | source | new |
| --- | --- | --- | --- | --- |
| Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 |

This class manages annotated corpora and can read training and development datasets in the DocBin (.spacy) format.

Corpus.__init__

Create a Corpus. The input data can be a file or a directory of files.

Example

from spacy.gold import Corpus

corpus = Corpus("./train.spacy", "./dev.spacy")

| Name | Type | Description |
| --- | --- | --- |
| train | str / Path | Training data (.spacy file or directory of .spacy files). |
| dev | str / Path | Development data (.spacy file or directory of .spacy files). |
| limit | int | Maximum number of examples returned. 0 for no limit (default). |
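
The `train` and `dev` arguments accept either a single file or a directory, and `limit` treats 0 as "no limit". The following is a minimal sketch of those two conventions using only the standard library. `resolve_spacy_files` and `take` are hypothetical helpers written for illustration and are not part of spaCy; whether spaCy searches directories recursively is an assumption made here.

```python
import tempfile
from itertools import islice
from pathlib import Path


def resolve_spacy_files(location):
    """Return the .spacy files behind a path (hypothetical helper, not spaCy API).

    A file path gives just that file; a directory gives every .spacy file
    below it. Recursing into subdirectories is an assumption of this sketch.
    """
    path = Path(location)
    if path.is_dir():
        return sorted(path.rglob("*.spacy"))
    return [path]


def take(items, limit=0):
    """Return an iterator over at most `limit` items; 0 means no limit."""
    return islice(items, limit) if limit > 0 else iter(items)


# Usage: a throwaway directory with two .spacy files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.spacy", "b.spacy", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in resolve_spacy_files(tmp)]
    single = [p.name for p in resolve_spacy_files(Path(tmp) / "a.spacy")]

limited = list(take(range(10), limit=3))  # only the first three items
unlimited = list(take(range(10)))         # default limit=0 keeps everything
```

Treating 0 as "no limit" keeps the parameter a plain integer, at the cost of not being able to request zero examples.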

Corpus.train_dataset

Yield examples from the training data.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
train_data = corpus.train_dataset(nlp)

| Name | Type | Description |
| --- | --- | --- |
| nlp | Language | The current nlp object. |
| keyword-only | | |
| shuffle | bool | Whether to shuffle the examples. Defaults to True. |
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
| max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. 0 for no limit (default). |
| YIELDS | Example | The examples. |
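
The `max_length` and `shuffle` behavior described above can be sketched on plain Python lists. This is an illustration under stated assumptions, not spaCy's implementation: `split_long_docs` is a hypothetical helper, each document is modeled as a list of sentences (each a list of tokens), a single-sentence document stands in for one without sentence boundaries, and passing such documents through unchanged is an assumption because the table does not say what happens to them.

```python
import random


def split_long_docs(docs, max_length=0):
    """Yield token lists, splitting over-long docs into sentences.

    Illustrative only, not spaCy's code. Each doc is a list of sentences,
    and each sentence is a list of tokens.
    """
    for sentences in docs:
        n_tokens = sum(len(sent) for sent in sentences)
        has_boundaries = len(sentences) > 1
        if max_length > 0 and n_tokens > max_length and has_boundaries:
            # Too long and sentence boundaries are available: one item per sentence.
            yield from (list(sent) for sent in sentences)
        else:
            # Short enough, no limit set, or nothing to split on (assumption:
            # such documents are passed through unchanged).
            yield [tok for sent in sentences for tok in sent]


docs = [
    [["A", "short", "doc", "."]],
    [["First", "sentence", "."], ["Second", "one", "is", "here", "."]],
]

no_limit = list(split_long_docs(docs))             # max_length=0 keeps both docs whole
split = list(split_long_docs(docs, max_length=5))  # the 8-token doc is split in two

# Shuffling with a seeded generator keeps the order reproducible across runs.
shuffled = list(split)
random.Random(0).shuffle(shuffled)
```

Here the first document (4 tokens) is kept whole, while the second (8 tokens) exceeds the limit of 5 and is replaced by its two sentences.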

Corpus.dev_dataset

Yield examples from the development data.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
dev_data = corpus.dev_dataset(nlp)

| Name | Type | Description |
| --- | --- | --- |
| nlp | Language | The current nlp object. |
| keyword-only | | |
| gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
| YIELDS | Example | The examples. |
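
Because the method yields examples rather than returning a list, the result can probably be iterated only once, assuming it behaves like an ordinary Python generator (the text above does not say so explicitly). The stand-in generator `fake_dev_dataset` below is hypothetical and not part of spaCy; it only shows why development data is often materialized with `list()` when it will be scored more than once.

```python
def fake_dev_dataset():
    """Stand-in for a method that yields examples (hypothetical, not spaCy API)."""
    for text in ("first example", "second example"):
        yield text


stream = fake_dev_dataset()
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing left to yield

# Materializing once gives a list that can be reused for every evaluation run.
dev_examples = list(fake_dev_dataset())
runs = [len(dev_examples) for _ in range(3)]
```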

Corpus.count_train

Get the word count of all training examples.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
word_count = corpus.count_train(nlp)

| Name | Type | Description |
| --- | --- | --- |
| nlp | Language | The current nlp object. |
| RETURNS | int | The word count. |
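
As a rough illustration of what a corpus-level word count means, the sketch below sums token counts over pre-tokenized documents represented as plain lists of strings. `count_words` is a hypothetical helper, not spaCy's implementation; it assumes that "words" means tokens (punctuation included), and how the `limit` setting affects the real count is not stated above.

```python
def count_words(docs):
    """Total number of tokens across all docs (illustrative, not spaCy's code)."""
    return sum(len(tokens) for tokens in docs)


train_docs = [
    ["I", "like", "London", "."],
    ["Berlin", "is", "nice", "too", "."],
]
word_count = count_words(train_docs)  # 4 tokens + 5 tokens
```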