spaCy/website/docs/api/corpus.md
2020-08-06 19:30:43 +02:00

| title | teaser | tag | source | new |
| --- | --- | --- | --- | --- |
| Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 |

This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers.

Config and implementation

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.

Example config

```ini
[paths]
train = "corpus/train.spacy"

[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = false
max_length = 0
limit = 0
```

| Name | Type | Description |
| --- | --- | --- |
| path | Path | The directory or filename to read from. Expects data in spaCy's binary .spacy format. |
| gold_preproc | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. |
| max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. |
| limit | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. |

https://github.com/explosion/spaCy/blob/develop/spacy/gold/corpus.py
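
The reader pattern described in this section (a registered function that takes config arguments and returns a callable yielding examples) can be sketched in plain Python. This is only an illustration of the shape, not spaCy's implementation: the `READERS` dictionary stands in for the @readers registry, the reader name `my_project.LineCorpus.v1` and every function name are hypothetical, and plain dictionaries stand in for Example objects.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for the @readers registry: maps a name to a factory.
READERS: Dict[str, Callable[..., Callable]] = {}


def register_reader(name: str):
    """Decorator that stores a reader factory under `name` (illustrative only)."""
    def decorator(factory):
        READERS[name] = factory
        return factory
    return decorator


@register_reader("my_project.LineCorpus.v1")  # hypothetical reader name
def create_line_corpus(lines: List[str], limit: int = 0):
    """Take config arguments and return a callable that yields examples."""
    def read_examples(nlp) -> Iterator[dict]:
        for i, line in enumerate(lines):
            if limit and i >= limit:  # limit = 0 means no limit
                break
            # A real reader would yield Example objects; a dict stands in here.
            yield {"text": line, "tokens": nlp(line)}
    return read_examples


# Look the factory up by name, as an `@readers = "..."` config line would.
factory = READERS["my_project.LineCorpus.v1"]
reader = factory(lines=["A first text.", "A second text.", "A third one."], limit=2)
examples = list(reader(str.split))  # str.split stands in for the nlp object
```

The two-step design keeps configuration (paths, limits) separate from the nlp object, which is only supplied when the returned callable is invoked.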

Corpus.__init__

Create a Corpus for iterating over Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew, because at runtime the pipeline sees the tokenizer's output rather than gold-standard tokens.
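
The gold_preproc trade-off can be illustrated with a toy example. The following is a minimal sketch in plain Python, not spaCy's implementation: `toy_tokenizer` and `make_pair` are hypothetical helpers, and lists of strings stand in for the predicted and reference documents of an Example.

```python
import re
from typing import List, Tuple


def toy_tokenizer(text: str) -> List[str]:
    """Hypothetical runtime tokenizer: splits off punctuation, keeps clitics attached."""
    return re.findall(r"[\w']+|[^\w\s]", text)


def make_pair(text: str, gold_tokens: List[str], gold_preproc: bool) -> Tuple[List[str], List[str]]:
    """Return (predicted_tokens, reference_tokens) for one example."""
    # With gold_preproc, the predicted side reuses the gold tokens;
    # otherwise it is produced by tokenizing the raw text.
    predicted = list(gold_tokens) if gold_preproc else toy_tokenizer(text)
    return predicted, list(gold_tokens)


text = "I don't know."
gold = ["I", "do", "n't", "know", "."]

pred_gold, ref = make_pair(text, gold, gold_preproc=True)
pred_raw, _ = make_pair(text, gold, gold_preproc=False)

aligned_with_gold = pred_gold == ref   # tokens match the annotations one-to-one
aligned_without = pred_raw == ref      # tokenizer output differs from the annotations
```

In this sketch the gold-preprocessed pair aligns exactly, while the raw-text pair does not. The second case is what the pipeline would see at runtime, which is the source of the skew.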

Example

```python
from spacy.gold import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)
```

| Name | Type | Description |
| --- | --- | --- |
| path | str / Path | The directory or filename to read from. |
| _keyword-only_ | | |
| gold_preproc | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to False. |
| max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. |
| limit | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. |
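
How max_length and limit interact can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: a document is represented as a list of sentences (each a list of tokens), `iter_examples` is a hypothetical helper, sentences that are themselves longer than max_length are skipped, and limit counts emitted examples. The descriptions above do not specify those last two details.

```python
from typing import Iterator, List

Sentence = List[str]
Document = List[Sentence]  # stand-in: a document with known sentence boundaries


def iter_examples(docs: List[Document], *, max_length: int = 0, limit: int = 0) -> Iterator[Sentence]:
    """Yield token lists, splitting over-long documents into sentences."""
    count = 0
    for doc in docs:
        n_tokens = sum(len(sent) for sent in doc)
        if max_length == 0 or n_tokens <= max_length:
            # Short enough (or no maximum set): keep the document whole.
            pieces = [[tok for sent in doc for tok in sent]]
        else:
            # Too long: fall back to sentences.
            # Assumption: sentences that are still too long are skipped.
            pieces = [sent for sent in doc if len(sent) <= max_length]
        for piece in pieces:
            if limit and count >= limit:  # limit = 0 means no limit
                return
            yield piece
            count += 1


docs = [[["a", "b"], ["c", "d", "e"]], [["f"]]]

whole = list(iter_examples(docs))                         # defaults: nothing split or cut
split = list(iter_examples(docs, max_length=3))           # first document split into sentences
capped = list(iter_examples(docs, max_length=3, limit=2)) # stop after two examples
```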

Corpus.__call__

Yield examples from the data.

Example

```python
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)
```

| Name | Type | Description |
| --- | --- | --- |
| nlp | Language | The current nlp object. |
| **YIELDS** | Example | The examples. |