title | teaser | tag | source | new |
---|---|---|---|---|
Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 |
This class manages annotated corpora and can be used for training and
development datasets in the DocBin (.spacy) format. To customize the data
loading during training, you can register your own data readers and batchers.
Config and implementation
spacy.Corpus.v1 is a registered function that creates a Corpus of training or
evaluation data. It takes the same arguments as the Corpus class and returns a
callable that yields Example objects. You can replace it with your own
registered function in the @readers registry to customize the data loading and
streaming.
Example config
[paths]
train = "corpus/train.spacy"

[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = false
max_length = 0
limit = 0
Name | Type | Description |
---|---|---|
path | Path | The directory or filename to read from. Expects data in spaCy's binary .spacy format. |
gold_preproc | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. |
max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. |
limit | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. |
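The factory pattern described above (a registered function takes the config arguments and returns a callable that yields data) can be sketched in plain Python. This is only an illustration of the pattern, not spaCy's registry: `READERS`, `register_reader`, `create_line_corpus` and the name `my_project.LineCorpus.v1` are all hypothetical, and the reader yields plain strings rather than Example objects.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical stand-in for a function registry (not spaCy's implementation)
READERS: Dict[str, Callable] = {}


def register_reader(name: str) -> Callable:
    """Return a decorator that stores a factory function under `name`."""
    def decorator(func: Callable) -> Callable:
        READERS[name] = func
        return func
    return decorator


@register_reader("my_project.LineCorpus.v1")  # hypothetical registry name
def create_line_corpus(lines: Iterable[str], limit: int = 0) -> Callable:
    """Factory: takes config-style arguments and returns the reader callable."""
    def read(nlp) -> Iterator[str]:
        # `nlp` is accepted to mirror the call signature and is unused here
        for i, line in enumerate(lines):
            if limit and i >= limit:
                break
            yield line.strip()
    return read


# Resolving a config-like block: look up the function by name and pass the
# remaining keys as keyword arguments.
block = {"@readers": "my_project.LineCorpus.v1", "lines": ["a\n", "b\n", "c\n"], "limit": 2}
kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
reader = READERS[block["@readers"]](**kwargs)
items = list(reader(None))
```

Separating the factory from the returned callable means the arguments can be fixed from the config once, while the data is only read when the callable is invoked.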
https://github.com/explosion/spaCy/blob/develop/spacy/gold/corpus.py
Corpus.__init__
Create a Corpus for iterating Example objects from a file or directory of
.spacy data files. The gold_preproc setting lets you specify whether to set up
the Example object with gold-standard sentences and tokens for the predictions.
Gold preprocessing helps the annotations align to the tokenization, and may
result in sequences of more consistent length. However, it may reduce runtime
accuracy due to train/test skew.
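To make the trade-off concrete, here is a toy sketch of the choice that gold_preproc controls. It is not spaCy's implementation: `predicted_tokens` is a hypothetical helper, and `str.split` stands in for the pipeline's tokenizer as a deliberately naive assumption.

```python
from typing import Callable, List


def predicted_tokens(
    raw_text: str,
    gold_tokens: List[str],
    tokenize: Callable[[str], List[str]],
    gold_preproc: bool = False,
) -> List[str]:
    """Choose the tokens the prediction side of a training example starts from."""
    if gold_preproc:
        # Reuse the gold-standard tokens, so the annotations align exactly
        return list(gold_tokens)
    # Otherwise tokenize the raw text, as would happen at runtime
    return tokenize(raw_text)


raw = "Don't stop."
gold = ["Do", "n't", "stop", "."]

# Whitespace splitting is a naive stand-in tokenizer (assumption)
with_gold = predicted_tokens(raw, gold, str.split, gold_preproc=True)
without_gold = predicted_tokens(raw, gold, str.split, gold_preproc=False)
```

With gold preprocessing the model only ever sees gold tokens during training, while at runtime it receives whatever the tokenizer produces. That mismatch is the train/test skew mentioned above.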
Example
from spacy.gold import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)
Name | Type | Description |
---|---|---|
path | str / Path | The directory or filename to read from. |
keyword-only | | |
gold_preproc | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to False. |
max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. |
limit | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. |
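The max_length and limit semantics in the table can be sketched on toy data. This is a minimal illustration under stated assumptions, not the library's code: `apply_max_length` and `apply_limit` are hypothetical helpers, a document is modelled as a token list plus optional sentence start offsets, and dropping over-long documents that have no sentence boundaries is an assumption, since the table does not say what happens to them.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

# A toy document: its tokens plus optional sentence start offsets
ToyDoc = Tuple[List[str], Optional[List[int]]]


def apply_max_length(docs: Iterable[ToyDoc], max_length: int = 0) -> Iterator[List[str]]:
    """Yield token lists, splitting over-long documents into sentences."""
    for tokens, sent_starts in docs:
        if max_length == 0 or len(tokens) <= max_length:
            # 0 means no limit; short documents pass through whole
            yield tokens
        elif sent_starts is not None:
            # Split at the sentence boundaries (sentences are not re-checked here)
            bounds = sent_starts + [len(tokens)]
            for start, end in zip(bounds, bounds[1:]):
                yield tokens[start:end]
        # Assumption: over-long documents without sentence boundaries are dropped


def apply_limit(items: Iterable, limit: int = 0) -> Iterator:
    """Keep only the first `limit` items; 0 means no limit."""
    return islice(items, limit) if limit > 0 else iter(items)


docs = [
    (["a", "b"], None),
    (["a", "b", "c", "d"], [0, 2]),
    (["x"] * 5, None),
]
split = list(apply_max_length(docs, max_length=3))
limited = list(apply_limit(apply_max_length(docs, max_length=3), limit=2))
```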
Corpus.__call__
Yield examples from the data.
Example
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)
Name | Type | Description |
---|---|---|
nlp | Language | The current nlp object. |
YIELDS | Example | The examples. |
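Because the call yields its results, the returned object behaves like any Python generator: it can be consumed once. The sketch below shows this with a hypothetical `ToyCorpus` class that yields plain strings. It is a stand-in for illustration and does not read .spacy files.

```python
class ToyCorpus:
    """Hypothetical stand-in whose __call__ yields items like a corpus reader."""

    def __init__(self, items):
        self.items = list(items)

    def __call__(self, nlp):
        # `nlp` mirrors the documented signature and is unused in this toy
        for item in self.items:
            yield item


corpus = ToyCorpus(["a", "b"])
train_data = corpus(None)

first_pass = list(train_data)    # consumes the generator
second_pass = list(train_data)   # the same generator is now exhausted
fresh_pass = list(corpus(None))  # calling again produces a new generator
```

For repeated passes over the data, such as one per epoch, either call the corpus again each time or materialize the examples into a list if they fit in memory.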