| title | teaser | tag | source | new |
| --- | --- | --- | --- | --- |
| Corpus | An annotated corpus | class | spacy/training/corpus.py | 3 |
This class manages annotated corpora and can be used for training and development datasets in the `DocBin` (`.spacy`) format. To customize the data loading during training, you can register your own data readers and batchers.
## Config and implementation

`spacy.Corpus.v1` is a registered function that creates a `Corpus` of training or evaluation data. It takes the same arguments as the `Corpus` class and returns a callable that yields `Example` objects. You can replace it with your own registered function in the `@readers` registry to customize the data loading and streaming.
### Example config

```ini
[paths]
train = "corpus/train.spacy"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
```
| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary `.spacy` format. |
| `gold_preproc` | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
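As a minimal sketch of what a replacement reader might look like (the name `"custom_reader.v1"` and the plain-text file format are hypothetical, chosen for illustration; the decorator and the nlp-to-examples callable shape follow the contract described above):

```python
from pathlib import Path

import spacy
from spacy.training import Example


@spacy.registry.readers("custom_reader.v1")  # hypothetical name
def create_reader(path: str):
    # Return a callable that takes the nlp object and yields Example
    # objects, mirroring the contract described for spacy.Corpus.v1.
    def read_examples(nlp):
        for line in Path(path).open(encoding="utf8"):
            text = line.strip()
            if text:
                doc = nlp.make_doc(text)
                yield Example.from_dict(doc, {})  # no annotations
    return read_examples
```

In the config, you would then reference `@readers = "custom_reader.v1"` in place of `spacy.Corpus.v1`.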
%%GITHUB_SPACY/spacy/training/corpus.py
## Corpus.__init__
Create a `Corpus` for iterating `Example` objects from a file or directory of `.spacy` data files. The `gold_preproc` setting lets you specify whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.
### Example

```python
from spacy.training import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)
```
| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. |
| _keyword-only_ | |
| `gold_preproc` | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
| `augmenter` | Optional data augmentation callback. |
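The `augmenter` callback is applied to each example as the corpus is read. As a minimal, hypothetical sketch of the callback shape (it receives the `nlp` object and a single example and yields zero or more examples; the dropout behavior here is purely illustrative, not a spaCy built-in):

```python
import random


def make_dropout_augmenter(keep_prob: float = 0.9):
    """Hypothetical augmenter: randomly drop a fraction of examples.

    The returned callable follows the augmenter contract sketched
    above: it takes the nlp object and one example, and yields zero
    or more (possibly modified) examples.
    """
    def augment(nlp, example):
        if random.random() < keep_prob:
            yield example
    return augment


# Assumed usage with the constructor documented above:
# corpus = Corpus("./data", augmenter=make_dropout_augmenter(0.8))
```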
## Corpus.__call__
Yield examples from the data.
### Example

```python
import spacy
from spacy.training import Corpus

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)
```
| Name | Description |
| --- | --- |
| `nlp` | The current `nlp` object. |
| **YIELDS** | The examples. |
## JsonlTexts
Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.
### Tip: Writing JSONL

Our utility library `srsly` provides a handy `write_jsonl` helper that takes a file path and a list of dictionaries and writes out JSONL-formatted data.

```python
import srsly

data = [{"text": "Some text"}, {"text": "More..."}]
srsly.write_jsonl("/path/to/text.jsonl", data)
```
### Example

```json
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
```
## JsonlTexts.__init__
Initialize the reader.
### Example

```python
from spacy.training import JsonlTexts

corpus = JsonlTexts("./data/texts.jsonl")
```
### Example config

```ini
[corpora.pretrain]
@readers = "spacy.JsonlReader.v1"
path = "corpus/raw_text.jsonl"
min_length = 0
max_length = 0
limit = 0
```
| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. |
| _keyword-only_ | |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
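To illustrate how the `min_length`/`max_length` skipping behaves, here is a small stand-in sketch (naive whitespace splitting is used in place of spaCy's tokenization, and the helper itself is hypothetical; only the skipping logic mirrors the table above):

```python
import json


def iter_texts(path: str, min_length: int = 0, max_length: int = 0):
    """Yield texts from a JSONL file, skipping too-short/too-long ones.

    A value of 0 means "no limit", as in the table above. Length is a
    naive whitespace token count, standing in for real tokenization.
    """
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            text = json.loads(line)["text"]
            n_tokens = len(text.split())
            if min_length and n_tokens < min_length:
                continue  # shorter than min_length: skip
            if max_length and n_tokens > max_length:
                continue  # longer than max_length: skip
            yield text
```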
## JsonlTexts.__call__
Yield examples from the data.
### Example

```python
import spacy
from spacy.training import JsonlTexts

corpus = JsonlTexts("./texts.jsonl")
nlp = spacy.blank("en")
data = corpus(nlp)
```
| Name | Description |
| --- | --- |
| `nlp` | The current `nlp` object. |
| **YIELDS** | The examples. |