spaCy/corpus.md at 361f91e2860aa0c808ccb3985ca62f89da3d5a7b

mirror of https://github.com/explosion/spaCy.git synced 2025-07-14 18:22:27 +03:00

Matthew Honnibal a976da168c

Support data augmentation in Corpus (#6155)

* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test

2020-09-28 03:03:27 +02:00

7.5 KiB


| title | teaser | tag | source | new |
| --- | --- | --- | --- | --- |
| Corpus | An annotated corpus | class | spacy/training/corpus.py | 3 |

This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers.

Config and implementation

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.

Example config

[paths]
train = "corpus/train.spacy"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0

| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary `.spacy` format. ~~Path~~ |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
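
As a minimal sketch of replacing the built-in reader, the snippet below registers a custom function in the `@readers` registry. The registry name `example_texts_reader.v1`, the function `create_texts_reader` and its `texts` argument are hypothetical and chosen only for illustration. Like `spacy.Corpus.v1`, the registered function returns a callable that takes the `nlp` object and yields Example objects.

```python
from typing import Callable, Iterable, List

import spacy
from spacy.language import Language
from spacy.training import Example


# Hypothetical registry name, used only for this illustration.
@spacy.registry.readers("example_texts_reader.v1")
def create_texts_reader(
    texts: List[str], limit: int = 0
) -> Callable[[Language], Iterable[Example]]:
    """Return a reader that turns raw strings into unannotated examples."""

    def read_examples(nlp: Language) -> Iterable[Example]:
        for i, text in enumerate(texts):
            # Mirror the documented convention: 0 means "no limit".
            if limit and i >= limit:
                break
            doc = nlp.make_doc(text)
            # An empty annotation dict yields an example without gold labels.
            yield Example.from_dict(doc, {})

    return read_examples


nlp = spacy.blank("en")
reader = create_texts_reader(["First text.", "Second text.", "Third text."], limit=2)
examples = list(reader(nlp))
```

In a config, such a reader would be referenced by its registered name in the `@readers` setting, provided the module that defines it has been imported before the config is resolved.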

Implementation: `spacy/training/corpus.py`

Corpus.__init__

Create a Corpus for iterating over Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew, because the model does not see gold-standard tokenization at runtime.

Example

from spacy.training import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)

| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |

Corpus.__call__

Yield examples from the data.

Example

from spacy.training import Corpus
import spacy

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)

| Name | Description |
| --- | --- |
| `nlp` | The current `nlp` object. ~~Language~~ |
| YIELDS | The examples. ~~Example~~ |
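
The examples above assume that a `.spacy` file already exists. As a rough end-to-end sketch, the snippet below writes a small DocBin to a temporary directory and reads it back through Corpus. The texts, the temporary path and the `limit=2` setting are made up for illustration, and the documents carry no annotations.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
texts = ["This is a sentence.", "Here is another one.", "And a third."]

# Collect tokenized (but otherwise unannotated) docs in a DocBin.
doc_bin = DocBin(docs=[nlp.make_doc(text) for text in texts])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(path)

    # limit=2 restricts the corpus to a subset of the stored documents.
    corpus = Corpus(path, limit=2)
    # Calling the corpus with the nlp object yields Example objects.
    examples = list(corpus(nlp))
```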

JsonlTexts

Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.

Tip: Writing JSONL

Our utility library srsly provides a handy write_jsonl helper that takes a file path and a list of dictionaries and writes out JSONL-formatted data.

import srsly

data = [{"text": "Some text"}, {"text": "More..."}]
srsly.write_jsonl("/path/to/text.jsonl", data)

Example

{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
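
To make the file layout concrete, here is a small sketch that writes and reads the same kind of records using only Python's standard library instead of srsly. The file name and the records are illustrative; the only structure it relies on is one JSON object per line with a `"text"` key.

```python
import json
import tempfile
from pathlib import Path

records = [{"text": "Some text"}, {"text": "More..."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "texts.jsonl"

    # Write one JSON object per line.
    with path.open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line, skipping blank lines.
    with path.open("r", encoding="utf8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

# Each record exposes its raw text under the "text" key.
texts = [record["text"] for record in loaded]
```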

JsonlTexts.__init__

Initialize the reader.

Example

from spacy.training import JsonlTexts

corpus = JsonlTexts("./data/texts.jsonl")

Example config

[corpora.pretrain]
@readers = "spacy.JsonlReader.v1"
path = "corpus/raw_text.jsonl"
min_length = 0
max_length = 0
limit = 0

| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
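
The interaction of `min_length`, `max_length` and `limit` can be pictured with a plain-Python sketch. This is not the library's implementation: `filter_by_length` is a hypothetical helper, documents are represented as lists of whitespace-separated tokens, and the exact boundary handling (whether a document of exactly `max_length` tokens is kept) is an assumption that may differ from spaCy's behavior.

```python
from typing import Iterable, Iterator, List


def filter_by_length(
    docs: Iterable[List[str]],
    min_length: int = 0,
    max_length: int = 0,
    limit: int = 0,
) -> Iterator[List[str]]:
    """Hypothetical helper: skip documents outside the length bounds.

    A value of 0 disables the corresponding check, matching the
    documented defaults.
    """
    kept = 0
    for tokens in docs:
        if min_length > 0 and len(tokens) < min_length:
            continue  # too short: skipped
        if max_length > 0 and len(tokens) > max_length:
            continue  # too long: skipped (boundary is an assumption)
        yield tokens
        kept += 1
        if limit > 0 and kept >= limit:
            break


raw_texts = ["Hi", "A short text", "This one has quite a few more tokens in it"]
docs = [text.split() for text in raw_texts]

filtered = list(filter_by_length(docs, min_length=2, max_length=5))
unfiltered = list(filter_by_length(docs))
limited = list(filter_by_length(docs, limit=2))
```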

JsonlTexts.__call__

Yield examples from the data.

Example

from spacy.training import JsonlTexts
import spacy

corpus = JsonlTexts("./texts.jsonl")
nlp = spacy.blank("en")
data = corpus(nlp)

| Name | Description |
| --- | --- |
| `nlp` | The current `nlp` object. ~~Language~~ |
| YIELDS | The examples. ~~Example~~ |
