spaCy/spacy/training
Adriane Boyd ff84075839
Support large/infinite training corpora (#7208)
* Support infinite generators for training corpora

Support a training corpus with an infinite generator in the `spacy
train` training loop:

* Revert `create_train_batches` to the state where an infinite generator
can be used as the corpus in the first epoch when training for exactly
one epoch, without resulting in a memory leak (`max_epochs != 1` will
still result in a memory leak)
* Move the shuffling for the first epoch into the corpus reader,
renaming the reader to `spacy.Corpus.v2` (a streaming reader is
sketched below).
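To make the constraint concrete, here is a minimal sketch of a corpus reader backed by an infinite generator. The registry name `stream_corpus.v1` and the inline data source are made up for illustration; `spacy.registry.readers`, `Example.from_dict`, and `nlp.make_doc` are real spaCy APIs:

```python
import spacy
from spacy.training import Example

# Hypothetical stand-in for a real data source such as a file or database.
RAW_DATA = [
    ("Apple is looking at buying a startup.", {"entities": [(0, 5, "ORG")]}),
]

@spacy.registry.readers("stream_corpus.v1")  # hypothetical registry name
def create_stream_corpus():
    def read_examples(nlp):
        # An endless stream of Example objects: nothing is materialized
        # in memory, so it can only be batched, never shuffled in the loop.
        while True:
            for text, annots in RAW_DATA:
                yield Example.from_dict(nlp.make_doc(text), annots)
    return read_examples
```

Because such a stream can never be materialized, any shuffling has to happen inside the reader (as in `spacy.Corpus.v2`) or not at all.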

* Switch to a training option for shuffling in memory

Training loop:

* Add option `training.shuffle_train_corpus_in_memory` that controls
whether the corpus is loaded into memory once and shuffled in the
training loop
  * Revert changes to `create_train_batches` and rename it to
`create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and
a corpus that should be loaded in memory
  * Add `create_train_batches_without_shuffling` for a corpus that
should not be shuffled in the training loop: the corpus is merely
batched during training (both helpers are sketched below)
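The split might look roughly like the following sketch; the signatures are assumptions, not the actual helpers in `spacy/training/loop.py`:

```python
import random

def create_train_batches_with_shuffling(examples, batcher, max_epochs):
    # Load the finite corpus into memory once, then reshuffle per epoch.
    examples = list(examples)
    epoch = 0
    while max_epochs < 1 or epoch < max_epochs:
        random.shuffle(examples)
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1

def create_train_batches_without_shuffling(examples, batcher):
    # Merely batch the stream; an infinite generator is consumed lazily.
    for batch in batcher(examples):
        yield 0, batch
```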

Corpus readers:

* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the
reader instead of in the training loop (see the sketch below)
  * In combination with `shuffle_train_corpus_in_memory = False`, the
reader (and its augmenter) runs again for each epoch, so each epoch can
produce a different augmentation
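A sketch of what such a reader could do, wrapping the standard `Corpus` reader. `spacy.ShuffledCorpus.v1` was an intermediate design in this PR, so the wrapper below is an illustration rather than the shipped implementation:

```python
import random
from spacy.training import Corpus

def create_shuffled_corpus(path, seed=0):
    # Read the whole corpus into memory and yield it in shuffled order.
    # The reader is re-invoked for each epoch, so if the Corpus is
    # configured with an augmenter, augmentation is re-applied each time.
    corpus = Corpus(path)
    rng = random.Random(seed)

    def read_examples(nlp):
        examples = list(corpus(nlp))
        rng.shuffle(examples)
        yield from examples

    return read_examples
```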

* Refactor `create_train_batches` and validation

* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a
`shuffle` option
* Only validate `get_examples` in the initialize step if both:
  * labels are required
  * labels are not provided

* Switch back to `max_epochs = -1` for a streaming train corpus
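With the refactor and the `max_epochs = -1` sentinel, the two earlier helpers collapse into a single function; a simplified sketch under assumed semantics:

```python
import random

def create_train_batches(corpus, batcher, max_epochs):
    # Assumed semantics, reconstructed from the description above:
    #   max_epochs >= 1 -> that many epochs over an in-memory corpus
    #   max_epochs == 0 -> unlimited epochs over an in-memory corpus
    #   max_epochs == -1 -> streamed corpus: batch as-is, never shuffle
    if max_epochs == -1:
        for batch in batcher(corpus):  # an infinite generator is fine here
            yield 0, batch
        return
    examples = list(corpus)  # must be finite to load into memory
    epoch = 0
    while max_epochs < 1 or epoch < max_epochs:
        random.shuffle(examples)
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1
```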

* Use the first 100 examples for streamed train corpus initialization
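Initialization cannot exhaust a streamed corpus, so it peeks at a bounded prefix instead; a sketch using `itertools.islice` with the 100-example cutoff (the helper name is made up):

```python
from itertools import islice

def peek_init_examples(train_corpus, nlp, sample_size=100):
    # Take only the first `sample_size` examples so that initialization
    # terminates even when the corpus is an infinite generator.
    return list(islice(train_corpus(nlp), sample_size))
```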

* Always call `validate_get_examples` in initialize
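`validate_get_examples` from `spacy.training` checks that `get_examples` is callable and yields `Example` objects. A sketch of an initialize step for a hypothetical component that now runs the check unconditionally, rather than only when labels must be inferred:

```python
from spacy.training import validate_get_examples

def initialize(get_examples, labels=None):
    # Validate unconditionally, even when labels are already provided.
    validate_get_examples(get_examples, "MyComponent.initialize")
    if labels is None:
        # Labels are required but not provided: infer them from the data.
        labels = sorted(
            {ent.label_ for eg in get_examples() for ent in eg.reference.ents}
        )
    return labels
```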
2021-04-08 18:08:04 +10:00
converters Switch converters to generator functions (#6547) 2020-12-15 16:47:16 +08:00
__init__.pxd Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
__init__.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
align.pyx Fix alignment for 1-to-1 tokens and lowercasing (#6476) 2020-12-08 14:25:16 +08:00
alignment.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
augment.py Fix lowercase augmentation (#7336) 2021-03-09 14:02:32 +11:00
batchers.py Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
corpus.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
example.pxd Make a pre-check to speed up alignment cache (#6139) 2020-09-24 18:13:39 +02:00
example.pyx Support doc.spans in Example.from_dict (#7197) 2021-03-03 01:12:54 +11:00
gold_io.pyx Use null raw for has_unknown_spaces in docs_to_json 2020-10-15 09:57:54 +02:00
initialize.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
iob_utils.py Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
loggers.py W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429) 2021-04-01 19:36:23 +02:00
loop.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
pretrain.py replace "is not" with != 2021-03-18 21:09:11 +01:00