spaCy/spacy/training
Adriane Boyd ff84075839
Support large/infinite training corpora (#7208)
* Support infinite generators for training corpora

Support a training corpus with an infinite generator in the `spacy
train` training loop:

* Revert `create_train_batches` to the state where an infinite generator
can be used as the corpus in the first epoch when training for exactly
one epoch, without resulting in a memory leak (`max_epochs != 1` will
still result in a memory leak)
* Move the shuffling for the first epoch into the corpus reader,
renaming the reader to `spacy.Corpus.v2` (a streaming reader is
sketched below).
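To make the constraint concrete, here is a minimal sketch of a corpus reader backed by an infinite generator. The registry name `stream_corpus.v1` and the inline data source are made up for illustration; `spacy.registry.readers`, `Example.from_dict`, and `nlp.make_doc` are real spaCy APIs:

```python
import spacy
from spacy.training import Example

# Hypothetical stand-in for a real data source such as a file or database.
RAW_DATA = [
    ("Apple is looking at buying a startup.", {"entities": [(0, 5, "ORG")]}),
]

@spacy.registry.readers("stream_corpus.v1")  # hypothetical registry name
def create_stream_corpus():
    def read_examples(nlp):
        # An endless stream of Example objects: nothing is materialized
        # in memory, so it can only be batched, never shuffled in the loop.
        while True:
            for text, annots in RAW_DATA:
                yield Example.from_dict(nlp.make_doc(text), annots)
    return read_examples
```

Because such a stream can never be materialized, any shuffling has to happen inside the reader (as in `spacy.Corpus.v2`) or not at all.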

* Switch to a training option for shuffling in memory

Training loop:

* Add option `training.shuffle_train_corpus_in_memory` that controls
whether the corpus is loaded into memory once and shuffled in the
training loop
  * Revert changes to `create_train_batches` and rename it to
`create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and
a corpus that should be loaded in memory
  * Add `create_train_batches_without_shuffling` for a corpus that
should not be shuffled in the training loop: the corpus is merely
batched during training (both helpers are sketched below)
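The split might look roughly like the following sketch; the signatures are assumptions, not the actual helpers in `spacy/training/loop.py`:

```python
import random

def create_train_batches_with_shuffling(examples, batcher, max_epochs):
    # Load the finite corpus into memory once, then reshuffle per epoch.
    examples = list(examples)
    epoch = 0
    while max_epochs < 1 or epoch < max_epochs:
        random.shuffle(examples)
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1

def create_train_batches_without_shuffling(examples, batcher):
    # Merely batch the stream; an infinite generator is consumed lazily.
    for batch in batcher(examples):
        yield 0, batch
```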

Corpus readers:

* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the
reader instead of in the training loop (see the sketch below)
  * In combination with `shuffle_train_corpus_in_memory = False`, the
reader (and its augmenter) runs again for each epoch, so each epoch can
produce a different augmentation
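A sketch of what such a reader could do, wrapping the standard `Corpus` reader. `spacy.ShuffledCorpus.v1` was an intermediate design in this PR, so the wrapper below is an illustration rather than the shipped implementation:

```python
import random
from spacy.training import Corpus

def create_shuffled_corpus(path, seed=0):
    # Read the whole corpus into memory and yield it in shuffled order.
    # The reader is re-invoked for each epoch, so if the Corpus is
    # configured with an augmenter, augmentation is re-applied each time.
    corpus = Corpus(path)
    rng = random.Random(seed)

    def read_examples(nlp):
        examples = list(corpus(nlp))
        rng.shuffle(examples)
        yield from examples

    return read_examples
```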

* Refactor `create_train_batches` and validation

* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a
`shuffle` option
* Only validate `get_examples` in the initialize step if both:
  * labels are required
  * labels are not provided

* Switch back to `max_epochs = -1` for a streaming train corpus
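With the refactor and the `max_epochs = -1` sentinel, the two earlier helpers collapse into a single function; a simplified sketch under assumed semantics:

```python
import random

def create_train_batches(corpus, batcher, max_epochs):
    # Assumed semantics, reconstructed from the description above:
    #   max_epochs >= 1 -> that many epochs over an in-memory corpus
    #   max_epochs == 0 -> unlimited epochs over an in-memory corpus
    #   max_epochs == -1 -> streamed corpus: batch as-is, never shuffle
    if max_epochs == -1:
        for batch in batcher(corpus):  # an infinite generator is fine here
            yield 0, batch
        return
    examples = list(corpus)  # must be finite to load into memory
    epoch = 0
    while max_epochs < 1 or epoch < max_epochs:
        random.shuffle(examples)
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1
```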

* Use the first 100 examples for streamed train corpus initialization
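Initialization cannot exhaust a streamed corpus, so it peeks at a bounded prefix instead; a sketch using `itertools.islice` with the 100-example cutoff (the helper name is made up):

```python
from itertools import islice

def peek_init_examples(train_corpus, nlp, sample_size=100):
    # Take only the first `sample_size` examples so that initialization
    # terminates even when the corpus is an infinite generator.
    return list(islice(train_corpus(nlp), sample_size))
```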

* Always call `validate_get_examples` in initialize
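`validate_get_examples` from `spacy.training` checks that `get_examples` is callable and yields `Example` objects. A sketch of an initialize step for a hypothetical component that now runs the check unconditionally, rather than only when labels must be inferred:

```python
from spacy.training import validate_get_examples

def initialize(get_examples, labels=None):
    # Validate unconditionally, even when labels are already provided.
    validate_get_examples(get_examples, "MyComponent.initialize")
    if labels is None:
        # Labels are required but not provided: infer them from the data.
        labels = sorted(
            {ent.label_ for eg in get_examples() for ent in eg.reference.ents}
        )
    return labels
```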
2021-04-08 18:08:04 +10:00
converters Switch converters to generator functions (#6547) 2020-12-15 16:47:16 +08:00
__init__.pxd Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
__init__.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
align.pyx Fix alignment for 1-to-1 tokens and lowercasing (#6476) 2020-12-08 14:25:16 +08:00
alignment.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
augment.py Fix lowercase augmentation (#7336) 2021-03-09 14:02:32 +11:00
batchers.py Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
corpus.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
example.pxd Make a pre-check to speed up alignment cache (#6139) 2020-09-24 18:13:39 +02:00
example.pyx Support doc.spans in Example.from_dict (#7197) 2021-03-03 01:12:54 +11:00
gold_io.pyx Use null raw for has_unknown_spaces in docs_to_json 2020-10-15 09:57:54 +02:00
initialize.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
iob_utils.py Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
loggers.py W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429) 2021-04-01 19:36:23 +02:00
loop.py Support large/infinite training corpora (#7208) 2021-04-08 18:08:04 +10:00
pretrain.py replace "is not" with != 2021-03-18 21:09:11 +01:00