* Support infinite generators for training corpora
Support a training corpus with an infinite generator in the `spacy
train` training loop:
* Revert `create_train_batches` to the state where an infinite generator
can be used as the in the first epoch of exactly one epoch without
resulting in a memory leak (`max_epochs != 1` will still result in a
memory leak)
* Move the shuffling for the first epoch into the corpus reader,
renaming it to `spacy.Corpus.v2`.
* Switch to training option for shuffling in memory
Training loop:
* Add option `training.shuffle_train_corpus_in_memory` that controls
whether the corpus is loaded in memory once and shuffled in the training
loop
* Revert changes to `create_train_batches` and rename to
`create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and
a corpus that should be loaded in memory
* Add `create_train_batches_without_shuffling` for a corpus that
should not be shuffled in the training loop: the corpus is merely
batched during training
Corpus readers:
* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the
reader instead of the training loop
* In combination with `shuffle_train_corpus_in_memory = False`, each
epoch could result in a different augmentation
* Refactor create_train_batches, validation
* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a
`shuffle` option
* Only validate `get_examples` in initialize step if:
* labels are required
* labels are not provided
* Switch back to max_epochs=-1 for streaming train corpus
* Use first 100 examples for stream train corpus init
* Always check validate_get_examples in initialize
* Fix patience for identical scores
Fix training patience so that the earliest best step is chosen for
identical max scores.
* Restore break, remove print
* Explicitly define best_step for clarity
* Allow output_path to be None during training
* Fix cat scoring (?)
* Improve error message for weighted None score
* Improve messages
So we can call this in other places etc.
* FIx output path check
* Use latest wasabi
* Revert "Improve error message for weighted None score"
This reverts commit 7059926763.
* Exclude None scores from final score by default
It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc.
* Update warnings and use logger.warning
* Fix memory issues in Language.evaluate
Reset annotation in predicted docs before evaluating and store all data
in `examples`.
* Minor refactor to docs generator init
* Fix generator expression
* Fix final generator check
* Refactor pipeline loop
* Handle examples generator in Language.evaluate
* Add test with generator
* Use make_doc
* rename Pipe to TrainablePipe
* split functionality between Pipe and TrainablePipe
* remove unnecessary methods from certain components
* cleanup
* hasattr(component, "pipe") should be sufficient again
* remove serialization and vocab/cfg from Pipe
* unify _ensure_examples and validate_examples
* small fixes
* hasattr checks for self.cfg and self.vocab
* make is_resizable and is_trainable properties
* serialize strings.json instead of vocab
* fix KB IO + tests
* fix typos
* more typos
* _added_strings as a set
* few more tests specifically for _added_strings field
* bump to 3.0.0a36
* Make logging and progress easier to control
* Update docs
* Cleanup errors
* Fix ConfigValidationError
* Pass stdout/stderr, not wasabi.Printer
* Fix type
* Upd logging example
* Fix logger example
* Fix type