Mirror of https://github.com/explosion/spaCy.git, synced 2025-12-03 08:14:20 +03:00
* Support infinite generators for training corpora

  Support a training corpus with an infinite generator in the `spacy train` training loop:

  * Revert `create_train_batches` to the state where an infinite generator can be used for the first epoch of a run of exactly one epoch without causing a memory leak (`max_epochs != 1` will still cause a memory leak)
  * Move the shuffling for the first epoch into the corpus reader, renaming it to `spacy.Corpus.v2`

* Switch to training option for shuffling in memory

  Training loop:

  * Add option `training.shuffle_train_corpus_in_memory` that controls whether the corpus is loaded in memory once and shuffled in the training loop
  * Revert changes to `create_train_batches` and rename to `create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and a corpus that should be loaded in memory
  * Add `create_train_batches_without_shuffling` for a corpus that should not be shuffled in the training loop: the corpus is merely batched during training

  Corpus readers:

  * Restore `spacy.Corpus.v1`
  * Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the reader instead of the training loop
    * In combination with `shuffle_train_corpus_in_memory = False`, each epoch could result in a different augmentation

* Refactor `create_train_batches`, validation

  * Rename config setting to `training.shuffle_train_corpus`
  * Refactor to use a single `create_train_batches` method with a `shuffle` option
  * Only validate `get_examples` in initialize step if:
    * labels are required
    * labels are not provided

* Switch back to `max_epochs=-1` for streaming train corpus

* Use first 100 examples for stream train corpus init

* Always check `validate_get_examples` in initialize

| Name | | |
|---|---|---|
| cli | ||
| displacy | ||
| lang | ||
| matcher | ||
| ml | ||
| pipeline | ||
| tests | ||
| tokens | ||
| training | ||
| __init__.pxd | ||
| __init__.py | ||
| __main__.py | ||
| about.py | ||
| attrs.pxd | ||
| attrs.pyx | ||
| compat.py | ||
| default_config_pretraining.cfg | ||
| default_config.cfg | ||
| errors.py | ||
| glossary.py | ||
| kb.pxd | ||
| kb.pyx | ||
| language.py | ||
| lexeme.pxd | ||
| lexeme.pyx | ||
| lookups.py | ||
| morphology.pxd | ||
| morphology.pyx | ||
| parts_of_speech.pxd | ||
| parts_of_speech.pyx | ||
| pipe_analysis.py | ||
| py.typed | ||
| schemas.py | ||
| scorer.py | ||
| strings.pxd | ||
| strings.pyx | ||
| structs.pxd | ||
| symbols.pxd | ||
| symbols.pyx | ||
| tokenizer.pxd | ||
| tokenizer.pyx | ||
| typedefs.pxd | ||
| typedefs.pyx | ||
| util.py | ||
| vectors.pyx | ||
| vocab.pxd | ||
| vocab.pyx | ||
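
The commit message above describes a training loop that can either load the corpus into memory and shuffle it each epoch, or stream it from a possibly infinite generator when `max_epochs=-1`, using only the first 100 examples for initialization. The following is a minimal, self-contained sketch of that idea in plain Python. It is not spaCy's implementation: the signature of `create_train_batches` shown here and the helpers `batched` and `init_sample` are hypothetical, and the meaning assigned to `max_epochs` (`-1` streams, `0` means no epoch limit) is an assumption made for illustration.

```python
import itertools
import random
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing `items`."""
    it = iter(items)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def create_train_batches(
    corpus: Callable[[], Iterable[T]],
    batch_size: int,
    max_epochs: int,
    seed: int = 0,
) -> Iterator[Tuple[int, List[T]]]:
    """Yield (epoch, batch) pairs (hypothetical sketch, not spaCy's code).

    Assumed semantics:
      max_epochs == -1: stream the corpus, never load it into memory,
                        never shuffle it in the loop.
      max_epochs >= 0:  load the corpus once and shuffle it every epoch
                        (0 means no limit on the number of epochs).
    """
    rng = random.Random(seed)
    streaming = max_epochs < 0
    examples: List[T] = []
    if not streaming:
        examples = list(corpus())  # only safe for a finite corpus
        if not examples:
            raise ValueError("training corpus is empty")
    epoch = 0
    while max_epochs < 1 or epoch != max_epochs:
        if streaming:
            source: Iterable[T] = corpus()  # fresh iterator, merely batched
        else:
            rng.shuffle(examples)
            source = examples
        for batch in batched(source, batch_size):
            yield epoch, batch
        epoch += 1


def init_sample(corpus: Callable[[], Iterable[T]], n: int = 100) -> List[T]:
    """Take the first `n` examples of a stream, e.g. for initialization."""
    return list(itertools.islice(corpus(), n))
```

With an infinite corpus such as `lambda: itertools.count()`, the streaming mode holds only one batch in memory at a time, so the caller decides when to stop (for example with `itertools.islice` or a step limit). The in-memory mode calls `list(corpus())` and would never return for such a corpus, which is why the two modes are kept separate in this sketch.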