spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-02 13:36:18 +03:00

Adriane Boyd ff84075839 Support large/infinite training corpora (#7208)		2021-04-08 18:08:04 +10:00

* Support infinite generators for training corpora

Support a training corpus with an infinite generator in the `spacy train` training loop:

* Revert `create_train_batches` to the state where an infinite generator can be used as the in the first epoch of exactly one epoch without resulting in a memory leak (`max_epochs != 1` will still result in a memory leak)
* Move the shuffling for the first epoch into the corpus reader, renaming it to `spacy.Corpus.v2`.

* Switch to training option for shuffling in memory

Training loop:

* Add option `training.shuffle_train_corpus_in_memory` that controls whether the corpus is loaded in memory once and shuffled in the training loop
* Revert changes to `create_train_batches` and rename to `create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and a corpus that should be loaded in memory
* Add `create_train_batches_without_shuffling` for a corpus that should not be shuffled in the training loop: the corpus is merely batched during training

Corpus readers:

* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the reader instead of the training loop
  * In combination with `shuffle_train_corpus_in_memory = False`, each epoch could result in a different augmentation

* Refactor create_train_batches, validation

* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a `shuffle` option
* Only validate `get_examples` in initialize step if:
  * labels are required
  * labels are not provided

* Switch back to max_epochs=-1 for streaming train corpus

* Use first 100 examples for stream train corpus init

* Always check validate_get_examples in initialize
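
The commit message above contrasts a corpus that is loaded into memory and shuffled each epoch with a corpus that is streamed and only batched. The following is a minimal, standalone sketch of that idea, not the code from the commit: `create_train_batches_sketch`, `minibatch`, `get_examples`, `batch_size` and `rng` are hypothetical names, and the treatment of `max_epochs` (negative means "stream", anything else means "load and shuffle in memory", values below 1 never stop on their own) is an assumption based on the `max_epochs=-1` note in the message.

```python
import itertools
import random
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def minibatch(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group any iterable (finite or infinite) into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def create_train_batches_sketch(
    get_examples: Callable[[], Iterable[T]],
    batch_size: int,
    max_epochs: int,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[int, List[T]]]:
    """Yield (epoch, batch) pairs.

    Assumption: max_epochs < 0 streams the corpus without loading it,
    so an infinite generator is safe; max_epochs >= 0 loads the corpus
    into memory once and reshuffles it at the start of every epoch.
    """
    rng = rng if rng is not None else random.Random(0)
    in_memory: Optional[List[T]] = None
    if max_epochs >= 0:
        # Only safe for a finite corpus: the whole corpus is materialized.
        in_memory = list(get_examples())
        if not in_memory:
            raise ValueError("training corpus is empty")
    epoch = 0
    while max_epochs < 1 or epoch != max_epochs:
        if in_memory is not None:
            rng.shuffle(in_memory)
            stream: Iterable[T] = in_memory
        else:
            # Streaming: ask the reader for a fresh iterator, never call list().
            stream = get_examples()
        for batch in minibatch(stream, batch_size):
            yield epoch, batch
        epoch += 1
```

With a streamed corpus the caller decides when to stop, for example `itertools.islice(create_train_batches_sketch(lambda: itertools.count(), 2, -1), 3)` takes three batches from an endless source. With an infinite generator the epoch counter never advances, since the first epoch never finishes.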
cli	Add --code option to init fill-config	2021-03-12 10:03:57 +01:00
displacy	Also exclude user hooks in displacy conversion (#7419)	2021-03-12 09:41:59 +01:00
lang	Added more exception to the italian language from https://forum.wordr … (#7246)	2021-03-30 10:23:32 +02:00
matcher	Preserve user data for DependencyMatcher on spans (#7528)	2021-03-30 12:26:22 +02:00
ml	Fixing pretrain (#7342)	2021-03-09 14:01:13 +11:00
pipeline	Update lexeme_norm checks	2021-03-19 10:59:27 +01:00
tests	Fix __add__ method of PRFScore (#7557)	2021-04-08 17:34:14 +10:00
tokens	Fix/update extension copying in Span.as_doc and Doc.from_docs (#7574)	2021-03-30 09:49:12 +02:00
training	Support large/infinite training corpora (#7208)	2021-04-08 18:08:04 +10:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Add vocab kwarg back to spacy.load	2021-03-11 10:58:59 +01:00
__main__.py	Tidy up	2020-06-22 00:45:40 +02:00
about.py	Update thinc pin and set version to v3.0.5 (#7389)	2021-03-10 11:10:53 +01:00
attrs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
attrs.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
compat.py	Use Literal type for nr_feature_tokens	2020-09-23 16:00:03 +02:00
default_config_pretraining.cfg	pretrain architectures (#6451)	2020-12-08 14:41:03 +08:00
default_config.cfg	Support large/infinite training corpora (#7208)	2021-04-08 18:08:04 +10:00
errors.py	Add warning if initial vectors are empty (#7641)	2021-04-04 20:20:24 +02:00
glossary.py	unicode -> str consistency	2020-05-24 17:20:58 +02:00
kb.pxd	Revert added_strings change (#6236)	2020-10-10 18:55:07 +02:00
kb.pyx	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
language.py	Proactively remove unused listeners	2021-03-17 22:41:41 +09:00
lexeme.pxd	Fix Lexeme.from_ptr	2020-08-10 16:43:37 +02:00
lexeme.pyx	reduce memory load when reading all vectors from file (#6945)	2021-02-07 08:05:43 +08:00
lookups.py	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
morphology.pxd	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
morphology.pyx	Prevent 0-length mem alloc (#6653)	2021-01-06 12:50:17 +11:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
pipe_analysis.py	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
py.typed	Add py.typed	2021-03-16 09:48:31 +01:00
schemas.py	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
scorer.py	Fix __add__ method of PRFScore (#7557)	2021-04-08 17:34:14 +10:00
strings.pxd	Remove 'cleanup' of strings (#6007)	2020-09-01 16:12:15 +02:00
strings.pyx	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
structs.pxd	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)	2021-01-14 17:30:41 +11:00
symbols.pxd	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
symbols.pyx	Add _ as a symbol (#6153)	2020-09-27 22:20:14 +02:00
tokenizer.pxd	Simplify specials and cache checks (#6012)	2020-09-03 09:42:49 +02:00
tokenizer.pyx	Run PhraseMatcher on Spans (#6918)	2021-02-10 23:43:32 +11:00
typedefs.pxd	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master	2020-11-25 11:49:34 +01:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Update lexeme_norm checks	2021-03-19 10:59:27 +01:00
vectors.pyx	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
vocab.pxd	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master	2020-11-25 11:49:34 +01:00
vocab.pyx	Extend docs related to Vocab.get_noun_chunks	2021-02-25 16:38:21 +01:00