spaCy/spacy
Matthew Honnibal 5d0d2de955 Support 'memory zones' for user memory management
Add a context manager, nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone has expired.
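
A minimal usage sketch of the rule above: do all work on the Doc objects
inside the block and carry only plain Python values out of it. The helper
name token_counts_in_zone is hypothetical; it assumes only that `nlp`
provides the memory_zone() context manager described here and the usual
nlp.pipe() method.

```python
from typing import Iterable, List


def token_counts_in_zone(nlp, texts: Iterable[str]) -> List[int]:
    """Return the number of tokens per text without keeping any Doc alive.

    Hypothetical helper: `nlp` is assumed to offer `memory_zone()` (as
    described in this commit message) and `pipe()`.
    """
    counts: List[int] = []
    with nlp.memory_zone():
        for doc in nlp.pipe(texts):
            # Copy out plain Python values only. Storing `doc` or its
            # tokens for use after the block would be invalid.
            counts.append(len(doc))
    # Only integers leave the zone, so nothing freed is touched later.
    return counts
```

With a real pipeline this might be called as
token_counts_in_zone(nlp, texts), assuming a spaCy version that includes
this commit.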

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay, none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.
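
The idea can be illustrated with a toy string store. This is a sketch of
the general technique only, not spaCy's implementation: the class
ToyStringStore and all of its members are hypothetical, and the real
StringStore and Vocab are Cython classes with a different internal layout.
Entries added inside the zone are recorded and dropped when the zone ends,
while entries that existed beforehand survive.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class ToyStringStore:
    """Hypothetical, simplified stand-in for an interning string store."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._next_id = 0
        # None means "no zone active"; a list collects zone-local entries.
        self._zone_keys: Optional[List[str]] = None

    def add(self, text: str) -> int:
        """Intern `text` and return its integer ID."""
        if text not in self._ids:
            self._ids[text] = self._next_id
            self._next_id += 1
            if self._zone_keys is not None:
                # Remember that this entry only lives as long as the zone.
                self._zone_keys.append(text)
        return self._ids[text]

    def __contains__(self, text: str) -> bool:
        return text in self._ids

    def __len__(self) -> int:
        return len(self._ids)

    @contextmanager
    def memory_zone(self) -> Iterator["ToyStringStore"]:
        if self._zone_keys is not None:
            raise RuntimeError("nested zones are not supported in this sketch")
        self._zone_keys = []
        try:
            yield self
        finally:
            # Free everything that was first interned inside the zone.
            for text in self._zone_keys:
                del self._ids[text]
            self._zone_keys = None
```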

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.
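
One way such a general mechanism could be composed is with
contextlib.ExitStack: a top-level memory_zone() enters the zone of every
component that offers one and leaves them all together. This is a sketch
under assumptions; ToyComponent and ToyPipeline are hypothetical names and
do not reflect how spaCy's Language class is written.

```python
from contextlib import ExitStack, contextmanager
from typing import Iterable, Iterator, List


class ToyComponent:
    """Hypothetical component that accumulates scratch data while running."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.scratch: List[str] = []

    @contextmanager
    def memory_zone(self) -> Iterator[None]:
        start = len(self.scratch)
        try:
            yield
        finally:
            # Drop whatever was appended while the zone was active.
            del self.scratch[start:]


class ToyPipeline:
    """Hypothetical pipeline that fans a memory zone out to its components."""

    def __init__(self, components: Iterable[object]) -> None:
        self.components = list(components)

    @contextmanager
    def memory_zone(self) -> Iterator[None]:
        with ExitStack() as stack:
            for component in self.components:
                zone = getattr(component, "memory_zone", None)
                if zone is not None:
                    # Components without zone support are simply skipped.
                    stack.enter_context(zone())
            yield
```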

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just put a limit on the cache size. This lets spaCy
make better use of the cache's efficiency advantage, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
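
A size-limited cache can be sketched as follows. The commit message does
not say which eviction policy the tokenizer uses, so the
least-recently-used policy here is an assumption, and BoundedTokenCache,
its max_size parameter, and the wrapped tokenize callable are all
hypothetical names for illustration.

```python
from collections import OrderedDict
from typing import Callable, List


class BoundedTokenCache:
    """Hypothetical LRU cache around a chunk-level tokenize function."""

    def __init__(self, tokenize: Callable[[str], List[str]], max_size: int = 10000) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._tokenize = tokenize
        self._max_size = max_size
        self._cache: "OrderedDict[str, List[str]]" = OrderedDict()

    def __call__(self, chunk: str) -> List[str]:
        if chunk in self._cache:
            # Cache hit: mark the entry as most recently used.
            self._cache.move_to_end(chunk)
            return self._cache[chunk]
        tokens = self._tokenize(chunk)
        self._cache[chunk] = tokens
        if len(self._cache) > self._max_size:
            # Evict the least recently used entry to stay within the bound.
            self._cache.popitem(last=False)
        return tokens

    def __len__(self) -> int:
        return len(self._cache)
```

Because the bound is independent of how documents are batched, the cache
stays warm across many small batches while its memory use stays capped.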
2024-09-08 13:06:54 +02:00
cli Update cli.package for removed spacy.vectors.name attr 2024-09-01 16:43:49 +02:00
displacy Fix displacy span stacking (#13068) 2023-11-02 12:02:18 +01:00
kb Modify EL batching to doc-wise streaming approach (#12367) 2024-04-09 11:39:18 +02:00
lang Fix Spanish lemmatizer 2024-09-04 14:29:34 +02:00
matcher Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
ml fix the fix for textcat init functionality 2024-05-14 18:45:51 +02:00
pipeline Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
tests Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
tokens Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
training Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Revert "Load the cli module lazily for spacy.info (#12962)" 2023-10-04 12:33:33 +02:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Update about 2024-09-07 00:47:09 +02:00
attrs.pxd merge fixes (2) 2023-07-19 16:38:37 +02:00
attrs.pyx Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
compat.py No need for Literal compat, since we only support >= 3.8 2023-12-21 09:47:38 +01:00
default_config_distillation.cfg Add the configuration schema for distillation (#12201) 2023-01-31 13:06:02 +01:00
default_config_pretraining.cfg Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
default_config.cfg Support registered vectors (#12492) 2023-08-01 15:46:08 +02:00
errors.py Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
glossary.py isort all the things 2023-06-26 11:41:03 +02:00
language.py Fix dump meta 2024-09-07 00:46:48 +02:00
lexeme.pxd isort all the things 2023-06-26 11:41:03 +02:00
lexeme.pyi isort all the things 2023-06-26 11:41:03 +02:00
lexeme.pyx Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
lookups.py Recommend lookups tables from URLs or other loaders (#12283) 2023-07-31 15:54:35 +02:00
morphology.pxd ci: add cython linter (#12694) 2023-07-19 12:03:31 +02:00
morphology.pyx Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
parts_of_speech.pxd cython fixes and cleanup 2023-07-19 17:41:29 +02:00
parts_of_speech.pyx Add profile=False to currently unprofiled cython 2023-09-28 17:09:41 +02:00
pipe_analysis.py isort all the things 2023-06-26 11:41:03 +02:00
py.typed Add py.typed 2021-03-16 09:48:31 +01:00
schemas.py Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
scorer.py Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
strings.pxd Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
strings.pyi isort all the things 2023-06-26 11:41:03 +02:00
strings.pyx Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
structs.pxd Merge branch 'upstream_master' into sync_v4 2023-07-19 16:37:31 +02:00
symbols.pxd Merge branch 'upstream_master' into sync_v4 2023-07-19 16:37:31 +02:00
symbols.pyx Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
tokenizer.pxd Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
tokenizer.pyx Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
ty.py isort all the things 2023-06-26 11:41:03 +02:00
typedefs.pxd isort all the things 2023-06-26 11:41:03 +02:00
typedefs.pyx Add profile=False to currently unprofiled cython 2023-09-28 17:09:41 +02:00
util.py Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
vectors.pyx Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
vocab.pxd Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
vocab.pyi Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00
vocab.pyx Support 'memory zones' for user memory management 2024-09-08 13:06:54 +02:00