Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.
Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone has expired.
The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.
Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.
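To illustrate the idea, here is a minimal sketch of a memory zone on a toy
string table. It is not spaCy's implementation: `ToyStringStore` and its
attributes are hypothetical names, and the only point is that entries added
inside the block are kept apart and dropped when the block exits.

```python
from contextlib import contextmanager


class ToyStringStore:
    """Toy interning table (hypothetical; not spaCy's StringStore)."""

    def __init__(self):
        self._permanent = set()
        self._transient = None  # a set while a zone is active, else None

    def add(self, string):
        # Strings that are already permanent stay permanent.
        if string in self._permanent:
            return string
        if self._transient is not None:
            # Inside a zone: remember the string only until the zone ends.
            self._transient.add(string)
        else:
            self._permanent.add(string)
        return string

    def __contains__(self, string):
        if string in self._permanent:
            return True
        return self._transient is not None and string in self._transient

    def __len__(self):
        extra = len(self._transient) if self._transient is not None else 0
        return len(self._permanent) + extra

    @contextmanager
    def memory_zone(self):
        if self._transient is not None:
            raise RuntimeError("this sketch does not support nested zones")
        self._transient = set()
        try:
            yield self
        finally:
            # Free everything that was allocated inside the block.
            self._transient = None


store = ToyStringStore()
store.add("apple")
with store.memory_zone():
    store.add("apple")   # already permanent, not duplicated
    store.add("banana")  # transient, lives only inside the zone
    inside = len(store)
after = len(store)
```

In this sketch, anything that still refers to "banana" after the block is
holding on to data the store no longer knows about, which mirrors why Doc
objects created inside a memory zone must not be used after it ends.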
The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.
I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
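A size-limited cache can be sketched as follows. This is only an
illustration under assumptions: `BoundedCache`, `max_size` and the
least-recently-used eviction policy are hypothetical choices made for the
example, and the policy spaCy applies once the limit is reached may differ.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy cache with a hard size limit (hypothetical; LRU eviction)."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Mark the entry as recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            # Drop the least recently used entry to stay within the limit.
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)


def tokenize(text, cache):
    """Whitespace tokenizer that reuses cached results when available."""
    tokens = cache.get(text)
    if tokens is None:
        tokens = text.split()
        cache.set(text, tokens)
    return tokens


cache = BoundedCache(max_size=2)
for chunk in ["a b", "c d", "a b", "e f"]:
    tokenize(chunk, cache)
```

The cache never grows past `max_size`, so it can stay alive across batches
without having to be tied to a memory zone.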
* fix type annotation in docs
* only restore entities after loss calculation
* restore entities of sample in initialization
* rename overfitting function
* fix EL scorer
* Relax test
* fix formatting
* Update spacy/pipeline/entity_linker.py
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* rename to _ensure_ents
* further rename
* allow for scorer to be None
---------
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Add distill subcommand
This subcommand distills a student model from a teacher model.
* Fixes from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Type and doc fixes
* Wording
* distill: document missing `-o`
* Wording
* Small fix
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
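One way to guard against this class of problem is to resolve the requested
name against the release URL and reject anything that ends up outside it.
The sketch below shows that check under assumptions: `BASE_URL` and
`resolve_direct_download` are hypothetical names, and this is not
necessarily the check used in the actual fix.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the model releases download URL.
BASE_URL = "https://example.invalid/models/releases/download"


def resolve_direct_download(filename, base_url=BASE_URL):
    """Return the download URL for filename, or raise if it escapes base_url."""
    prefix = base_url.rstrip("/") + "/"
    # urljoin resolves "../" segments and absolute URLs the way a client would.
    url = urljoin(prefix, filename)
    if not url.startswith(prefix):
        raise ValueError(
            f"Download from {filename!r} rejected: it resolves outside {prefix}"
        )
    return url


ok_url = resolve_direct_download("pkg-1.0.0/pkg-1.0.0-py3-none-any.whl")

rejected = []
for bad in ["../../../evil/pkg.whl", "https://evil.invalid/pkg.whl"]:
    try:
        resolve_direct_download(bad)
    except ValueError:
        rejected.append(bad)
```

Both the relative path and the absolute URL are rejected because, after
resolution, neither starts with the release prefix.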
* Remove debug data normalization for span analysis
As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.
* Update spacy/cli/debug_data.py
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* TextCatParametricAttention.v1: set key transform dimensions
This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).
* Add lazily-initialized tok2vec to simulate transformers
Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.
Fix some additional issues found using this test.
* isort
* Add `test.` prefix to `LazyInitTok2Vec.v1`
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.
Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless AirPlay receiver
functionality is disabled like in CI).
Before this change, the workers of a pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.
With this change, we stop using terminate and take the following approach
instead:
* When all the documents are processed, the parent process puts a
sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
let them finish up gracefully.
* Worker processes break from the queue processing loop when the
sentinel is encountered, so that they exit.
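The three steps above can be sketched as follows. To keep the example small
and portable, it uses threads and `queue.Queue` as a stand-in for worker
processes and their queues; the shutdown protocol is the same, but
`_SENTINEL`, `worker` and `run` are hypothetical names and this is not the
code from the change itself.

```python
import queue
import threading

# Hypothetical end-of-work marker. None is used because it would also
# survive being sent to a separate process.
_SENTINEL = None


def worker(in_q, out_q):
    """Process items until the sentinel arrives, then exit the loop."""
    while True:
        item = in_q.get()
        if item is _SENTINEL:
            break
        out_q.put(item.upper())


def run(texts, n_workers=2):
    in_qs = [queue.Queue() for _ in range(n_workers)]
    out_q = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(in_q, out_q)) for in_q in in_qs
    ]
    for w in workers:
        w.start()
    # Distribute the work round-robin over the per-worker queues.
    for i, text in enumerate(texts):
        in_qs[i % n_workers].put(text)
    # All work is queued: put one sentinel in the queue of each worker.
    for in_q in in_qs:
        in_q.put(_SENTINEL)
    # Join instead of terminating, so each worker finishes gracefully.
    for w in workers:
        w.join()
    results = []
    while not out_q.empty():
        results.append(out_q.get())
    return sorted(results)


results = run(["a", "b", "c"])
```

Because each queue is FIFO, a worker only sees the sentinel after it has
handled all work queued before it, which is why the error case described
next needs separate handling.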
We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:
* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
- If the worker was waiting for work, it will encounter the sentinel
and break from the processing loop.
- If the worker was processing a batch, it will attempt to write
results to the channel. This will fail because the channel was
closed by the parent, and the worker will then break from the
processing loop.