* Add distill subcommand
This subcommand distills a student model from a teacher model.
* Fixes from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Type and doc fixes
* Wording
* distill: document missing `-o`
* Wording
* Small fix
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove debug data normalization for span analysis
As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.
* Update spacy/cli/debug_data.py
---------
Co-authored-by: svlandeg <svlandeg@github.com>
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.
Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
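The setup/teardown pattern described above can be sketched as follows. This is a minimal illustration, not the actual test code: `ExtensionRegistry` is a hypothetical stand-in for a class with global extension state (in spaCy, the corresponding calls are `Doc.set_extension` and `Doc.remove_extension`), and the generator would be decorated with `@pytest.fixture` in a real test suite.

```python
from typing import Dict, Iterator


class ExtensionRegistry:
    """Hypothetical stand-in for a class that keeps extensions in global state."""

    _extensions: Dict[str, object] = {}

    @classmethod
    def set_extension(cls, name: str, default: object) -> None:
        cls._extensions[name] = default

    @classmethod
    def has_extension(cls, name: str) -> bool:
        return name in cls._extensions

    @classmethod
    def remove_extension(cls, name: str) -> None:
        cls._extensions.pop(name, None)


def unpicklable_extension() -> Iterator[str]:
    # In a real suite this generator would carry the @pytest.fixture decorator.
    name = "_test_unpicklable"
    # Setup: register an extension whose default (a lambda) cannot be pickled.
    ExtensionRegistry.set_extension(name, default=lambda: None)
    yield name
    # Teardown: runs after the test body, so the extension does not leak
    # into tests that happen to run later.
    ExtensionRegistry.remove_extension(name)
```

Because the teardown removes the extension, the outcome of later tests no longer depends on the order in which pytest runs them.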
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless the AirPlay receiver
functionality is disabled, as it is in CI).
Before this change, the workers of a `pipe` call with `n_process != 1` were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.
With this change, we stop using terminate and take the following approach
instead:
* When all the documents are processed, the parent process puts a
sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
let them finish up gracefully.
* Worker processes break from the queue processing loop when the
sentinel is encountered, so that they exit.
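The sentinel-and-join approach above can be sketched as follows. This is a minimal sketch under stated assumptions, not the spaCy implementation: threads and `queue.Queue` stand in for worker processes and their queues so that the example stays self-contained, and `_SENTINEL`, `worker`, and `run` are hypothetical names.

```python
import queue
import threading
from typing import List

_SENTINEL = None  # hypothetical end-of-work marker


def worker(inbox: "queue.Queue", outbox: "queue.Queue") -> None:
    while True:
        batch = inbox.get()
        if batch is _SENTINEL:
            # No more work: leave the loop so that the worker exits by itself.
            break
        outbox.put([text.upper() for text in batch])


def run(batches: List[List[str]], n_workers: int = 2) -> List[List[str]]:
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox: "queue.Queue" = queue.Queue()
    workers = [threading.Thread(target=worker, args=(q, outbox)) for q in inboxes]
    for w in workers:
        w.start()
    # Distribute the batches round-robin over the per-worker queues.
    for i, batch in enumerate(batches):
        inboxes[i % n_workers].put(batch)
    # Collect one result per batch (arrival order is not guaranteed).
    results = [outbox.get() for _ in batches]
    # Everything is processed: one sentinel per worker, then join each worker.
    for q in inboxes:
        q.put(_SENTINEL)
    for w in workers:
        w.join()
    return results
```

Since each worker leaves its loop on its own, the queues are never abandoned in the middle of an operation, which is the risk that `terminate` carries.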
We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:
* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent, and the worker will break from the processing
    loop.
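The error path above can be sketched as follows. This is a hedged illustration, not the spaCy implementation: a thread stands in for the worker process, `multiprocessing.Pipe(duplex=False)` stands in for the result channel, and `worker` and `shut_down_after_error` are hypothetical names. To keep the outcome deterministic, the sketch closes the reading end before the worker starts; in a real run the worker would already be busy.

```python
import multiprocessing
import queue
import threading
from typing import List, Tuple

_SENTINEL = None  # hypothetical end-of-work marker


def worker(inbox: "queue.Queue", send_end) -> None:
    while True:
        batch = inbox.get()
        if batch is _SENTINEL:
            break  # the worker was idle and encountered the sentinel
        try:
            send_end.send([text.upper() for text in batch])
        except OSError:
            # The parent closed the reading end, so the write fails and the
            # worker stops without touching the remaining queued work.
            break
    send_end.close()


def shut_down_after_error(batches: List[List[str]]) -> Tuple[bool, int]:
    inbox: "queue.Queue" = queue.Queue()
    # With duplex=False the first end only receives and the second only sends.
    recv_end, send_end = multiprocessing.Pipe(duplex=False)
    for batch in batches:
        inbox.put(batch)
    # Error path: the sentinel ends up behind the work that is still queued ...
    inbox.put(_SENTINEL)
    # ... so the parent also closes the reading end of the channel.
    recv_end.close()
    t = threading.Thread(target=worker, args=(inbox, send_end))
    t.start()
    t.join(timeout=5)
    # Report whether the worker is still running and how much was left queued.
    return t.is_alive(), inbox.qsize()
```

In this sketch the worker takes the first batch, fails to write its result, and exits, leaving the other batches and the sentinel unread in the queue.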
* Add spacy.TextCatParametricAttention.v1
This layer is a simplification of the ensemble classifier that
only uses parametric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a CPU-only kernel.
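As a rough illustration of the general idea behind parametric attention pooling (not the spaCy or Thinc implementation), the sketch below scores each token vector against a query vector, normalizes the scores with a softmax, and returns the weighted sum. The function name and the plain-list representation are assumptions made for this example; in a trained model the query vector would be a learned parameter.

```python
import math
from typing import List

Vector = List[float]


def parametric_attention_pool(tokens: List[Vector], query: Vector) -> Vector:
    # Score each token vector by its dot product with the query vector.
    scores = [sum(t * q for t, q in zip(tok, query)) for tok in tokens]
    # Softmax over the tokens of one document (shifted for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the token vectors gives one vector per document.
    width = len(query)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(width)]
```

With an all-zero query every token gets the same weight, so the result is the plain mean of the token vectors; a query that aligns with one token shifts nearly all of the weight onto that token.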
* Fix merge fallout
* Add TextCatReduce.v1
This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.
This model is a generalization of the TextCatCNN model, which only
supports mean reductions. The name TextCatCNN is also a bit of a
misnomer, because the model can be used with transformers as well. This
change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1
layer.
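The pooling step described above can be sketched as follows. This is a minimal sketch, not the TextCatReduce.v1 code: the function name, the `use_first`/`use_max`/`use_mean` flags, and the order in which the reductions are concatenated are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def reduce_and_concat(
    tokens: List[Vector],
    use_first: bool = False,
    use_max: bool = False,
    use_mean: bool = False,
) -> Vector:
    if not (use_first or use_max or use_mean):
        raise ValueError("enable at least one reduction")
    width = len(tokens[0])
    pooled: Vector = []
    if use_first:
        # First reduction: the vector of the first token.
        pooled.extend(tokens[0])
    if use_max:
        # Max reduction: element-wise maximum over all tokens.
        pooled.extend(max(tok[i] for tok in tokens) for i in range(width))
    if use_mean:
        # Mean reduction: element-wise average over all tokens.
        pooled.extend(sum(tok[i] for tok in tokens) / len(tokens) for i in range(width))
    # The concatenated vector is what a classification layer would receive.
    return pooled
```

Enabling only the mean reduction corresponds to the pooling that the text attributes to TextCatCNN; each additional reduction increases the width of the vector passed to the classifier by the token vector width.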
* Doc fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence
* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy
* Add back a test for TextCatCNN.v2
* Replace TextCatCNN in pipe configurations and templates
* Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor
* Add last reduction (`use_reduce_last`)
* Remove non-working TextCatCNN Netlify redirect
* Revert layer changes for the quickstart
* Revert one more quickstart change
* Remove unused import
* Fix docstring
* Fix setting name in error message
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update README.md to include links for GPU processing, LLM, and spaCy's blog.
* Create ojo4f3.md
* Corrected README to the most current version with links to GPU processing, LLMs, and the spaCy blog.
* Delete .github/contributors/ojo4f3.md
* Changed LLM icon
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Apply suggestions from code review
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>