* Add spacy.TextCatParametricAttention.v1
This layer is a simplification of the ensemble classifier that
only uses parametric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a CPU-only kernel.
* Fix merge fallout
* Add TextCatReduce.v1
This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.
This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer.
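
The pooling described above can be sketched in plain Python. This is a
minimal illustration, not the spaCy implementation: `pool_tokens` and its
`use_first`/`use_max`/`use_mean` flags are hypothetical names, and the
concatenation order (first, max, mean) is an assumption.

```python
from typing import List, Sequence

Vector = List[float]


def pool_tokens(
    token_vectors: Sequence[Vector],
    use_first: bool = False,
    use_max: bool = False,
    use_mean: bool = True,
) -> Vector:
    """Pool per-token vectors into one vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    if not (use_first or use_max or use_mean):
        raise ValueError("enable at least one reduction")
    pooled: Vector = []
    if use_first:
        # Vector of the first token.
        pooled.extend(token_vectors[0])
    if use_max:
        # Element-wise maximum over all tokens.
        pooled.extend(max(column) for column in zip(*token_vectors))
    if use_mean:
        # Element-wise mean over all tokens.
        n = len(token_vectors)
        pooled.extend(sum(column) / n for column in zip(*token_vectors))
    # Enabled reductions are concatenated (order here is an assumption).
    return pooled


doc = [[1.0, 4.0], [3.0, 0.0]]  # two tokens, width 2
mean_only = pool_tokens(doc)
all_three = pool_tokens(doc, use_first=True, use_max=True, use_mean=True)
```

With only the mean reduction enabled, the pooled vector keeps the tok2vec
width, which is the mean-only pooling attributed to TextCatCNN above. With
all three enabled, the classifier input is three times as wide.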
* Doc fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence
* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy
* Add back a test for TextCatCNN.v2
* Replace TextCatCNN in pipe configurations and templates
* Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor
* Add last reduction (`use_reduce_last`)
* Remove non-working TextCatCNN Netlify redirect
* Revert layer changes for the quickstart
* Revert one more quickstart change
* Remove unused import
* Fix docstring
* Fix setting name in error message
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update README.md to include links for GPU processing, LLM, and spaCy's blog.
* Create ojo4f3.md
* corrected README to most current version with links to GPU processing, LLMs, and the spaCy blog.
* Delete .github/contributors/ojo4f3.md
* changed LLM icon
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Apply suggestions from code review
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update `TextCatBOW` to use the fixed `SparseLinear` layer
A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754
This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.
While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.
* Replace TextCatBOW `length_exponent` parameter with `length`
We now round up the length to the next power of two if it isn't
a power of two.
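
The rounding rule can be sketched as follows. This is an illustration
only: `round_up_to_power_of_two` is a hypothetical helper, not the
function used in `spacy.TextCatBOW.v3`, and the sample lengths are made
up for the example.

```python
def round_up_to_power_of_two(length: int) -> int:
    """Return the smallest power of two >= length (hypothetical helper)."""
    if length < 1:
        raise ValueError("length must be positive")
    # (length - 1).bit_length() is the exponent of the next power of two;
    # exact powers of two are returned unchanged.
    return 1 << (length - 1).bit_length()


for requested in (262144, 300000):
    print(requested, "->", round_up_to_power_of_two(requested))
# 262144 -> 262144
# 300000 -> 524288
```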
* Remove some tests for TextCatBOW.v2
* Fix missing import
* add language extensions for Norwegian Nynorsk and Faroese
* update docstring for nn/examples.py
* use relative imports
* add fo and nn tokenizers to pytest fixtures
* add unittests for fo and nn and fix bug in nn
* remove module docstring from fo/__init__.py
* add comments about example sentences' origin
* add license information to faroese data credit
* format unittests using black
* add __init__ files to tests/lang/nn and tests/lang/fo
* fix import order and use relative imports in fo/__init__.py and nn/__init__.py
* Make the tests a bit more compact
* Add fo and nn to website languages
* Add note about jul.
* Add "jul." as exception
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update the "Missing factory" error message
This accounts for model installations that took place during the current Python session.
* Add a note about Jupyter notebooks
* Move error to `spacy.cli.download`
Add extra message for Jupyter sessions
* Add additional note for interactive sessions
* Remove note about `spacy-transformers` from error message
* `isort`
* Improve checks for colab (also helps displacy)
* Update warning messages
* Improve flow for multiple checks
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Tokenizer.explain for special cases with whitespace
Update `Tokenizer.explain` to skip special case matches if the exact
text has not been matched due to intervening whitespace.
Enable fuzzy `Tokenizer.explain` tests with additional whitespace
normalization.
* Add unit test for special cases with whitespace, xfail fuzzy tests again
* Fix displacy span stacking.
* Format. Remove counter.
* Remove test files.
* Add unit test. Refactor to allow for unit test.
* Fix off-by-one error in tests.
* Add note on score_weight if using a non-default span_key for SpanCat.
* Fix formatting.
* Fix formatting.
* Fix typo.
* Use warning infobox.
* Fix infobox formatting.