* Clarify error when words are of wrong type
See #9437
* Update docs
* Use try/except
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add section for spacy.cli.train.train
* Add link from training page to train function
* Ensure path in train helper
* Update docs
Co-authored-by: Ines Montani <ines@ines.io>
* Add micro PRF for morph scoring
For pipelines where morph features are added by more than one component
and a reference training corpus may not contain all features, a micro
PRF score is more flexible than a simple accuracy score. An example is
the reading and inflection features added by the Japanese tokenizer.
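The difference between the two scores can be sketched in plain Python.
This is a minimal illustration, not the `Scorer` implementation: the
function name `micro_prf`, the feature strings, and the per-token
set representation are all assumptions made for the example.

```python
from typing import Dict, List, Set


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro-averaged precision/recall/F over per-token feature sets.

    Hypothetical helper for illustration only.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # features present in both
        fp += len(p - g)  # predicted but not in the reference
        fn += len(g - p)  # in the reference but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return {"p": precision, "r": recall, "f": f}


# The reference lacks the "Reading" feature that a second component adds.
gold = [{"Case=Nom", "Number=Sing"}, {"Tense=Past"}]
pred = [{"Case=Nom", "Number=Sing", "Reading=neko"}, {"Tense=Past"}]

scores = micro_prf(gold, pred)
# Exact-match accuracy counts the whole first token as wrong.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

With exact-match accuracy, one extra feature makes the entire token
count as an error, while the micro PRF only loses some precision.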
* Use `morph_micro_f` as the default morph score for Japanese
morphologizers.
* Update docstring
* Fix typo in docstring
* Update Scorer API docs
* Fix results type
* Organize score list by attribute prefix
* Update for python 3.10
* Update mac image
* Update build constraints for python 3.10
* Add extras for cupy cuda 11.3-11.5
* Remove cupy-cuda115 extra
* Require thinc>=8.0.12
* Switch CI to windows-2019
* Skip mypy for python 3.10
So that the install/upgrade quickstart also upgrades
`spacy-transformers` with `pip install spacy[transformers]`, require
`spacy-transformers>=1.1.2` in the `transformers` extra.
* Add support for fasttext-bloom hash-only vectors
Overview:
* Extend `Vectors` to have two modes: `default` and `ngram`
* `default` is the default mode and equivalent to the current
`Vectors`
* `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
for `default` vectors
* Extend `spacy init vectors` to support ngram tables
The `ngram` mode **only** supports vector tables produced by the
`fasttext-bloom` fork of fastText
(https://github.com/adrianeboyd/fastText/tree/feature/bloom), which adds
an option to represent all vectors using only the ngram buckets table
and which uses the exact same ngram generation algorithm and hash
function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an
additional `.hashvec` table, which can be loaded by
`spacy init vectors --fasttext-bloom-vectors`.
Implementation details:
* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
the API can stay consistent for both `default` (which can look up from
`str` or `int`) and `ngram` (which requires `str` to calculate the
ngrams).
* In ngram mode `Vectors` uses a default `Vectors` object as a cache
since ngram vector lookups are relatively expensive.
* The default cache size is the same size as the provided ngram vector
table.
* Once the cache is full, no more entries are added. The user is
responsible for managing the cache in cases where the initial
documents are not representative of the texts.
* The cache can be resized by setting `Vectors.ngram_cache_size` or
cleared with `vectors._ngram_cache.clear()`.
* The API ends up a bit split between methods for `default` and for
`ngram`, so functions that only make sense for `default` or `ngram`
include warnings with custom messages suggesting alternatives where
possible.
* `Vocab.vectors` becomes a property so that the string stores can be
synced when assigning vectors to a vocab.
* `Vectors` serializes its own config settings as `vectors.cfg`.
* The `Vectors` serialization methods have added support for `exclude`
so that the `Vocab` can exclude the `Vectors` strings while serializing.
Removed:
* The `minn` and `maxn` options and related code from
`Vocab.get_vector`, which do not work in a meaningful way for default
vector tables.
* The unused `GlobalRegistry` in `Vectors`.
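The idea behind a hash-only ngram table can be sketched as follows.
This is a rough illustration under several assumptions, not the
`Vectors` implementation: the `<`/`>` boundary markers, the single hash
per ngram, the `minn`/`maxn` defaults, and all function names are
hypothetical, and `hashlib.blake2b` is only a stand-in for
`MurmurHash3_x64_128`, so the rows it selects will not match a real
table.

```python
import hashlib
from typing import List


def ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return the marked word plus its character ngrams.

    The "<" and ">" boundary markers are an assumption for this sketch.
    """
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i : i + n])
    return grams


def bucket(ngram: str, n_rows: int) -> int:
    # Stand-in hash: NOT MurmurHash3_x64_128, illustration only.
    digest = hashlib.blake2b(ngram.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_rows


def ngram_vector(
    word: str, table: List[List[float]], minn: int = 2, maxn: int = 3
) -> List[float]:
    """Average the table rows selected by hashing each ngram."""
    rows = [table[bucket(g, len(table))] for g in ngrams(word, minn, maxn)]
    width = len(table[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(width)]


# Toy table with 16 buckets and 2 dimensions.
table = [[float(i), float(i) * 2.0] for i in range(16)]
vec = ngram_vector("cat", table)
unseen = ngram_vector("zzzqqq", table)  # any string gets a vector
```

Because the rows are derived from the characters of the word, the
lookup needs the `str` form rather than an `int` key, which is why the
string store has to be available to the vectors in this mode.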
* Refactor to use reduce_mean
Refactor to use reduce_mean and remove the ngram vectors cache.
* Rename to floret
* Rename to floret in error messages
* Use --vectors-mode in CLI, vector init
* Fix vectors mode in init
* Remove unused var
* Minor API and docstrings adjustments
* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
both modes.
* Minor updates to Vectors docstrings.
* Update API docs for Vectors and init vectors CLI
* Update types for StaticVectors
* Ignore prefix in suffix matches
Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due to the presence of the prefix in the token string.
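The effect can be reproduced with the standard `re` module. The two
patterns and the helper `find_suffix` below are made up for
illustration and are not the tokenizer's actual rules or code.

```python
import re

# Hypothetical prefix and suffix rules for this sketch.
prefix_re = re.compile(r"^°")
suffix_re = re.compile(r"(?<=°[cfkCFK])\.$")


def find_suffix(token: str, ignore_prefix: bool):
    """Search for a suffix, optionally skipping the matched prefix."""
    start = 0
    if ignore_prefix:
        m = prefix_re.match(token)
        start = m.end() if m else 0
    return suffix_re.search(token[start:])


# Searching the full string: the lookbehind sees the prefix "°".
naive = find_suffix("°C.", ignore_prefix=False)
# Searching only the text after the prefix: the lookbehind cannot match.
fixed = find_suffix("°C.", ignore_prefix=True)
# If "°" is not at the start, there is no prefix to skip.
inner = find_suffix("20°C.", ignore_prefix=True)
```

Here `naive` matches the final `.` only because the prefix is still in
the string, `fixed` is `None`, and `inner` still matches because `°` is
not a prefix of `20°C.`.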
* Move °[cfkCFK]. to a tokenizer exception
* Adjust exceptions for same tokenization as v3.1
* Also update test accordingly
* Continue to split . after °CFK if ° is not a prefix
* Exclude new ° exceptions for pl
* Switch back to default tokenization of "° C ."
* Revert "Exclude new ° exceptions for pl"
This reverts commit 952013a5b4.
* Add exceptions for °C for hu
* Raise an error when multiprocessing is used on a GPU
As reported in #5507, a confusing exception is thrown when
multiprocessing is used with a GPU model and the `fork` multiprocessing
start method:
    cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error
This change checks whether one of the models uses the GPU when
multiprocessing is used. If so, raise a friendly error message.
Even though multiprocessing can work on a GPU with the `spawn` method,
it quickly exhausts GPU memory on real-world data. Also,
multiprocessing on a single GPU typically does not provide large
performance gains.
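A check of this kind can be sketched with the standard `warnings`
module. The names `FakePipe` and `check_multiprocessing` and the message
text are hypothetical, and how a real pipe reports its device is not
shown here. The sketch warns instead of raising, in line with the later
"Warn rather than error" commit in this list.

```python
import warnings
from typing import Iterable, List


class FakePipe:
    """Hypothetical stand-in for a pipeline component."""

    def __init__(self, name: str, on_gpu: bool) -> None:
        self.name = name
        self.on_gpu = on_gpu


def check_multiprocessing(pipes: Iterable[FakePipe], n_process: int) -> List[str]:
    """Warn if more than one process is requested with GPU pipes."""
    gpu_pipes = [p.name for p in pipes if p.on_gpu]
    if n_process != 1 and gpu_pipes:
        warnings.warn(
            "Multiprocessing with GPU models is not recommended "
            f"(GPU pipes: {', '.join(gpu_pipes)}).",
            UserWarning,
        )
    return gpu_pipes


# Record warnings so the two cases can be compared.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_multiprocessing([FakePipe("tagger", on_gpu=False)], n_process=2)
    n_cpu_warnings = len(caught)
    check_multiprocessing([FakePipe("ner", on_gpu=True)], n_process=2)
    n_gpu_warnings = len(caught) - n_cpu_warnings
```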
* Move GPU multiprocessing check to Language.pipe
* Warn rather than error when using multiprocessing with GPU models
* Improve GPU multiprocessing warning message.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reduce API assumptions
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/language.py
* Update spacy/language.py
* Test that warning is thrown with GPU + multiprocessing
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add custom protocols in spacy.ty
* add a test for the new types in spacy.ty
* import Example when type checking
* some type fixes
* put Protocol in compat
* revert update check back to hasattr
* runtime_checkable in compat as well
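As a rough sketch of what a structural type for pipeline components can
look like, the example below uses `typing.Protocol` directly, assuming
Python 3.8+. The protocol name `TrainableLike`, its `update` signature,
and the two toy classes are hypothetical and are not the definitions in
`spacy.ty`.

```python
from typing import Any, Dict, Iterable, Protocol, runtime_checkable


@runtime_checkable
class TrainableLike(Protocol):
    """Hypothetical protocol: any object with an `update` method."""

    def update(
        self, examples: Iterable[Any], *, drop: float = 0.0
    ) -> Dict[str, float]:
        ...


class Tagger:
    # Satisfies the protocol without inheriting from it.
    def update(self, examples, *, drop=0.0):
        return {"tagger": 0.0}


class Sentencizer:
    # No `update` method, so it does not satisfy the protocol.
    def __call__(self, doc):
        return doc


is_trainable = isinstance(Tagger(), TrainableLike)
is_rule_based = not isinstance(Sentencizer(), TrainableLike)
# A runtime isinstance check only looks at method presence,
# which is also what a plain hasattr check does.
has_update = hasattr(Tagger(), "update")
```

A static type checker can additionally verify the method signature,
which neither `isinstance` nor `hasattr` does at runtime.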
* Replace use_ops("numpy") with use_ops("cpu") in the parser
This ensures that the best available CPU implementation is chosen
(e.g. Thinc Apple Ops on macOS).
* Run spaCy tests with apple-thinc-ops on macOS
* Remove some old version refs in the docs
* Remove warning
* Update spacy/matcher/matcher.pyx
* Remove all references to the punctuation warning
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add the spacy.models_with_nvtx_range.v1 callback
This callback recursively adds NVTX ranges to the Models in each pipe in
a pipeline.
* Fix create_models_with_nvtx_range type signature
* NVTX range: wrap models of all trainable pipes jointly
This prevents (sub-)models that are shared between pipes from being
wrapped twice.
* NVTX range callback: make color configurable
Add forward_color and backprop_color options to set the color for the
NVTX range.
* Move create_models_with_nvtx_range to spacy.ml
* Update create_models_with_nvtx_range for thinc changes
with_nvtx_range now updates an existing node, rather than returning a
wrapper node. So, we can simply walk over the nodes and update them.
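The joint walk can be sketched without Thinc. `Node`, `walk`, and
`mark_all` are hypothetical stand-ins, and incrementing `wrap_count`
takes the place of updating a node with an NVTX range; the point is
only that nodes are collected across all pipes first, so a shared
sub-model is updated once.

```python
from typing import Dict, Iterator, List, Optional


class Node:
    """Hypothetical stand-in for a model node with child layers."""

    def __init__(self, name: str, layers: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.layers = list(layers or [])
        self.wrap_count = 0


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants."""
    yield node
    for child in node.layers:
        yield from walk(child)


def mark_all(pipe_models: List[Node]) -> int:
    """Update every distinct node reachable from any pipe exactly once."""
    seen: Dict[int, Node] = {}
    for model in pipe_models:
        for node in walk(model):
            seen.setdefault(id(node), node)  # de-duplicate shared nodes
    for node in seen.values():
        node.wrap_count += 1  # stand-in for the in-place update
    return len(seen)


shared = Node("tok2vec")
tagger = Node("tagger", [shared, Node("softmax")])
parser = Node("parser", [shared, Node("transition")])
n_marked = mark_all([tagger, parser])
```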
* NVTX: use after_pipeline_creation in example
* Add note about how the model name is used
* Add link to TransformersModel docs, separate paragraph
* Local link
* Revise docs
* Update website/docs/usage/embeddings-transformers.md
* Update website/docs/usage/embeddings-transformers.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>