* Add section for spacy.cli.train.train
* Add link from training page to train function
* Ensure path in train helper
* Update docs
Co-authored-by: Ines Montani <ines@ines.io>
* Add support for fasttext-bloom hash-only vectors
Overview:
* Extend `Vectors` to have two modes: `default` and `ngram`
* `default` is the default mode and equivalent to the current
`Vectors`
* `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
for `default` vectors
* Extend `spacy init vectors` to support ngram tables
The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.
https://github.com/adrianeboyd/fastText/tree/feature/bloom
Implementation details:
* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
the API can stay consistent for both `default` (which can look up from
`str` or `int`) and `ngram` (which requires `str` to calculate the
ngrams).
* In ngram mode `Vectors` uses a default `Vectors` object as a cache
since the ngram vectors lookups are relatively expensive.
* The default cache size is the same size as the provided ngram vector
table.
* Once the cache is full, no more entries are added. The user is
responsible for managing the cache in cases where the initial
documents are not representative of the texts.
* The cache can be resized by setting `Vectors.ngram_cache_size` or
cleared with `vectors._ngram_cache.clear()`.
* The API ends up a bit split between methods for `default` and for
`ngram`, so functions that only make sense for `default` or `ngram`
include warnings with custom messages suggesting alternatives where
possible.
* `Vocab.vectors` becomes a property so that the string stores can be
synced when assigning vectors to a vocab.
* `Vectors` serializes its own config settings as `vectors.cfg`.
* The `Vectors` serialization methods have added support for `exclude`
so that the `Vocab` can exclude the `Vectors` strings while serializing.
Removed:
* The `minn` and `maxn` options and related code from
`Vocab.get_vector`, which does not work in a meaningful way for default
vector tables.
* The unused `GlobalRegistry` in `Vectors`.
* Refactor to use reduce_mean
Refactor to use reduce_mean and remove the ngram vectors cache.
* Rename to floret
* Rename to floret in error messages
* Use --vectors-mode in CLI, vector init
* Fix vectors mode in init
* Remove unused var
* Minor API and docstrings adjustments
* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
both modes.
* Minor updates to Vectors docstrings.
* Update API docs for Vectors and init vectors CLI
* Update types for StaticVectors
* use language-matching to allow language code aliases
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* link to "IETF language tags" in docs
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* Make requirements consistent
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* change "two-letter language ID" to "IETF language tag" in language docs
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* use langcodes 3.2 and handle language-tag errors better
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* all unknown language codes are ImportErrors
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>