spaCy/website/docs/api
Adriane Boyd c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by the
`fasttext-bloom` fork of fastText linked below. The fork adds an option
to represent all vectors using only the ngram buckets table, and spaCy
uses the exact same ngram generation algorithm and hash function
(`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional
`.hashvec` table, which can be loaded by
`spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom
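To illustrate why `ngram` mode needs the `str` form of a key, here is a
minimal sketch of mapping a word to rows of a hash-only table. It is not
the spaCy or fastText implementation: `char_ngrams` and `bucket_row` are
hypothetical helpers, the `<`/`>` boundary markers and the ngram lengths
are assumptions, and `hashlib.blake2b` is only a stand-in for
`MurmurHash3_x64_128`, so the rows will not match a real table.

```python
import hashlib
from typing import List


def char_ngrams(word: str, minn: int, maxn: int,
                bow: str = "<", eow: str = ">") -> List[str]:
    """Return character ngrams of length minn..maxn from the padded word."""
    padded = f"{bow}{word}{eow}"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


def bucket_row(ngram: str, n_buckets: int) -> int:
    """Map an ngram to a table row (stand-in hash, NOT MurmurHash3)."""
    digest = hashlib.blake2b(ngram.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets


def rows_for_word(word: str, n_buckets: int,
                  minn: int = 2, maxn: int = 3) -> List[int]:
    """Every string, seen or unseen, resolves to some rows of the table."""
    return [bucket_row(g, n_buckets) for g in char_ngrams(word, minn, maxn)]


print(char_ngrams("cat", 2, 3))
print(rows_for_word("cat", n_buckets=1000))
```

Because the rows are derived from the characters of the word, an
integer hash of the word alone is not enough to look up a vector.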

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In `ngram` mode, `Vectors` uses a `default`-mode `Vectors` object as a
  cache, since ngram vector lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.
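The cache policy described in the implementation details above (fill
until full, then stop adding entries) can be sketched as follows. This
is only an illustration of the policy, not the code from this PR:
`FillOnceCache` and its methods are hypothetical names, and a later
commit in this same PR removes the cache entirely.

```python
from typing import Callable, Dict, List


class FillOnceCache:
    """Cache that stops accepting new entries once max_size is reached."""

    def __init__(self, compute: Callable[[str], List[float]], max_size: int):
        self.compute = compute      # the expensive lookup, e.g. an ngram lookup
        self.max_size = max_size
        self._store: Dict[str, List[float]] = {}

    def get(self, key: str) -> List[float]:
        if key in self._store:
            return self._store[key]
        value = self.compute(key)
        # Once the cache is full, later keys are computed but never stored.
        if len(self._store) < self.max_size:
            self._store[key] = value
        return value

    def clear(self) -> None:
        """Empty the cache so it can refill from more representative texts."""
        self._store.clear()


calls = []


def slow_lookup(key: str) -> List[float]:
    calls.append(key)               # record each expensive computation
    return [float(len(key))]


cache = FillOnceCache(slow_lookup, max_size=2)
for word in ["a", "bb", "a", "ccc", "ccc"]:
    cache.get(word)
print(calls)
```

With this policy, keys that arrive after the cache fills are recomputed
on every request, which is why unrepresentative initial documents hurt.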

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which do not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.
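As a rough picture of mean pooling over variable-length groups of rows,
here is a minimal pure-Python sketch. It assumes that each key
contributes a run of table rows and that the key's vector is their
mean. `ragged_mean` is a hypothetical helper and does not reproduce the
actual `reduce_mean` implementation.

```python
from typing import List, Sequence


def ragged_mean(rows: Sequence[Sequence[float]],
                lengths: Sequence[int]) -> List[List[float]]:
    """Average consecutive groups of rows; lengths gives each group's size."""
    width = len(rows[0]) if rows else 0
    pooled = []
    start = 0
    for length in lengths:
        segment = rows[start:start + length]
        if length == 0:
            # An empty group yields a zero vector instead of dividing by zero.
            pooled.append([0.0] * width)
        else:
            pooled.append([sum(col) / length for col in zip(*segment)])
        start += length
    return pooled


# Three ngram rows: the first key owns two of them, the second key owns one.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(ragged_mean(rows, lengths=[2, 1]))  # [[2.0, 3.0], [5.0, 6.0]]
```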

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
architectures.md Minor updates to spacy-transformers docs for v1.1.0 (#9496) 2021-10-18 14:55:02 +02:00
attributeruler.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
cli.md Add support for floret vectors (#8909) 2021-10-27 14:08:31 +02:00
corpus.md Integrate file readers 2020-10-02 01:36:06 +02:00
cython-classes.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython-structs.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython.md Update docs [ci skip] 2020-09-12 17:05:10 +02:00
data-formats.md Add notes on preparing training data to docs (#8964) 2021-08-16 17:37:21 +02:00
dependencymatcher.md doc fixes 2020-09-12 17:38:54 +02:00
dependencyparser.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
doc.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
docbin.md Fix point typo on docbin docs (#9097) 2021-08-31 10:55:44 +02:00
entitylinker.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
entityrecognizer.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
entityruler.md Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
example.md Extend score_spans for overlapping & non-labeled spans (#7209) 2021-04-08 12:19:17 +02:00
index.md Update v3 docs 2020-07-03 16:48:21 +02:00
kb.md Tidy up docs 2021-06-28 12:08:15 +02:00
language.md Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
legacy.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
lemmatizer.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
lexeme.md fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
lookups.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
matcher.md Support list values and INTERSECTS in Matcher (#8784) 2021-08-02 19:39:26 +02:00
morphologizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
morphology.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
phrasematcher.md 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
pipe.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
pipeline-functions.md Remove transformers model max length section (#6807) 2021-01-25 19:59:34 +08:00
scorer.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
sentencerecognizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
sentencizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
span.md fix docs for Span constructor arguments (#9023) 2021-08-25 16:06:22 +02:00
spancategorizer.md Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
spangroup.md Warn and document spangroup.doc weakref (#8980) 2021-08-20 11:06:19 +02:00
stringstore.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
tagger.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
textcategorizer.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
tok2vec.md Tidy up docs 2021-06-28 12:08:15 +02:00
token.md Fix UD POS docs links (fix #9013) (#9407) 2021-10-11 11:51:19 +02:00
tokenizer.md Tidy up docs 2021-06-28 12:08:15 +02:00
top-level.md Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
transformer.md Update docs for spacy-transformers v1.1 data classes (#9361) 2021-10-18 14:16:58 +02:00
vectors.md Add support for floret vectors (#8909) 2021-10-27 14:08:31 +02:00
vocab.md 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00