spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-10 02:31:16 +03:00

Adriane Boyd c053f158c5 Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00

* Add support for fasttext-bloom hash-only vectors

  Overview:

  * Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
  * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors
  * Extend `spacy init vectors` to support ngram tables

  The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`.

  https://github.com/adrianeboyd/fastText/tree/feature/bloom

  Implementation details:

  * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams).
  * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive.
  * The default cache size is the same size as the provided ngram vector table.
  * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts.
  * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`.
  * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible.
  * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab.
  * `Vectors` serializes its own config settings as `vectors.cfg`.
  * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing.

  Removed:

  * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables.
  * The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

  Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret
* Rename to floret in error messages
* Use --vectors-mode in CLI, vector init
* Fix vectors mode in init
* Remove unused var
* Minor API and docstrings adjustments

  * Rename `--vectors-mode` to `--mode` in `init vectors` CLI
  * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes.
  * Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI
* Update types for StaticVectors
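The hash-only lookup described in the commit message above can be illustrated with a small standalone sketch. This is not spaCy's or floret's implementation: the function names `char_ngrams`, `bucket` and `ngram_vector` are hypothetical, the n-gram generation is simplified, and `hashlib.blake2b` is only a stdlib stand-in for the `MurmurHash3_x64_128` hash the commit names. Averaging the selected rows is an assumption loosely based on the "Refactor to use reduce_mean" step.

```python
import hashlib
from typing import List


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return the character n-grams of `word` wrapped in < > boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket(ngram: str, n_buckets: int) -> int:
    """Map an n-gram to a row index of the bucket table.

    Assumption: blake2b replaces MurmurHash3_x64_128 to keep the sketch stdlib-only.
    """
    digest = hashlib.blake2b(ngram.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets


def ngram_vector(
    word: str, table: List[List[float]], minn: int = 3, maxn: int = 5
) -> List[float]:
    """Build a word vector as the mean of the table rows hit by its n-grams."""
    width = len(table[0])
    rows = [table[bucket(g, len(table))] for g in char_ngrams(word, minn, maxn)]
    if not rows:
        # No n-grams (e.g. empty string): fall back to a zero vector.
        return [0.0] * width
    return [sum(row[d] for row in rows) / len(rows) for d in range(width)]


# Toy bucket table with 16 rows of width 2 (made-up values).
table = [[float(i), float(i) * 0.5] for i in range(16)]
vec = ngram_vector("spacy", table)
```

Because the vector is computed from the string itself, any word gets a vector without a per-word entry in the table, which is why this mode needs `str` keys rather than integer hashes.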
cli	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
displacy	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
lang	Rename ja morph features to Inflection and Reading (#9520)	2021-10-27 13:13:03 +02:00
matcher	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
ml	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
pipeline	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
tests	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
tokens	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
training	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Tidy up and auto-format	2021-07-18 15:44:56 +10:00
__main__.py	Tidy up	2020-06-22 00:45:40 +02:00
about.py	bump version to 3.1.4 (#9524)	2021-10-21 20:34:57 +02:00
attrs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
attrs.pyx	Update Cython string types (#9143)	2021-09-13 17:02:17 +02:00
compat.py	Custom component types in spacy.ty (#9469)	2021-10-21 15:31:06 +02:00
default_config_pretraining.cfg	Add new parameter for saving every n epoch in pretraining (#8912)	2021-08-12 11:14:48 +02:00
default_config.cfg	Add training option to set annotations on update (#7767)	2021-04-26 16:53:53 +02:00
errors.py	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
glossary.py	Add glossary entry for _SP (#8983)	2021-08-20 12:04:02 +02:00
kb.pxd	Replace cpdef variables with cdef (#7834)	2021-04-26 16:54:02 +02:00
kb.pyx	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
language.py	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
lexeme.pxd	Fix Lexeme.from_ptr	2020-08-10 16:43:37 +02:00
lexeme.pyi	Add stub files for main cython classes (#8427)	2021-08-07 12:30:03 +02:00
lexeme.pyx	Update Cython string types (#9143)	2021-09-13 17:02:17 +02:00
lookups.py	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
morphology.pxd	Clean up Morphology imports and definitions (#7441)	2021-04-26 16:54:23 +02:00
morphology.pyx	Clean up Morphology imports and definitions (#7441)	2021-04-26 16:54:23 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
pipe_analysis.py	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
py.typed	Add py.typed	2021-03-16 09:48:31 +01:00
schemas.py	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
scorer.py	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
strings.pxd	Update Cython string types (#9143)	2021-09-13 17:02:17 +02:00
strings.pyi	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
strings.pyx	Update Cython string types (#9143)	2021-09-13 17:02:17 +02:00
structs.pxd	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)	2021-01-14 17:30:41 +11:00
symbols.pxd	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
symbols.pyx	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
tokenizer.pxd	Remove two attributes marked for removal in 3.1 (#9150)	2021-09-15 23:07:21 +02:00
tokenizer.pyx	Ignore prefix in suffix matches (#9155)	2021-10-27 13:02:25 +02:00
ty.py	Custom component types in spacy.ty (#9469)	2021-10-21 15:31:06 +02:00
typedefs.pxd	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master	2020-11-25 11:49:34 +01:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
vectors.pyx	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
vocab.pxd	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00
vocab.pyi	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
vocab.pyx	Add support for floret vectors (#8909)	2021-10-27 14:08:31 +02:00