# spaCy/spacy/vocab.pxd

from libcpp.vector cimport vector
from preshed.maps cimport PreshMap
from cymem.cymem cimport Pool
from murmurhash.mrmr cimport hash64

from .structs cimport LexemeC, TokenC
from .typedefs cimport attr_t, hash_t
from .strings cimport StringStore
from .morphology cimport Morphology


cdef LexemeC EMPTY_LEXEME


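# A cached entry points at either an array of lexeme pointers or an array of
# tokens; only one member is valid at a time.
cdef union LexemesOrTokens:
    const LexemeC* const* lexemes
    const TokenC* tokens


# Tagged wrapper around the union: is_lex says which member of data is valid,
# and length is the number of entries it points to.
cdef struct _Cached:
    LexemesOrTokens data
    bint is_lex
    int length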


cdef class Vocab:
    cdef Pool mem  # memory pool for the vocab's C-level allocations
    cdef readonly StringStore strings  # maps strings to hash IDs and back
    cdef public Morphology morphology
    cdef public object _vectors
    cdef public object _lookups
    cdef public object writing_system
    cdef public object get_noun_chunks
    cdef readonly int length  # number of lexemes stored
    cdef public object lex_attr_getters
    cdef public object cfg

    # Lexeme lookup by string or by orth ID. A NULL return signals an exception.
    cdef const LexemeC* get(self, str string) except NULL
    cdef const LexemeC* get_by_orth(self, attr_t orth) except NULL
    cdef const TokenC* make_fused_token(self, substrings) except NULL

    # Internal helpers for creating a lexeme and registering it under a key.
    cdef const LexemeC* _new_lexeme(self, str string) except NULL
    cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1

    cdef PreshMap _by_orth  # lexemes indexed by orth ID