spaCy/spacy/vocab.pxd


from cymem.cymem cimport Pool
from libcpp.vector cimport vector
from murmurhash.mrmr cimport hash64
from preshed.maps cimport PreshMap

from .morphology cimport Morphology
from .strings cimport StringStore
from .structs cimport LexemeC, TokenC
from .typedefs cimport attr_t, hash_t


cdef LexemeC EMPTY_LEXEME


cdef union LexemesOrTokens:
    const LexemeC* const* lexemes
    const TokenC* tokens


# A cached entry holding either borrowed lexeme pointers or fully
# constructed tokens, distinguished by `is_lex`.
cdef struct _Cached:
    LexemesOrTokens data
    bint is_lex
    int length


cdef class Vocab:
    cdef Pool mem
    cdef readonly StringStore strings
    cdef public Morphology morphology
    cdef public object _vectors
    cdef public object _lookups
    cdef public object writing_system
    cdef public object get_noun_chunks
    cdef readonly int length
    cdef public object lex_attr_getters
    cdef public object cfg

    cdef const LexemeC* get(self, str string) except NULL
    cdef const LexemeC* get_by_orth(self, attr_t orth) except NULL
    cdef const TokenC* make_fused_token(self, substrings) except NULL
    cdef const LexemeC* _new_lexeme(self, str string) except NULL
    cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex, bint is_transient) except -1

    # Map from orth id to its LexemeC*
    cdef PreshMap _by_orth
    # Memory-zone bookkeeping: allocations that must outlive any memory zone
    # go into _non_temp_mem, and the orths of transient lexemes are recorded
    # so they can be released when the zone exits.
    cdef Pool _non_temp_mem
    cdef vector[attr_t] _transient_orths
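
# Illustrative sketch only (not part of this header): how the transient
# bookkeeping above is exercised from the public API via the
# `nlp.memory_zone()` context manager. Lexemes and interned strings created
# while the zone is open are treated as transient and released when it
# exits, and Doc objects created inside the zone must not be accessed
# afterwards.
#
#     import spacy
#
#     nlp = spacy.blank("en")
#     with nlp.memory_zone():
#         doc = nlp("throwaway text processed inside the zone")
#         n_tokens = len(doc)   # fine: `doc` is used while the zone is open
#     # after the zone exits, `doc` is invalid and transient lexemes are freed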