spaCy/spacy/vocab.pxd

from libcpp.vector cimport vector
from preshed.maps cimport PreshMap
from cymem.cymem cimport Pool
from murmurhash.mrmr cimport hash64

from .structs cimport LexemeC, TokenC
from .typedefs cimport attr_t, hash_t
from .strings cimport StringStore
from .morphology cimport Morphology


cdef LexemeC EMPTY_LEXEME


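# Cached data is held either as an array of lexeme pointers or as an
# array of tokens; only one member of the union is valid at a time.
cdef union LexemesOrTokens:
    const LexemeC* const* lexemes
    const TokenC* tokens


# A cache entry: `is_lex` records which union member of `data` is in use,
# and `length` is the number of elements it holds.
cdef struct _Cached:
    LexemesOrTokens data
    bint is_lex
    int length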


cdef class Vocab:
    cdef Pool mem
    cdef readonly StringStore strings
    cdef public Morphology morphology
    cdef public object _vectors
    cdef public object _lookups
    cdef public object writing_system
    cdef public object get_noun_chunks
    cdef readonly int length
    cdef public object lex_attr_getters
    cdef public object cfg

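    # Lexeme lookup by string or by orth ID. With `except NULL`, a NULL
    # return value signals that a Python exception was raised.
    cdef const LexemeC* get(self, Pool mem, str string) except NULL
    cdef const LexemeC* get_by_orth(self, Pool mem, attr_t orth) except NULL
    cdef const TokenC* make_fused_token(self, substrings) except NULL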

    # Internal helpers. With `except -1`, a return value of -1 signals
    # that a Python exception was raised.
    cdef const LexemeC* _new_lexeme(self, Pool mem, str string) except NULL
    cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1

    # Lexemes indexed by orth ID.
    cdef PreshMap _by_orth