spaCy/spacy/lexeme.pxd

from numpy cimport ndarray

from .attrs cimport (
    ID,
    LANG,
    LENGTH,
    LOWER,
    NORM,
    ORTH,
    PREFIX,
    SHAPE,
    SUFFIX,
    attr_id_t,
)
from .structs cimport LexemeC
from .typedefs cimport attr_t, flags_t, hash_t, len_t, tag_t
from .vocab cimport Vocab


cdef LexemeC EMPTY_LEXEME
cdef attr_t OOV_RANK

cdef class Lexeme:
    cdef LexemeC* c
    cdef readonly Vocab vocab
    cdef readonly attr_t orth

    @staticmethod
    cdef inline Lexeme from_ptr(LexemeC* lex, Vocab vocab):
        cdef Lexeme self = Lexeme.__new__(Lexeme, vocab, lex.orth)
        self.c = lex
        self.vocab = vocab
        self.orth = lex.orth
        return self

    @staticmethod
    cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
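        # IDs below the bit width of flags_t are boolean flags in lex.flags.
        # ORTH, LENGTH and any unrecognized ID are silently ignored here.
        if name < (sizeof(flags_t) * 8):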
            Lexeme.c_set_flag(lex, name, value)
        elif name == ID:
            lex.id = value
        elif name == LOWER:
            lex.lower = value
        elif name == NORM:
            lex.norm = value
        elif name == SHAPE:
            lex.shape = value
        elif name == PREFIX:
            lex.prefix = value
        elif name == SUFFIX:
            lex.suffix = value
        elif name == LANG:
            lex.lang = value

    @staticmethod
    cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) nogil:
        # IDs below the bit width of flags_t are boolean flags; report 0 or 1.
        if feat_name < (sizeof(flags_t) * 8):
            if Lexeme.c_check_flag(lex, feat_name):
                return 1
            else:
                return 0
        elif feat_name == ID:
            return lex.id
        elif feat_name == ORTH:
            return lex.orth
        elif feat_name == LOWER:
            return lex.lower
        elif feat_name == NORM:
            return lex.norm
        elif feat_name == SHAPE:
            return lex.shape
        elif feat_name == PREFIX:
            return lex.prefix
        elif feat_name == SUFFIX:
            return lex.suffix
        elif feat_name == LENGTH:
            return lex.length
        elif feat_name == LANG:
            return lex.lang
        else:
            # Unrecognized attribute IDs fall back to 0.
            return 0

    @staticmethod
    cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) nogil:
        # Use a flags_t-wide 1 so the shift happens at the bitfield's full width.
        cdef flags_t one = 1
        if lexeme.flags & (one << flag_id):
            return True
        else:
            return False

    @staticmethod
    cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) nogil:
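        # Use a flags_t-wide 1 so the shift happens at the bitfield's full width.
        cdef flags_t one = 1
        if value:
            # Set the bit at position flag_id.
            lex.flags |= one << flag_id
        else:
            # Clear the bit at position flag_id.
            lex.flags &= ~(one << flag_id)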