spaCy/spacy/lexeme.pxd

from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t
from .attrs cimport attr_id_t
from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG

from .structs cimport LexemeC
from .strings cimport StringStore
from .vocab cimport Vocab

from numpy cimport ndarray


cdef LexemeC EMPTY_LEXEME
cdef attr_t OOV_RANK

cdef class Lexeme:
    cdef LexemeC* c
    cdef readonly Vocab vocab
    cdef readonly attr_t orth

    @staticmethod
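    cdef inline Lexeme from_ptr(LexemeC* lex, Vocab vocab, int vector_length):
        # Wrap an existing LexemeC struct in a Lexeme object; the struct is
        # referenced by pointer, not copied.
        cdef Lexeme self = Lexeme.__new__(Lexeme, vocab, lex.orth)
        self.c = lex
        self.vocab = vocab
        self.orth = lex.orth
        # Without an explicit return, the function would hand back None.
        return self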

    @staticmethod
    cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
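        # Attribute IDs below the bit width of flags_t are boolean flags,
        # stored as single bits in lex.flags.
        if name < (sizeof(flags_t) * 8):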
            Lexeme.c_set_flag(lex, name, value)
        elif name == ID:
            lex.id = value
        elif name == LOWER:
            lex.lower = value
        elif name == NORM:
            lex.norm = value
        elif name == SHAPE:
            lex.shape = value
        elif name == PREFIX:
            lex.prefix = value
        elif name == SUFFIX:
            lex.suffix = value
        elif name == LANG:
            lex.lang = value

    @staticmethod
    cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) nogil:
        if feat_name < (sizeof(flags_t) * 8):
            if Lexeme.c_check_flag(lex, feat_name):
                return 1
            else:
                return 0
        elif feat_name == ID:
            return lex.id
        elif feat_name == ORTH:
            return lex.orth
        elif feat_name == LOWER:
            return lex.lower
        elif feat_name == NORM:
            return lex.norm
        elif feat_name == SHAPE:
            return lex.shape
        elif feat_name == PREFIX:
            return lex.prefix
        elif feat_name == SUFFIX:
            return lex.suffix
        elif feat_name == LENGTH:
            return lex.length
        elif feat_name == LANG:
            return lex.lang
        else:
            return 0

    @staticmethod
    cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) nogil:
        # Type the constant as flags_t so the shift is done at the full
        # width of the flags field.
        cdef flags_t one = 1
        if lexeme.flags & (one << flag_id):
            return True
        else:
            return False

    @staticmethod
    cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) nogil:
        cdef flags_t one = 1
        if value:
            lex.flags |= one << flag_id
        else:
            lex.flags &= ~(one << flag_id)
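
The flag helpers above pack one boolean per bit of `lex.flags`. A minimal pure-Python sketch of the same bit manipulation follows. It assumes `flags_t` is a 64-bit unsigned integer (the listing only shows `sizeof(flags_t) * 8`). The names `FLAG_BITS`, `check_flag` and `set_flag` are hypothetical stand-ins for illustration, not part of spaCy's API.

```python
# Assumption: flags_t is 64 bits wide; the listing only uses sizeof(flags_t) * 8.
FLAG_BITS = 64
MASK = (1 << FLAG_BITS) - 1  # keeps results inside the unsigned 64-bit range


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if bit `flag_id` is set, mirroring c_check_flag."""
    return bool(flags & (1 << flag_id))


def set_flag(flags: int, flag_id: int, value: bool) -> int:
    """Return `flags` with bit `flag_id` set or cleared, mirroring c_set_flag.

    Python integers are immutable, so the new value is returned
    instead of being written through a pointer.
    """
    if value:
        return (flags | (1 << flag_id)) & MASK
    return flags & ~(1 << flag_id) & MASK


flags = 0
flags = set_flag(flags, 3, True)    # turn bit 3 on
flags = set_flag(flags, 5, True)    # turn bit 5 on
flags = set_flag(flags, 3, False)   # turn bit 3 off again
print(check_flag(flags, 3), check_flag(flags, 5))
```

The range test in `set_struct_attr` and `get_struct_attr` (`name < sizeof(flags_t) * 8`) corresponds to `flag_id < FLAG_BITS` here: only IDs in that range are treated as flag bits, and higher IDs select struct fields such as `lower` or `norm`.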