spaCy/spacy/attrs.pxd

from . cimport symbols


# Attribute IDs. The first 64 values (0-63) are reserved for boolean flag
# features; the remaining attributes start at 64 (ID).
cdef enum attr_id_t:
    NULL_ATTR
    IS_ALPHA
    IS_ASCII
    IS_DIGIT
    IS_LOWER
    IS_PUNCT
    IS_SPACE
    IS_TITLE
    IS_UPPER
    LIKE_URL
    LIKE_NUM
    LIKE_EMAIL
    IS_STOP
    IS_OOV_DEPRECATED
    IS_BRACKET
    IS_QUOTE
    IS_LEFT_PUNCT
    IS_RIGHT_PUNCT
    IS_CURRENCY

    # Remaining reserved flag slots (19-63). The explicit value pins the numbering.
    FLAG19 = 19
    FLAG20
    FLAG21
    FLAG22
    FLAG23
    FLAG24
    FLAG25
    FLAG26
    FLAG27
    FLAG28
    FLAG29
    FLAG30
    FLAG31
    FLAG32
    FLAG33
    FLAG34
    FLAG35
    FLAG36
    FLAG37
    FLAG38
    FLAG39
    FLAG40
    FLAG41
    FLAG42
    FLAG43
    FLAG44
    FLAG45
    FLAG46
    FLAG47
    FLAG48
    FLAG49
    FLAG50
    FLAG51
    FLAG52
    FLAG53
    FLAG54
    FLAG55
    FLAG56
    FLAG57
    FLAG58
    FLAG59
    FLAG60
    FLAG61
    FLAG62
    FLAG63

    ID
    ORTH
    LOWER
    NORM
    SHAPE
    PREFIX
    SUFFIX

    LENGTH
    CLUSTER
    LEMMA
    POS
    TAG
    DEP
    ENT_IOB
    ENT_TYPE
    HEAD
    SENT_START
    SPACY
    PROB

    LANG
    ENT_KB_ID = symbols.ENT_KB_ID
    MORPH
    ENT_ID = symbols.ENT_ID

    IDX
    SENT_END