spaCy/spacy/symbols.pxd

cdef enum symbol_t:
NIL
IS_ALPHA
IS_ASCII
IS_DIGIT
IS_LOWER
IS_PUNCT
IS_SPACE
IS_TITLE
IS_UPPER
LIKE_URL
LIKE_NUM
LIKE_EMAIL
IS_STOP
IS_OOV_DEPRECATED
IS_BRACKET
IS_QUOTE
IS_LEFT_PUNCT
IS_RIGHT_PUNCT
IS_CURRENCY
FLAG19 = 19
FLAG20
FLAG21
FLAG22
FLAG23
FLAG24
FLAG25
FLAG26
FLAG27
FLAG28
FLAG29
FLAG30
FLAG31
FLAG32
FLAG33
FLAG34
FLAG35
FLAG36
FLAG37
FLAG38
FLAG39
FLAG40
FLAG41
FLAG42
FLAG43
FLAG44
FLAG45
FLAG46
FLAG47
FLAG48
FLAG49
FLAG50
FLAG51
FLAG52
FLAG53
FLAG54
FLAG55
FLAG56
FLAG57
FLAG58
FLAG59
FLAG60
FLAG61
FLAG62
FLAG63
ID
ORTH
LOWER
NORM
SHAPE
PREFIX
SUFFIX
LENGTH
CLUSTER
LEMMA
POS
TAG
DEP
ENT_IOB
ENT_TYPE
HEAD
SENT_START
SPACY
PROB
LANG
ADJ
ADP
ADV
AUX
CONJ
CCONJ # U20
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
X
EOL
SPACE
DEPRECATED001
DEPRECATED002
DEPRECATED003
DEPRECATED004
DEPRECATED005
DEPRECATED006
DEPRECATED007
DEPRECATED008
DEPRECATED009
DEPRECATED010
DEPRECATED011
DEPRECATED012
DEPRECATED013
DEPRECATED014
DEPRECATED015
DEPRECATED016
DEPRECATED017
DEPRECATED018
DEPRECATED019
DEPRECATED020
DEPRECATED021
DEPRECATED022
DEPRECATED023
DEPRECATED024
DEPRECATED025
DEPRECATED026
DEPRECATED027
DEPRECATED028
DEPRECATED029
DEPRECATED030
DEPRECATED031
DEPRECATED032
DEPRECATED033
DEPRECATED034
DEPRECATED035
DEPRECATED036
DEPRECATED037
DEPRECATED038
DEPRECATED039
DEPRECATED040
DEPRECATED041
DEPRECATED042
DEPRECATED043
DEPRECATED044
DEPRECATED045
DEPRECATED046
DEPRECATED047
DEPRECATED048
DEPRECATED049
DEPRECATED050
DEPRECATED051
DEPRECATED052
DEPRECATED053
DEPRECATED054
DEPRECATED055
DEPRECATED056
DEPRECATED057
DEPRECATED058
DEPRECATED059
DEPRECATED060
DEPRECATED061
DEPRECATED062
DEPRECATED063
DEPRECATED064
DEPRECATED065
DEPRECATED066
DEPRECATED067
DEPRECATED068
DEPRECATED069
DEPRECATED070
DEPRECATED071
DEPRECATED072
DEPRECATED073
DEPRECATED074
DEPRECATED075
DEPRECATED076
DEPRECATED077
DEPRECATED078
DEPRECATED079
DEPRECATED080
DEPRECATED081
DEPRECATED082
DEPRECATED083
DEPRECATED084
DEPRECATED085
DEPRECATED086
DEPRECATED087
DEPRECATED088
DEPRECATED089
DEPRECATED090
DEPRECATED091
DEPRECATED092
DEPRECATED093
DEPRECATED094
DEPRECATED095
DEPRECATED096
DEPRECATED097
DEPRECATED098
DEPRECATED099
DEPRECATED100
DEPRECATED101
DEPRECATED102
DEPRECATED103
DEPRECATED104
DEPRECATED105
DEPRECATED106
DEPRECATED107
DEPRECATED108
DEPRECATED109
DEPRECATED110
DEPRECATED111
DEPRECATED112
DEPRECATED113
DEPRECATED114
DEPRECATED115
DEPRECATED116
DEPRECATED117
DEPRECATED118
DEPRECATED119
DEPRECATED120
DEPRECATED121
DEPRECATED122
DEPRECATED123
DEPRECATED124
DEPRECATED125
DEPRECATED126
DEPRECATED127
DEPRECATED128
DEPRECATED129
DEPRECATED130
DEPRECATED131
DEPRECATED132
DEPRECATED133
DEPRECATED134
DEPRECATED135
DEPRECATED136
DEPRECATED137
DEPRECATED138
DEPRECATED139
DEPRECATED140
DEPRECATED141
DEPRECATED142
DEPRECATED143
DEPRECATED144
DEPRECATED145
DEPRECATED146
DEPRECATED147
DEPRECATED148
DEPRECATED149
DEPRECATED150
DEPRECATED151
DEPRECATED152
DEPRECATED153
DEPRECATED154
DEPRECATED155
DEPRECATED156
DEPRECATED157
DEPRECATED158
DEPRECATED159
DEPRECATED160
DEPRECATED161
DEPRECATED162
DEPRECATED163
DEPRECATED164
DEPRECATED165
DEPRECATED166
DEPRECATED167
DEPRECATED168
DEPRECATED169
DEPRECATED170
DEPRECATED171
DEPRECATED172
DEPRECATED173
DEPRECATED174
DEPRECATED175
DEPRECATED176
DEPRECATED177
DEPRECATED178
DEPRECATED179
DEPRECATED180
DEPRECATED181
DEPRECATED182
DEPRECATED183
DEPRECATED184
DEPRECATED185
DEPRECATED186
DEPRECATED187
DEPRECATED188
DEPRECATED189
DEPRECATED190
DEPRECATED191
DEPRECATED192
DEPRECATED193
DEPRECATED194
DEPRECATED195
DEPRECATED196
DEPRECATED197
DEPRECATED198
DEPRECATED199
DEPRECATED200
DEPRECATED201
DEPRECATED202
DEPRECATED203
DEPRECATED204
DEPRECATED205
DEPRECATED206
DEPRECATED207
DEPRECATED208
DEPRECATED209
DEPRECATED210
DEPRECATED211
DEPRECATED212
DEPRECATED213
DEPRECATED214
DEPRECATED215
DEPRECATED216
DEPRECATED217
DEPRECATED218
DEPRECATED219
DEPRECATED220
DEPRECATED221
DEPRECATED222
DEPRECATED223
DEPRECATED224
DEPRECATED225
DEPRECATED226
DEPRECATED227
DEPRECATED228
DEPRECATED229
DEPRECATED230
DEPRECATED231
DEPRECATED232
DEPRECATED233
DEPRECATED234
DEPRECATED235
DEPRECATED236
DEPRECATED237
DEPRECATED238
DEPRECATED239
DEPRECATED240
DEPRECATED241
DEPRECATED242
DEPRECATED243
DEPRECATED244
DEPRECATED245
DEPRECATED246
DEPRECATED247
DEPRECATED248
DEPRECATED249
DEPRECATED250
DEPRECATED251
DEPRECATED252
DEPRECATED253
DEPRECATED254
DEPRECATED255
DEPRECATED256
DEPRECATED257
DEPRECATED258
DEPRECATED259
DEPRECATED260
DEPRECATED261
DEPRECATED262
DEPRECATED263
DEPRECATED264
DEPRECATED265
DEPRECATED266
DEPRECATED267
DEPRECATED268
DEPRECATED269
DEPRECATED270
DEPRECATED271
DEPRECATED272
DEPRECATED273
DEPRECATED274
DEPRECATED275
DEPRECATED276
PERSON
NORP
FACILITY
ORG
GPE
LOC
PRODUCT
EVENT
WORK_OF_ART
LANGUAGE
LAW
DATE
TIME
PERCENT
MONEY
QUANTITY
ORDINAL
CARDINAL
acomp
advcl
advmod
agent
amod
appos
attr
aux
auxpass
cc
ccomp
complm
conj
cop # U20
csubj
csubjpass
dep
det
dobj
expl
hmod
hyph
infmod
intj
iobj
mark
meta
neg
nmod
nn
npadvmod
nsubj
nsubjpass
num
number
oprd
obj # U20
obl # U20
parataxis
partmod
pcomp
pobj
poss
possessive
preconj
prep
prt
punct
quantmod
relcl
rcmod
root
xcomp
acl
ENT_KB_ID
MORPH
ENT_ID
IDX
_