mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
a5cd203284
* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
  * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions
  * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups
  * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data)
* Skip English normalization test

  Skip English normalization test because the data is now in `spacy-lookups-data`.
* Remove norm exceptions

  Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily

  Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`.
* Skip creating lexeme cache on load

  To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing.
* Identify numeric values in Lexeme.set_attrs()

  With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`.
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests

  Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP".
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors

  Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

||
---|---|---|
af | ||
ar | ||
bg | ||
bn | ||
ca | ||
cs | ||
da | ||
de | ||
el | ||
en | ||
es | ||
et | ||
eu | ||
fa | ||
fi | ||
fr | ||
ga | ||
gu | ||
he | ||
hi | ||
hr | ||
hu | ||
hy | ||
id | ||
is | ||
it | ||
ja | ||
kn | ||
ko | ||
lb | ||
lij | ||
lt | ||
lv | ||
ml | ||
mr | ||
nb | ||
nl | ||
pl | ||
pt | ||
ro | ||
ru | ||
si | ||
sk | ||
sl | ||
sq | ||
sr | ||
sv | ||
ta | ||
te | ||
th | ||
tl | ||
tr | ||
tt | ||
uk | ||
ur | ||
vi | ||
xx | ||
yo | ||
zh | ||
__init__.py | ||
char_classes.py | ||
lex_attrs.py | ||
norm_exceptions.py | ||
punctuation.py | ||
tag_map.py | ||
tokenizer_exceptions.py |
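
The commit message above describes three related behaviours: norms resolved through a `lexeme_norm` lookups table with a fallback, lexemes created on demand rather than at load time, and `is_oov` derived from whether a vector exists. The following is a minimal sketch of those ideas in plain Python, not spaCy's actual implementation. `make_norm_getter`, `ToyVocab`, and `replace_norms` are hypothetical names introduced for illustration, and the lookup order (language-specific table first, then base exceptions, then lowercasing) is an assumption.

```python
from typing import Callable, Dict, Iterable, List


def make_norm_getter(
    default: Callable[[str], str], tables: Iterable[Dict[str, str]]
) -> Callable[[str], str]:
    """Build a getter that checks each table in order, then falls back to `default`."""
    ordered_tables = list(tables)

    def get_norm(text: str) -> str:
        for table in ordered_tables:
            if text in table:
                return table[text]
        return default(text)

    return get_norm


class ToyVocab:
    """Hypothetical stand-in for a vocabulary; this is not spaCy's `Vocab` class."""

    def __init__(
        self,
        lexeme_norm: Dict[str, str],
        base_exceptions: Dict[str, str],
        vectors: Dict[str, List[float]],
    ) -> None:
        # Norm exceptions live in a named lookups table, not on each lexeme.
        self.lookups = {"lexeme_norm": dict(lexeme_norm)}
        self.vectors = dict(vectors)
        # Assumed priority: language-specific table, base exceptions, lowercase.
        self.get_norm = make_norm_getter(
            str.lower, [self.lookups["lexeme_norm"], base_exceptions]
        )
        # The cache starts empty: lexemes are created on first access.
        self._cache: Dict[str, Dict[str, str]] = {}

    def __getitem__(self, text: str) -> Dict[str, str]:
        if text not in self._cache:
            self._cache[text] = {"orth": text, "norm": self.get_norm(text)}
        return self._cache[text]

    def is_oov(self, text: str) -> bool:
        # Out-of-vocabulary means "no vector", not a stored per-lexeme flag.
        return text not in self.vectors

    def replace_norms(self, new_table: Dict[str, str]) -> None:
        # Update the table in place, then drop cached lexemes so stale norms
        # computed from the old table are not served again.
        table = self.lookups["lexeme_norm"]
        table.clear()
        table.update(new_table)
        self._cache.clear()
```

In this sketch, clearing the cache in `replace_norms` plays the role the commit message assigns to removing cached lexemes on deserialization: without it, a lexeme created before the table changed would keep its old norm.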