* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexeme features (`norm`/`cluster`/`prob`) to
`spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemeC`; get/set/serialize them in
lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
* Always create `Vocab.lookups` table `lexeme_norm` for
normalization exceptions
* Load base exceptions from `lang.norm_exceptions`, but load
language-specific exceptions from lookups
* Set `lex_attr_getters[NORM]` to include the new lookups table in
`BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing the vocab so that the new
normalizations override the existing ones (this replaces the previous
step, which overwrote all lexeme data with the deserialized data)
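
For orientation, here is a minimal sketch of the lookups mechanism these
changes build on, using the public `spacy.lookups.Lookups` API; the example
entries are made up:

```python
from spacy.lookups import Lookups

# Create a lookups container and register a normalization table,
# analogous to the `lexeme_norm` table the vocab now always creates.
lookups = Lookups()
lookups.add_table("lexeme_norm", {"gonna": "going to", "dont": "do not"})

table = lookups.get_table("lexeme_norm")
print(table.get("gonna", "gonna"))   # "going to"
print(table.get("hello", "hello"))   # falls back to the word itself

# Serialization now goes through the lookups, not `vocab/lexemes.bin`.
data = lookups.to_bytes()
restored = Lookups().from_bytes(data)
assert restored.has_table("lexeme_norm")
```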
* Skip English normalization test
Skip English normalization test because the data is now in
`spacy-lookups-data`.
* Remove norm exceptions
Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily
Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.
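
A hedged sketch of how the lazily loaded extra tables might be consulted;
`Vocab.lookups_extra` is named above, but the model name and the
`lexeme_prob` table name here are assumptions:

```python
import spacy

# Assumed model name; requires spacy-lookups-data to be installed so the
# `lg_extra` entry point can provide the extra tables.
nlp = spacy.load("en_core_web_sm")

extra = nlp.vocab.lookups_extra
if extra.has_table("lexeme_prob"):       # table name is an assumption
    probs = extra.get_table("lexeme_prob")
    # Fall back to a low default log-probability for unknown words.
    print(probs.get("the", -20.0))
```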
* Skip creating lexeme cache on load
To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.
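
An illustrative check (not part of the change itself) of the on-demand
behaviour: the vocab no longer contains a prebuilt lexeme cache after
loading and only grows as text is processed; the model name is an
assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model name
print(len(nlp.vocab))                # small: no prebuilt lexeme cache

doc = nlp("The quick brown fox jumps over the lazy dog.")
print(len(nlp.vocab))                # larger: lexemes were created while processing

lex = nlp.vocab["fox"]               # direct access also creates the lexeme on demand
print(lex.norm_)
```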
* Identify numeric values in Lexeme.set_attrs()
With the removal of a special case for `PROB`, also identify `float`
values to avoid trying to convert them with the `StringStore`.
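
Illustrative only, not the actual `Lexeme.set_attrs()` code: a small helper
mirroring the described check, where strings are interned through the
`StringStore` but `float` values (such as a `PROB`) are passed through
untouched:

```python
from spacy.strings import StringStore

def normalize_attr_value(strings: StringStore, value):
    """Mirror the described check (illustrative only).

    Strings are interned via the StringStore; floats (e.g. a PROB value)
    are left alone instead of being passed to the StringStore; other
    values (ints/hashes) pass through unchanged.
    """
    if isinstance(value, float):
        return value
    if isinstance(value, str):
        return strings.add(value)
    return value

strings = StringStore()
print(normalize_attr_value(strings, "blue"))   # hash from the StringStore
print(normalize_attr_value(strings, -8.5))     # float passed through unchanged
```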
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests
Check strings rather than lexemes, since lexemes aren't initialized
automatically, and account for the addition of `"_SP"`.
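
A sketch of the kind of round-trip check described, comparing the string
stores rather than lexemes; the exact assertions in the real tests may
differ:

```python
from spacy.vocab import Vocab

vocab1 = Vocab(strings=["apple", "orange"])
data = vocab1.to_bytes()

vocab2 = Vocab()
vocab2.from_bytes(data)

# Compare strings instead of lexemes, since lexemes are no longer
# initialized automatically; "_SP" may be added to the target vocab.
assert set(vocab1.strings) <= set(vocab2.strings)
```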
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors
Instead of storing `is_oov` as a lexeme flag, derive it from the vectors:
a lexeme is out-of-vocabulary if it has no vector.
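
A quick illustration of the new `is_oov` behaviour, assuming a pipeline
that ships word vectors (the model name is an assumption):

```python
import spacy

nlp = spacy.load("en_core_web_md")   # assumed: a model with word vectors
doc = nlp("apple gfhdjskla")

for token in doc:
    # After this change, is_oov is simply the inverse of has_vector.
    print(token.text, token.has_vector, token.is_oov)
    assert token.is_oov == (not token.has_vector)
```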
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
| Name |
|---|
| af |
| ar |
| bg |
| bn |
| ca |
| cs |
| da |
| de |
| el |
| en |
| es |
| et |
| eu |
| fa |
| fi |
| fr |
| ga |
| gu |
| he |
| hi |
| hr |
| hu |
| hy |
| id |
| is |
| it |
| ja |
| kn |
| ko |
| lb |
| lij |
| lt |
| lv |
| ml |
| mr |
| nb |
| nl |
| pl |
| pt |
| ro |
| ru |
| si |
| sk |
| sl |
| sq |
| sr |
| sv |
| ta |
| te |
| th |
| tl |
| tr |
| tt |
| uk |
| ur |
| vi |
| xx |
| yo |
| zh |
| __init__.py |
| char_classes.py |
| lex_attrs.py |
| norm_exceptions.py |
| punctuation.py |
| tag_map.py |
| tokenizer_exceptions.py |