spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-03 15:33:33 +03:00

Paul O'Leary McCann 7d8df69158 Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00

* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Lookups / Tables now work

  This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved.
* Add lookups to setup.py
* Actually add lookups pyx

  The previous commit added the old py file...
* Lookups work-in-progress
* Move from pyx back to py
* Add string based lookups, fix serialization
* Update tests, language/lemmatizer to work with string lookups

  There are some outstanding issues here:
  - a pickling-related test fails due to the bloom filter
  - some custom lemmatizers (fr/nl at least) have issues

  More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward.
* Change lemmatizer lookup method to pass (orth, string)
* Fix token lookup
* Fix French lookup
* Fix lt lemmatizer test
* Fix Dutch lemmatizer
* Fix lemmatizer lookup test

  This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed.
* Make uk/nl/ru lemmatizer lookup methods consistent

  The mentioned lemmatizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. The way that was called in `token.pyx` was changed, so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id, string)).

  Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should probably be added.

  Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific.
* Make recently added Greek method compatible
* Remove redundant class/method

  Leftovers from a merge not cleaned up adequately.
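
The commit message above describes a table that can be accessed by string or by integer id, is backed by a Bloom filter, and is queried by the lemmatizer with an (orth, string) pair. The following is a rough sketch of that idea only, not spaCy's actual code: `hash_key`, `BloomFilter`, this `Table`, and this `lookup` are hypothetical names written against the Python standard library, and the filter size and hash count are arbitrary assumptions.

```python
import hashlib

MASK_64 = 0xFFFFFFFFFFFFFFFF


def hash_key(key):
    """Map a string to a stable 64-bit integer id; integers pass through."""
    if isinstance(key, int):
        return key
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    """Tiny Bloom filter over integer ids (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key_id):
        # Derive n_hashes bit positions by hashing the id with a seed byte.
        raw = (key_id & MASK_64).to_bytes(8, "little")
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(raw + bytes([seed]), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, key_id):
        for pos in self._positions(key_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key_id):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key_id)
        )


class Table(dict):
    """Dict keyed by integer id that also accepts string keys."""

    def __init__(self, data=None):
        super().__init__()
        self.bloom = BloomFilter()
        for key, value in (data or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        key_id = hash_key(key)
        self.bloom.add(key_id)
        super().__setitem__(key_id, value)

    def __getitem__(self, key):
        return super().__getitem__(hash_key(key))

    def __contains__(self, key):
        key_id = hash_key(key)
        # The filter gives a cheap definite "no"; the dict confirms a "maybe".
        return key_id in self.bloom and super().__contains__(key_id)

    def get(self, key, default=None):
        return self[key] if key in self else default


def lookup(table, orth, string):
    """Return the lemma stored under the id, else the surface string."""
    return table.get(orth, string)


table = Table({"going": "go", "mice": "mouse"})
print(lookup(table, hash_key("mice"), "mice"))
```

Storing every entry under an integer id means a plain dict with string keys would not behave the same way, which is consistent with the test failure noted in the "Fix lemmatizer lookup test" entry.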
af	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
ar	Add writing_system to ArabicDefaults (experimental)	2019-03-11 14:22:23 +01:00
bg	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
bn	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
ca	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
cs	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
da	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
de	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
el	Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00
en	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
es	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
et	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
fa	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
fi	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
fr	Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00
ga	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
he	Auto-format [ci skip]	2019-03-11 17:10:50 +01:00
hi	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
hr	adds Croatian lemma_lookup.json, license file and corresponding tests (#4252)	2019-09-08 13:40:45 +02:00
hu	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
id	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
is	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
it	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
ja	Don't set extension attribute in Japanese (closes #3398)	2019-03-12 13:30:33 +01:00
kn	Enhancing Kannada language Resources (#3755)	2019-05-20 12:56:10 +02:00
ko	Fix ValueError exception on empty Korean text. (#4245)	2019-09-06 10:29:40 +02:00
lt	Update Lithuanian tag map	2019-09-08 20:57:58 +02:00
lv	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
mr	Tidy up [ci skip]	2019-06-12 13:38:23 +02:00
nb	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
nl	Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00
pl	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
pt	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
ro	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
ru	Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00
si	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
sk	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
sl	💫 Add base Language classes for more languages (#3276)	2019-02-15 01:31:19 +11:00
sq	Update languages and examples (see #1107)	2019-06-26 16:19:17 +02:00
sr	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
sv	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
ta	Remove stray print statement (closes #3342)	2019-02-27 15:35:04 +01:00
te	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
th	fix thai bug (#3693)	2019-05-10 14:21:34 +02:00
tl	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
tr	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
tt	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
uk	Bloom-filter backed Lookup Tables (#4268)	2019-09-12 17:26:11 +02:00
ur	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
vi	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
xx	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
zh	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
__init__.py	Remove imports in /lang/__init__.py	2017-05-08 23:58:07 +02:00
char_classes.py	added missing punctuation following conventions. (#4066)	2019-08-04 13:41:18 +02:00
lex_attrs.py	Replacing regex library with re to increase tokenization speed (#3218)	2019-02-01 18:05:22 +11:00
norm_exceptions.py	Update norm_exceptions.py (#3778)	2019-05-27 11:52:52 +02:00
punctuation.py	Allow period as suffix following punctuation (#4248)	2019-09-09 19:19:22 +02:00
tag_map.py	💫 Tidy up and auto-format .py files (#2983)	2018-11-30 17:03:03 +01:00
tokenizer_exceptions.py	Make the emoticon list a raw string (#4139)	2019-08-18 15:17:13 +02:00