Remove corpus-specific tag maps from the language data for languages
without custom tokenizers. For languages with custom word segmenters
that also provide tags (Japanese and Korean), the tag maps for the
custom tokenizers are kept as the default.
The default tag maps for languages without custom tokenizers are now the
default tag map from `lang/tag_map/py`, UPOS -> UPOS.
* Add default to util.get_entry_point
* Tidy up entry points
* Read lookups from entry points
* Remove lookup tables and related tests
* Add lookups install option
* Remove lemmatizer tests
* Remove logic to process language data files
* Update setup.cfg
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Move Turkish lemmas to a json file
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.
This focuses on Turkish specifically because it has the most language
data in core.
* Transition all lemmatizer.py files to json
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.
None of the affected files have logic, so this was pretty
straightforward.
One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
* Move large lang data to json for fr/nb/nl/sv
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.
For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
* Fix id lemmas.json
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
* Add .json.gz to gitignore
This covers the json.gz files built as part of distribution.
* Add language data gzip to build process
Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
* Remove Danish lemmatizer.py
Missed this when I added the json.
* Update to match latest explosion/srsly#9
The way gzipped json is loaded/saved in srsly changed a bit.
* Only compress language data if necessary
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
* Move en/el language data to json
This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.
* Remove empty files in Norwegian tokenizer
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.
This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
* Remove dubious entries in English lookup.json
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
* Fix small issues with en/fr lemmatizers
The en tokenizer was including the removed _nouns.py file, so that's
removed.
The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
* Auto-format
* Auto-format
* Update srsly pin
* Consistently use pathlib paths
* initial LT lang support
* Added more stopwords. Started setting up some basic test environment (not complete)
* Initial morph rules for LT lang
* Closes#1 Adds tokenizer exceptions for Lithuanian
* Closes#5 Punctuation rules. Closes#6 Lexical Attributes
* test: add native examples to basic tests
* feat: add tag map for lt lang
* fix: remove undefined tag attribute 'Definite'
* feat: add lemmatizer for lt lang
* refactor: add new instances to lt lang morph rules; use tags from tag map
* refactor: add morph rules to lt lang defaults
* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup
* refactor: add capitalized words to lt lang lemmatizer
* refactor: add more num words to lt lang lex attrs
* refactor: update lt lang stop word set
* refactor: add new instances to lt lang tokenizer exceptions
* refactor: remove comments form lt lang init file
* refactor: use function instead of lambda in lt lex lang getter
* refactor: remove conversion to dict in lt init when dict is already provided
* chore: rename lt 'test_basic' to 'test_text'
* feat: add more lt text tests
* feat: add lemmatizer tests
* refactor: remove unused imports, add newline to end of file
* chore: add contributor agreement
* chore: change 'en' to 'lt' in lt example description
* fix: add missing encoding info
* style: add newline to end of file
* refactor: use python2 compatible syntax
* style: reformat code using black
* Add base classes for more languages
* Add test for language class initialization
Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded