* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in
lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
* Always create `Vocab.lookups` table `lexeme_norm` for
normalization exceptions
* Load base exceptions from `lang.norm_exceptions`, but load
language-specific exceptions from lookups
* Set `lex_attr_getter[NORM]` including new lookups table in
`BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
existing normalizations with the new normalizations (as a replacement
for the previous step that replaced all lexemes data with the
deserialized data)
* Skip English normalization test
Skip English normalization test because the data is now in
`spacy-lookups-data`.
* Remove norm exceptions
Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily
Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.
* Skip creating lexeme cache on load
To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.
* Identify numeric values in Lexeme.set_attrs()
With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests
Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors
Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Limiting noun_chunks for specific langauges
* Limiting noun_chunks for specific languages
Contributor Agreement
* Addressing review comments
* Removed unused fixtures and imports
* Add fa_tokenizer in test suite
* Use fa_tokenizer in test
* Undo extraneous reformatting
Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
* Restructure tag maps for MorphAnalysis changes
Prepare tag maps for upcoming MorphAnalysis changes that allow
arbritrary features.
* Use default tag map rather than duplicating for ca / uk / vi
* Import tag map into defaults for ga
* Modify tag maps so all morphological fields and features are strings
* Move features from `"Other"` to the top level
* Rewrite tuples as strings separated by `","`
* Rewrite morph symbols for fr lemmatizer as strings
* Export MorphAnalysis under spacy.tokens
* Modify morphology to support arbitrary features
Modify `Morphology` and `MorphAnalysis` so that arbitrary features are
supported.
* Modify `MorphAnalysisC` so that it can support arbitrary features and
multiple values per field. `MorphAnalysisC` is redesigned to contain:
* key: hash of UD FEATS string of morphological features
* array of `MorphFeatureC` structs that each contain a hash of `Field`
and `Field=Value` for a given morphological feature, which makes it
possible to:
* find features by field
* represent multiple values for a given field
* `get_field()` is renamed to `get_by_field()` and is no longer `nogil`.
Instead a new helper function `get_n_by_field()` is `nogil` and returns
`n` features by field.
* `MorphAnalysis.get()` returns all possible values for a field as a
list of individual features such as `["Tense=Pres", "Tense=Past"]`.
* `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string.
* `Morphology.feats_to_dict()` converts a UD FEATS string to a dict
where:
* Each field has one entry in the dict
* Multiple values remain separated by a separator in the value string
* `Token.morph_` returns the UD FEATS string and you can set
`Token.morph_` with a UD FEATS string or with a tag map dict.
* Modify get_by_field to use np.ndarray
Modify `get_by_field()` to use np.ndarray. Remove `max_results` from
`get_n_by_field()` and always iterate over all the fields.
* Rewrite without MorphFeatureC
* Add shortcut for existing feats strings as keys
Add shortcut for existing feats strings as keys in `Morphology.add()`.
* Check for '_' as empty analysis when adding morphs
* Extend helper converters in Morphology
Add and extend helper converters that convert and normalize between:
* UD FEATS strings (`"Case=dat,gen|Number=sing"`)
* per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`)
* list of individual features (`["Case=dat", "Case=gen",
"Number=sing"]`)
All converters sort fields and values where applicable.
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
* Add default to util.get_entry_point
* Tidy up entry points
* Read lookups from entry points
* Remove lookup tables and related tests
* Add lookups install option
* Remove lemmatizer tests
* Remove logic to process language data files
* Update setup.cfg
* Adjust Table API and add docs
* Add attributes and update description [ci skip]
* Use strings.get_string_id instead of hash_string
* Fix table method calls
* Make orth arg in Lemmatizer.lookup optional
Fall back to string, which is now handled by Table.__contains__ out-of-the-box
* Fix method name
* Auto-format
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Lookups / Tables now work
This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.
* Add lookups to setup.py
* Actually add lookups pyx
The previous commit added the old py file...
* Lookups work-in-progress
* Move from pyx back to py
* Add string based lookups, fix serialization
* Update tests, language/lemmatizer to work with string lookups
There are some outstanding issues here:
- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues
More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.
* Change lemmatizer lookup method to pass (orth, string)
* Fix token lookup
* Fix French lookup
* Fix lt lemmatizer test
* Fix Dutch lemmatizer
* Fix lemmatizer lookup test
This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.
* Make uk/nl/ru lemmatizer lookup methods consistent
The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).
Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.
Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.
* Make recently added Greek method compatible
* Remove redundant class/method
Leftovers from a merge not cleaned up adequately.
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Move Turkish lemmas to a json file
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.
This focuses on Turkish specifically because it has the most language
data in core.
* Transition all lemmatizer.py files to json
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.
None of the affected files have logic, so this was pretty
straightforward.
One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
* Move large lang data to json for fr/nb/nl/sv
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.
For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
* Fix id lemmas.json
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
* Add .json.gz to gitignore
This covers the json.gz files built as part of distribution.
* Add language data gzip to build process
Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
* Remove Danish lemmatizer.py
Missed this when I added the json.
* Update to match latest explosion/srsly#9
The way gzipped json is loaded/saved in srsly changed a bit.
* Only compress language data if necessary
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
* Move en/el language data to json
This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.
* Remove empty files in Norwegian tokenizer
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.
This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
* Remove dubious entries in English lookup.json
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
* Fix small issues with en/fr lemmatizers
The en tokenizer was including the removed _nouns.py file, so that's
removed.
The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
* Auto-format
* Auto-format
* Update srsly pin
* Consistently use pathlib paths
* replace unicode categories with raw list of code points
* simplifying ranges
* fixing variable length quotes
* removing redundant regular expression
* small cleanup of regexp notations
* quotes and alpha as ranges instead of alterations
* removed most regexp dependencies and features
* exponential backtracking - unit tests
* rewrote expression with pathological backtracking
* disabling double hyphen tests for now
* test additional variants of repeating punctuation
* remove regex and redundant backslashes from load_reddit script
* small typo fixes
* disable double punctuation test for russian
* clean up old comments
* format block code
* final cleanup
* naming consistency
* french strings as unicode for python 2 support
* french regular expression case insensitive
<!--- Provide a general summary of your changes in the title. -->
## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)
Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.
At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.
### Types of change
enhancement, code style
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.