* Make vocab update in get_docs deterministic
The attribute `DocBin.strings` is a set. In `DocBin.get_docs`
a given vocab is updated by iterating over this set.
Iteration over a python set produces an arbitrary ordering,
therefore vocab is updated non-deterministically.
When training (fine-tuning) a spacy model, the base model's
vocabulary will be updated with the new vocabulary in the
training data in exactly the way described above. After
serialization, the file `model/vocab/strings.json` will
be sorted in an arbitrary way. This prevents reproducible
model training.
* Revert "Make vocab update in get_docs deterministic"
This reverts commit d6b87a2f55.
* Sort strings in StringStore serialization
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* step_through tests: skip instead of xfail
* test_empty_doc should be fixed with new Thinc version
* remove outdated test (there are other misaligned tests now)
* xfail reason
* fix test according to french exceptions
* clarified some skipped tests
* skip ukranian test instead of xfail
* skip instead of xfail
* skip + reason instead of xfail
* removed obsolete tests referring to removed "set_frozen" functionality
* fix test 999
* remove unused AlignmentError
* remove xfail where possible, skip otherwise
* increment thinc release for empty_doc test
* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in
lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
* Always create `Vocab.lookups` table `lexeme_norm` for
normalization exceptions
* Load base exceptions from `lang.norm_exceptions`, but load
language-specific exceptions from lookups
* Set `lex_attr_getter[NORM]` including new lookups table in
`BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
existing normalizations with the new normalizations (as a replacement
for the previous step that replaced all lexemes data with the
deserialized data)
* Skip English normalization test
Skip English normalization test because the data is now in
`spacy-lookups-data`.
* Remove norm exceptions
Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily
Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.
* Skip creating lexeme cache on load
To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.
* Identify numeric values in Lexeme.set_attrs()
With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests
Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors
Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Fix serialization for lookups
* Fix lookups
* Fix lookups
* Fix lookups
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Give up on serialization test
* Xfail more serialization tests for 3.5
* Fix lookups for 2.7
* Auto-format tests with black
* Add flake8 config
* Tidy up and remove unused imports
* Fix redefinitions of test functions
* Replace orths_and_spaces with words and spaces
* Fix compatibility with pytest 4.0
* xfail test for now
Test was previously overwritten by following test due to naming conflict, so failure wasn't reported
* Unfail passing test
* Only use fixture via arguments
Fixes pytest 4.0 compatibility
## Description
Related issues: #2379 (should be fixed by separating model tests)
* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible
### Todo
- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests
### Types of change
enhancement, tests
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.