* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG required in `from_array` and similar methods
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
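
To make the two modes concrete, here is a minimal pure-Python sketch of how a `lookup` mode and a `rule` mode could differ. It does not use the actual `Lemmatizer` or `Lookups` classes. The function names, the suffix-rule format, and the `index`/`exceptions` tables are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Set, Tuple


def lookup_lemmatize(table: Dict[str, str], string: str) -> str:
    """Lookup mode (sketch): return the table entry, or the string itself."""
    return table.get(string, string)


def rule_lemmatize(
    string: str,
    index: Set[str],
    exceptions: Dict[str, List[str]],
    rules: Iterable[Tuple[str, str]],
) -> List[str]:
    """Rule mode (sketch): exceptions first, then suffix rewrite rules.

    A rewritten form is only accepted if it appears in the index of known
    base forms. The table layout here is hypothetical.
    """
    string = string.lower()
    if string in exceptions:
        return list(exceptions[string])
    forms: List[str] = []
    for old, new in rules:
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form in index and form not in forms:
                forms.append(form)
    # Fall back to the (lowercased) input if no rule produced a known form.
    return forms if forms else [string]


# Hypothetical toy tables for illustration.
RULES = [("ies", "y"), ("s", "")]
INDEX = {"pony", "cat"}
EXCEPTIONS = {"mice": ["mouse"]}

print(lookup_lemmatize({"was": "be"}, "was"))
print(rule_lemmatize("ponies", INDEX, EXCEPTIONS, RULES))
```

In this sketch the lookup mode needs only one table, while the rule mode needs three, which is one reason a component would want to declare its required tables per mode.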
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
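
As a rough illustration of what such a helper could look like, the sketch below collects positional and keyword-only argument names with the standard library's `inspect.getfullargspec`. The function name `sketch_arg_names` and the example factory are hypothetical and are not taken from the codebase.

```python
import inspect
from typing import Callable, List


def sketch_arg_names(func: Callable) -> List[str]:
    """Return the names of regular and keyword-only arguments of func.

    Hypothetical helper: *args and **kwargs are deliberately ignored.
    """
    spec = inspect.getfullargspec(func)
    # dict.fromkeys keeps order and drops accidental duplicates.
    return list(dict.fromkeys(spec.args + spec.kwonlyargs))


def make_component(nlp, name, *, batch_size=128, overwrite=False):
    """Hypothetical factory with keyword-only settings."""
    return (nlp, name, batch_size, overwrite)


print(sketch_arg_names(make_component))
```

Reading only `spec.args` would miss `batch_size` and `overwrite` here, which is the kind of gap the keyword-only handling addresses.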
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
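
The sketch below illustrates the separation described here: component blocks are defined in one place, and the order is given by a list elsewhere. It uses the `[components]` block name from the rename further down this log, and it parses the text with Python's standard `configparser` and `json` purely for illustration. The actual config system may read the file differently, and the component names are hypothetical.

```python
import configparser
import json

# Hypothetical config text: blocks are sorted alphabetically on purpose.
CONFIG = """
[nlp]
lang = "en"
pipeline = ["tagger", "parser"]

[components.ner]
factory = "ner"

[components.parser]
factory = "parser"

[components.tagger]
factory = "tagger"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# The order comes from the list, not from the order of the blocks.
pipeline = json.loads(parser["nlp"]["pipeline"])

# Every block that is defined, whether or not the pipeline uses it.
defined = [
    section.split(".", 1)[1]
    for section in parser.sections()
    if section.startswith("components.")
]

unused = [name for name in defined if name not in pipeline]
print(pipeline, defined, unused)
```

Here `ner` is defined but unused, and re-sorting the blocks leaves the pipeline order untouched.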
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Update errors
* Remove beam for now (maybe)
Remove beam_utils
Update setup.py
Remove beam
* Remove GoldParse
WIP on removing goldparse
Get ArcEager compiling after GoldParse excise
Update setup.py
Get spacy.syntax compiling after removing GoldParse
Rename NewExample -> Example and clean up
Clean html files
Start updating tests
Update Morphologizer
* fix error numbers
* fix merge conflict
* informative error when calling to_array with wrong field
* fix error catching
* fixing language and scoring tests
* start testing get_aligned
* additional tests for new get_aligned function
* Draft create_gold_state for arc_eager oracle
* Fix import
* Fix import
* Remove TokenAnnotation code from nonproj
* fixing NER one-to-many alignment
* Fix many-to-one IOB codes
* fix test for misaligned
* attempt to fix cases with weird spaces
* fix spaces
* test_gold_biluo_different_tokenization works
* allow None as BILUO annotation
* fixed some tests + WIP roundtrip unit test
* add spaces to json output format
* minibatch utility can deal with strings, docs or examples
* fix augment (needs further testing)
* various fixes in scripts - needs to be further tested
* fix test_cli
* cleanup
* correct silly typo
* add support for MORPH in to/from_array, fix morphologizer overfitting test
* fix tagger
* fix entity linker
* ensure test keeps working with non-linked entities
* pipe() takes docs, not examples
* small bug fix
* textcat bugfix
* throw informative error when running the components with the wrong type of objects
* fix parser tests to work with example (most still failing)
* fix BiluoPushDown parsing entities
* small fixes
* bugfix tok2vec
* fix renames and simple_ner labels
* various small fixes
* prevent writing dummy values like deps because that could interfere with sent_start values
* fix the fix
* implement split_sent with aligned SENT_START attribute
* test for split sentences with various alignment issues, works
* Return ArcEagerGoldParse from ArcEager
* Update parser and NER gold stuff
* Draft new GoldCorpus class
* add links to to_dict
* clean up
* fix test checking for variants
* Fix oracles
* Start updating converters
* Move converters under spacy.gold
* Move things around
* Fix naming
* Fix name
* Update converter to produce DocBin
* Update converters
* Allow DocBin to take list of Doc objects.
* Make spacy convert output docbin
* Fix import
* Fix docbin
* Fix compile in ArcEager
* Fix import
* Serialize all attrs by default
* Update converter
* Remove jsonl converter
* Add json2docs converter
* Draft Corpus class for DocBin
* Work on train script
* Update Corpus
* Update DocBin
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Update train.py
* Fix Corpus
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
* Skip tests that cause crashes
* Skip test causing segfault
* Remove GoldCorpus
* Update imports
* Update after removing GoldCorpus
* Fix module name of corpus
* Fix import
* Work on parser oracle
* Update arc_eager oracle
* Restore ArcEager.get_cost function
* Update transition system
* Update test_arc_eager_oracle
* Remove beam test
* Update test
* Unskip
* Unskip tests
* add links to to_dict
* clean up
* fix test checking for variants
* Allow DocBin to take list of Doc objects.
* Fix compile in ArcEager
* Serialize all attrs by default
Move converters under spacy.gold
Move things around
Fix naming
Fix name
Update converter to produce DocBin
Update converters
Make spacy convert output docbin
Fix import
Fix docbin
Fix import
Update converter
Remove jsonl converter
Add json2docs converter
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Start updating converters
* Work on train script
* Draft Corpus class for DocBin
Update Corpus
Fix Corpus
* Update DocBin
Add missing strings when serializing
* Update train.py
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
Skip tests that cause crashes
Skip test causing segfault
* Remove GoldCorpus
Update imports
Update after removing GoldCorpus
Fix module name of corpus
Fix import
* Work on parser oracle
Update arc_eager oracle
Restore ArcEager.get_cost function
Update transition system
* Update tests
Remove beam test
Update test
Unskip
Unskip tests
* Add get_aligned_parse method in Example
Fix Example.get_aligned_parse
* Add kwargs to Corpus.dev_dataset to match train_dataset
* Update nonproj
* Use get_aligned_parse in ArcEager
* Add another arc-eager oracle test
* Remove Example.doc property
Remove Example.doc
* Update ArcEager oracle
Fix Break oracle
* Debugging
* Fix Corpus
* Fix eg.doc
* Format
* small fixes
* limit arg for Corpus
* fix test_roundtrip_docs_to_docbin
* fix test_make_orth_variants
* fix add_label test
* Update tests
* avoid writing temp dir in json2docs, fixing 4402 test
* Update test
* Add missing costs to NER oracle
* Update test
* Work on Example.get_aligned_ner method
* Clean up debugging
* Xfail tests
* Remove prints
* Remove print
* Xfail some tests
* Replace unseen labels for parser
* Update test
* Update test
* Xfail test
* Fix Corpus
* fix imports
* fix docs_to_json
* various small fixes
* cleanup
* Support gold_preproc in Corpus
* Support gold_preproc
* Pass gold_preproc setting into corpus
* Remove debugging
* Fix gold_preproc
* Fix json2docs converter
* Fix convert command
* Fix flake8
* Fix import
* fix output_dir (converted to Path by typer)
* fix var
* bugfix: update states after creating golds to avoid out of bounds indexing
* Improve efficiency of ArcEager oracle
* pull merge_sent into iob2docs to avoid Doc creation for each line
* fix asserts
* bugfix excl Span.end in iob2docs
* Support max_length in Corpus
* Fix arc_eager oracle
* Filter out unannotated sentences in NER
* Remove debugging in parser
* Simplify NER alignment
* Fix conversion of NER data
* Fix NER init_gold_batch
* Tweak efficiency of precomputable affine
* Update onto-json default
* Update gold test for NER
* Fix parser test
* Update test
* Add NER data test
* Fix convert for single file
* Fix test
* Hack scorer to avoid evaluating non-nered data
* Fix handling of NER data in Example
* Output unlabelled spans from O biluo tags in iob_utils
* Fix unset variable
* Return kept examples from init_gold_batch
* Return examples from init_gold_batch
* Dont return Example from init_gold_batch
* Set spaces on gold doc after conversion
* Add test
* Fix spaces reading
* Improve NER alignment
* Improve handling of missing values in NER
* Restore the 'cutting' in parser training
* Add assertion
* Print epochs
* Restore random cuts in parser/ner training
* Implement Doc.copy
* Implement Example.copy
* Copy examples at the start of Language.update
* Don't unset example docs
* Tweak parser model slightly
* attempt to fix _guess_spaces
* _add_entities_to_doc first, so that links don't get overwritten
* fixing get_aligned_ner for one-to-many
* fix indexing into x_text
* small fix biluo_tags_from_offsets
* Add onto-ner config
* Simplify NER alignment
* Fix NER scoring for partially annotated documents
* fix indexing into x_text
* fix test_cli failing tests by ignoring spans in doc.ents with empty label
* Fix limit
* Improve NER alignment
* Fix count_train
* Remove print statement
* fix tests, we're not having nothing but None
* fix clumsy fingers
* Fix tests
* Fix doc.ents
* Remove empty docs in Corpus and improve limit
* Update config
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morphologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Heads of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Restructure Example with merged sents as default
An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.
* Input/output a single `Example.token_annotation`
* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries
* Replace `Example.merge_sents()` with `Example.split_sents()`
* Modify components to use a single `Example.token_annotation`
* Pipeline components
* conllu2json converter
* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather than appending/extending.
* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`
* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available
* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is saved in the provided
`Example` and that `Example` is returned, rather than creating a new
`Example` with no other internal annotation
* Update tests for API changes and `merge_sents()` vs. `split_sents()`
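
The idea of keeping one merged annotation per `Doc` and splitting on demand can be sketched in plain Python. The function below is not the `Example.split_sents()` implementation. It assumes, for illustration, that `sent_starts` holds `1` for a token that begins a sentence and `0` otherwise.

```python
from typing import List


def split_by_sent_starts(words: List[str], sent_starts: List[int]) -> List[List[str]]:
    """Split a merged token list into sentences (hypothetical helper)."""
    if len(words) != len(sent_starts):
        raise ValueError("words and sent_starts must have the same length")
    sents: List[List[str]] = []
    current: List[str] = []
    for word, start in zip(words, sent_starts):
        # A start marker closes the sentence collected so far.
        if start == 1 and current:
            sents.append(current)
            current = []
        current.append(word)
    if current:
        sents.append(current)
    return sents


words = ["Hi", ".", "Bye", "now", "."]
sent_starts = [1, 0, 1, 0, 0]
print(split_by_sent_starts(words, sent_starts))
```

Only token lists are returned, which mirrors the note above that no raw text is available after splitting.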
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Fix make_orth_variants()
Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.
* Add basic test for make_orth_variants()
* Replace try/except with conditionals
* Replace default morph value with set
* OrigAnnot class instead of gold.orig_annot list of zipped tuples
* from_orig to replace from_annot_tuples
* rename to RawAnnot
* some unit tests for GoldParse creation and internal format
* removing orig_annot and switching to lists instead of tuple
* rewriting tuples to use RawAnnot (+ debug statements, WIP)
* fix pop() changing the data
* small fixes
* pop-append fixes
* return RawAnnot for existing GoldParse to have uniform interface
* clean up imports
* fix merge_sents
* add unit test for 4402 with new structure (not working yet)
* introduce DocAnnot
* typo fixes
* add unit test for merge_sents
* rename from_orig to from_raw
* fixing unit tests
* fix nn parser
* read_annots to produce text, doc_annot pairs
* _make_golds fix
* rename golds_to_gold_annots
* small fixes
* fix encoding
* have golds_to_gold_annots use DocAnnot
* missed a spot
* merge_sents as function in DocAnnot
* allow specifying only part of the token-level annotations
* refactor with Example class + underlying dicts
* pipeline components to work with Example objects (wip)
* input checking
* fix yielding
* fix calls to update
* small fixes
* fix scorer unit test with new format
* fix kwargs order
* fixes for ud and conllu scripts
* fix reading data for conllu script
* add in proper errors (not fixed numbering yet to avoid merge conflicts)
* fixing few more small bugs
* fix EL script
* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
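
A minimal sketch of how a backwards-compatible `add` with this kind of check could be structured is shown below. `SketchMatcher` is a hypothetical stand-in, not the real `Matcher`. It assumes the old call style was `add(key, on_match, *patterns)`, the new style is `add(key, patterns, on_match=...)`, and that each pattern is a list of token dicts.

```python
class SketchMatcher:
    """Hypothetical stand-in used only to illustrate the argument handling."""

    def __init__(self):
        self.patterns = {}
        self.callbacks = {}

    def add(self, key, patterns, *old_patterns, on_match=None):
        # Old style: the second positional argument was the callback (or None).
        if patterns is None or callable(patterns):
            on_match = patterns
            patterns = list(old_patterns)
        if not isinstance(patterns, list):
            raise ValueError("patterns must be a list of patterns")
        for pattern in patterns:
            # Each pattern should itself be a list. A dict at this level
            # suggests a single pattern was passed without wrapping it.
            if not isinstance(pattern, list):
                raise ValueError(
                    "Expected a list of patterns. Did you mean [pattern]?"
                )
        self.patterns.setdefault(key, []).extend(patterns)
        self.callbacks[key] = on_match


matcher = SketchMatcher()
matcher.add("NEW", [[{"LOWER": "hello"}], [{"LOWER": "hi"}]])
matcher.add("OLD", None, [{"LOWER": "bye"}])
print(sorted(matcher.patterns))
```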
* Fix typo [ci skip]
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
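
The Lookups refactor above (a default for `get_table`, and fetching the table inside the lemmatization methods) can be illustrated with a small pure-Python sketch. `SketchLookups` and `SketchLemmatizer` are hypothetical classes, and the table name `lemma_lookup` is an assumption made for this example.

```python
_UNSET = object()


class SketchLookups:
    """Hypothetical container of named tables."""

    def __init__(self):
        self._tables = {}

    def add_table(self, name, data=None):
        self._tables[name] = dict(data or {})

    def get_table(self, name, default=_UNSET):
        if name not in self._tables:
            if default is _UNSET:
                raise KeyError(name)
            return default
        return self._tables[name]


class SketchLemmatizer:
    """Holds the Lookups instance and resolves the table on every call."""

    def __init__(self, lookups):
        self.lookups = lookups

    def lookup(self, string):
        # Fetch the table here rather than in __init__, so a table that is
        # added or replaced later is still picked up.
        table = self.lookups.get_table("lemma_lookup", {})
        return table.get(string, string)


lookups = SketchLookups()
lemmatizer = SketchLemmatizer(lookups)
before = lemmatizer.lookup("mice")
lookups.add_table("lemma_lookup", {"mice": "mouse"})
after = lemmatizer.lookup("mice")
print(before, after)
```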
The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token.
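
The row deletion described above can be sketched with plain Python lists standing in for the tensor. The helper name `merge_tensor_rows` is hypothetical, and the real fix operates on the array stored in `doc.tensor` rather than on lists.

```python
from typing import List


def merge_tensor_rows(tensor: List[List[float]], start: int, end: int) -> List[List[float]]:
    """Collapse rows start..end-1 into one row, keeping the last token's row."""
    if not 0 <= start < end <= len(tensor):
        raise ValueError("invalid span")
    return tensor[:start] + [tensor[end - 1]] + tensor[end:]


# Four tokens, one row each; merge tokens 1 and 2 into a single token.
tensor = [[0.0], [1.0], [2.0], [3.0]]
merged = merge_tensor_rows(tensor, 1, 3)
print(len(merged), merged)
```

After the merge the number of rows again matches the number of tokens, which is the invariant the fix restores.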
* Add test for resizing doc.tensor when merging
* Add test for resizing doc.tensor when merging. Closes #1963
* Update get_lca_matrix test for develop
* Fix retokenize if tensor unset
* Auto-format tests with black
* Add flake8 config
* Tidy up and remove unused imports
* Fix redefinitions of test functions
* Replace orths_and_spaces with words and spaces
* Fix compatibility with pytest 4.0
* xfail test for now
Test was previously overwritten by following test due to naming conflict, so failure wasn't reported
* Unfail passing test
* Only use fixture via arguments
Fixes pytest 4.0 compatibility
## Description
Related issues: #2379 (should be fixed by separating model tests)
* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible
### Todo
- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests
### Types of change
enhancement, tests
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.