* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
* Don't add duplicate patterns (fix#8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the guard if is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
* add error handler for pipe methods
* add unit tests
* remove pipe method that are the same as their base class
* have Language keep track of a default error handler
* cleanup
* formatting
* small refactor
* add documentation
* Handle missing reference values in scorer
Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.
Attributes without unset states:
* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation
Additional changes:
* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`
* Fix import
* Update return types
* rename Pipe to TrainablePipe
* split functionality between Pipe and TrainablePipe
* remove unnecessary methods from certain components
* cleanup
* hasattr(component, "pipe") should be sufficient again
* remove serialization and vocab/cfg from Pipe
* unify _ensure_examples and validate_examples
* small fixes
* hasattr checks for self.cfg and self.vocab
* make is_resizable and is_trainable properties
* serialize strings.json instead of vocab
* fix KB IO + tests
* fix typos
* more typos
* _added_strings as a set
* few more tests specifically for _added_strings field
* bump to 3.0.0a36
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.
* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morpohologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
* Run make_doc optimistically if using phrase matcher patterns.
* remove unused coveragerc I was testing with
* format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.
* Removing old add_patterns function
* Fixing spacing
* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set