The intent of this was that it would be a component pipeline that used
entities as input, but that's now covered by the get_mentions function
as a pipeline arg.
This is closer to the traditional evaluation method. That uses an
average of three scores, this is just using the bcubed metric for now
(nothing special about bcubed, just picked one).
The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.
Besides being comparable with traditional evaluations, this scoring is
relatively fast.
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.
In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
* Preserve existing ENT_KB_ID annotation in NER
Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.
* Simplify kb_id assignment
* Simplify further
* Replace negative rows with 0 in StaticVectors
Replace negative row indices with 0-vectors in `StaticVectors`.
* Increase versions related to StaticVectors
* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations
Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5
* Update config defaults to new versions
* Update docs
* Add util method for check
* Add new languages to list with lexeme norm tables
* Add check to all relevant components
* Add config details to warning message
Note that we're not actually inspecting the model config to see if
`NORM` is used as an attribute, so it may warn in cases where it's not
relevant.
* add multi-label textcat to menu
* add infobox on textcat API
* add info to v3 migration guide
* small edits
* further fixes in doc strings
* add infobox to textcat architectures
* add textcat_multilabel to overview of built-in components
* spelling
* fix unrelated warn msg
* Add textcat_multilabel to quickstart [ci skip]
* remove separate documentation page for multilabel_textcategorizer
* small edits
* positive label clarification
* avoid duplicating information in self.cfg and fix textcat.score
* fix multilabel textcat too
* revert threshold to storage in cfg
* revert threshold stuff for multi-textcat
Co-authored-by: Ines Montani <ines@ines.io>
* initial coref_er pipe
* matcher more flexible
* base coref component without actual model
* initial setup of coref_er.score
* rename to include_label
* preliminary score_clusters method
* apply scoring in coref component
* IO fix
* return None loss for now
* rename to CoreferenceResolver
* some preliminary unit tests
* use registry as callable
* Add test for #7035
* Update test for issue 7056
* Fix test
* Fix transitions method used in testing
* Fix state eol detection when rebuffer
* Clean up redundant fix
* add error handler for pipe methods
* add unit tests
* remove pipe method that are the same as their base class
* have Language keep track of a default error handler
* cleanup
* formatting
* small refactor
* add documentation
* warn when frozen components break listener pattern
* few notes in the documentation
* update arg name
* formatting
* cleanup
* specify listeners return type
* Add long_token_splitter component
Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.
The `long_token_splitter` splits tokens that are at least
`long_token_length` tokens long into smaller tokens of `split_length`
size.
Notes:
* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.
* Adjust API, add test
* Fix name in factory
* Handle unset token.morph in Morphologizer
Handle unset `token.morph` in `Morphologizer.initialize` and
`Morphologizer.get_loss`. If both `token.morph` and `token.pos` are
unset, treat the annotation as missing rather than empty.
* Add token.has_morph()