* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Change type of the mentions for which entity candidates are looked up from Iterable[Span] to SpanGroup.
* Update docs.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refactor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Reverse erroneous changes during merge.
* Update return type in EL tests.
* Re-add Candidate to setup.py.
* Format updated docs.
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Make empty_kb() configurable.
* Format.
* Update docs.
* Be more specific in KB serialization test.
* Update KB serialization tests. Update docs.
* Remove doc update for batched candidate generation.
* Fix serialization of subclassed KB in tests.
* Format.
* Update docstring.
* Update docstring.
* Switch from pickle to json for custom field serialization.
* Improve the correctness of _parse_patch
* If there are no more actions, do not attempt to make further
transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
the number of states.
* Reimplement distillation with oracle cut size
The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.
* Rename states2actions to _states_to_actions for consistency
* Test distillation max cuts in NER
* Mark parser/NER tests as slow
* Typo
* Fix invariant in _states_diff_to_actions
* Rename _init_batch -> _init_batch_from_teacher
* Ninja edit the ninja edit
* Check that we raise an exception when we pass the incorrect number of actions
* Remove unnecessary get
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Write out condition more explicitly
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Remove backwards-compatible overwrite from Entity Linker
This also adds a docstring about overwrite, since it wasn't present.
* Fix docstring
* Remove backward compat settings in Morphologizer
This also needed a docstring added.
For this component it's less clear what the right overwrite settings
are.
* Remove backward compat from sentencizer
This was simple
* Remove backward compat from senter
Another simple one
* Remove backward compat setting from tagger
* Add docstrings
* Update spacy/pipeline/morphologizer.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update docs
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Move Entity Linker v1 component to spacy-legacy
This is a follow-up to #11889 that moves the component instead of
removing it.
In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?
* Use spacy-legacy pr for CI
This will need to be reverted before merging.
* Add temporary step to log installed spacy-legacy version
* Modify requirements.txt to trigger tests
* Add comment to Python to trigger tests
* TODO REVERT This is a commit with logic changes to trigger tests
* Remove pipe from YAML
Works locally, but possibly this is causing a quoting error or
something.
* Revert "TODO REVERT This is a commit with logic changes to trigger tests"
This reverts commit 689fae71f3.
* Revert "Add comment to Python to trigger tests"
This reverts commit 11840fc598.
* Add more logging
* Try installing directly in workflow
* Try explicitly uninstalling spacy-legacy first
* Cat requirements.txt to confirm contents
In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.
* Log requirements at start of build
* TODO REVERT Change thinc spec
Want to see what happens to the installed thinc spec with this change.
* Update thinc requirements
This makes it the same as it was before the merge, >=8.1.0,<8.2.0.
* Use same thinc version as v4 branch
* TODO REVERT Mark dependency check as xfail
spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.
* Remove debugging output / install step
* Revert "Remove debugging output / install step"
This reverts commit 923ea7448b.
* Clean up debugging output
The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.
This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.
* Revert "TODO REVERT Mark dependency check as xfail"
This reverts commit d4863ec156.
* Fix requirements.txt, increasing spacy-legacy version
* Raise spacy legacy version in setup.cfg
* Remove azure build workarounds
* make spacy-legacy version explicit in error message
* Remove debugging line
* Suggestions from code review
* Add `Language.distill`
This method is the distillation counterpart of `Language.update`. It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.
* Apply suggestions from code review
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Clarify how Example is used in distillation
* Update transition parser distill docstring for examples argument
* Pass optimizer to `TrainablePipe.distill`
* Annotate pipe before update
As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.
* Rename `component_map` to `student_to_teacher`
* Better synopsis in `Language.distill` docstring
* `name` -> `student_name`
* Fix labels type in docstring
* Mark distill test as slow
* Fix `student_to_teacher` type in docs
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Refactor _scores2guesses
* Handle arrays on GPU
* Convert argmax result to raw integer
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use NumpyOps() to copy data to CPU
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Changes based on review comments
* Use different _scores2guesses depending on tree_k
* Add tests for corner cases
* Add empty line for consistency
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Try to fix doc.copy
* Set dev version
* Make vocab always own lexemes
* Change version
* Add SpanGroups.copy method
* Fix set_annotations during Parser.update
* Fix dict proxy copy
* Upd version
* Fix copying SpanGroups
* Fix set_annotations in parser.update
* Fix parser set_annotations during update
* Revert "Fix parser set_annotations during update"
This reverts commit eb138c89ed.
* Revert "Fix set_annotations in parser.update"
This reverts commit c6df0eafd0.
* Fix set_annotations during parser update
* Inc version
* Handle final states in get_oracle_sequence
* Inc version
* Try to fix parser training
* Inc version
* Fix
* Inc version
* Fix parser oracle
* Inc version
* Inc version
* Fix transition has_gold
* Inc version
* Try to use real histories, not oracle
* Inc version
* Upd parser
* Inc version
* WIP on rewrite parser
* WIP refactor parser
* New progress on parser model refactor
* Prepare to remove parser_model.pyx
* Convert parser from cdef class
* Delete spacy.ml.parser_model
* Delete _precomputable_affine module
* Wire up tb_framework to new parser model
* Wire up parser model
* Uncython ner.pyx and dep_parser.pyx
* Uncython
* Work on parser model
* Support unseen_classes in parser model
* Support unseen classes in parser
* Cleaner handling of unseen classes
* Work through tests
* Keep working through errors
* Keep working through errors
* Work on parser. 15 tests failing
* Xfail beam stuff. 9 failures
* More xfail. 7 failures
* Xfail. 6 failures
* cleanup
* formatting
* fixes
* pass nO through
* Fix empty doc in update
* Hackishly fix resizing. 3 failures
* Fix redundant test. 2 failures
* Add reference version
* black formatting
* Get tests passing with reference implementation
* Fix missing prints
* Add missing file
* Improve indexing on reference implementation
* Get non-reference forward func working
* Start rigging beam back up
* removing redundant tests, cf #8106
* black formatting
* temporarily xfailing issue 4314
* make flake8 happy again
* mypy fixes
* ensure labels are added upon predict
* cleanup remnants from merge conflicts
* Improve unseen label masking
Two changes to speed up masking by ~10%:
- Use a bool array rather than an array of float32.
- Let the mask indicate whether a label was seen, rather than
unseen. The mask is most frequently used to index scores for
seen labels. However, since the mask marked unseen labels,
this required computing an intermediate flipped mask.
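The idea behind the mask change can be sketched as follows. This is a minimal illustration, not the parser code: the array values and the names `scores`, `seen_mask` and `best` are hypothetical.

```python
import numpy as np

# Hypothetical scores for 2 states x 4 labels; label 2 has never been seen.
scores = np.array(
    [[0.1, 0.7, 0.9, 0.2],
     [0.4, 0.3, 0.8, 0.6]],
    dtype="float32",
)

# Bool mask that is True for *seen* labels, so it can index the scores
# directly without first building a flipped copy of the mask.
seen_mask = np.array([True, True, False, True])

# Scores restricted to the seen labels.
seen_scores = scores[:, seen_mask]

# Map the argmax over the seen columns back to the original label indices.
best = np.flatnonzero(seen_mask)[seen_scores.argmax(axis=1)]
```

With a mask that marks unseen labels instead, every lookup of the seen scores would first need `~mask`, which is the extra work this change avoids.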
* Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.
Gives a ~15% speed improvement when using batch_by_sequence with size
32.
* Temporarily disable ner and rehearse tests
Until rehearse is implemented again in the refactored parser.
* Fix loss serialization issue (#10600)
* Fix loss serialization issue
Serialization of a model fails with:
TypeError: array(738.3855, dtype=float32) is not JSON serializable
Fix this using float conversion.
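A minimal sketch of the failure and the fix, assuming the losses are kept in a plain dict (the `losses` dict and the `"parser"` key are illustrative, not taken from the code):

```python
import json

import numpy as np

# A loss stored as a zero-dimensional float32 array, as in the error above.
losses = {"parser": np.array(738.3855, dtype="float32")}

# The standard json module does not know how to encode NumPy arrays.
try:
    json.dumps(losses)
    failed = False
except TypeError:
    failed = True

# Converting to a built-in float makes the value serializable.
serializable = {name: float(value) for name, value in losses.items()}
encoded = json.dumps(serializable)
```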
* Disable CI steps that require spacy.TransitionBasedParser.v2
After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.
* Add back support for beam parsing to the refactored parser (#10633)
* Add back support for beam parsing
Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.
* Use kwarg for beam width
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Avoid implicit default for beam_width and beam_density
* Parser.{beam,greedy}_parse: ensure labels are added
* Remove 'deprecated' comments
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parser `StateC` optimizations (#10746)
* `StateC`: Optimizations
Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead
* Fix typo
* Replace `set::count` with `set::find`
* Add exception attribute to c'tor
* Remove unused import
* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`
* Merge `cdef` variable declarations and assignments
* Vectorize `example.get_aligned_parses` (#10789)
* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import
* Convert aligned array to lists before returning
* Revert import renaming
* Elide slice arguments when selecting the entire range
* Tagger/morphologizer alignment performance optimizations (#10798)
* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`
* `AlignmentArray`: Use native list as staging buffer for offset calculation
* `example`: Vectorize `get_aligned`
* Hoist inner functions out of `get_aligned`
* Replace inline `if..else` clause in assignment statement
* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays
* `example`: Replace array unique value check with `groupby`
* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`
* `util`: Update `all_equal` docstring
* Explicitly use `int32_t*`
* Restore C CPU inference in the refactored parser (#10747)
* Bring back the C parsing model
The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.
* Use C sgemm provided by the Ops implementation
* Make tb_framework module Cython, merge in C forward implementation
* TransitionModel: raise in backprop returned from forward_cpu
* Re-enable greedy parse test
* Return transition scores when forward_cpu is used
* Apply suggestions from code review
Import `Model` from `thinc.api`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use relative imports in tb_framework
* Don't assume a default for beam_width
* We don't have a direct dependency on BLIS anymore
* Rename forwards to _forward_{fallback,greedy_cpu}
* Require thinc >=8.1.0,<8.2.0
* tb_framework: clean up imports
* Fix return type of _get_seen_mask
* Move up _forward_greedy_cpu
* Style fixes.
* Lower thinc lowerbound to 8.1.0.dev0
* Formatting fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function
Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loop would:
1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
the student.
In the refactored parser, we cannot perform such stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:
1. Get the predictions of all parsing steps from the student, along
with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.
To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.
* tb_framework: wrap upper_W and upper_b in Linear
Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.
* Add test for TransitionSystem.apply_actions
* Better FIXME marker
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fixes from Madeesh
* Apply suggestions from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove useless assignment
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch
* tb_framework: prefix many auxiliary functions with underscore
To clearly state the intent that they are private.
* Rename `lower` to `hidden`, `upper` to `output`
* Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
* Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
* ner: re-enable sentence boundary checks
* Re-enable test that works now.
* test_ner: make loss test more strict again
* Remove commented line
* Re-enable some more beam parser tests
* Remove unused _forward_reference function
* Update for CBlas changes in Thinc 8.1.0.dev2
Bump thinc dependency to 8.1.0.dev3.
* Remove references to spacy.TransitionBasedParser.{v1,v2}
Since they will not be offered starting with spaCy v4.
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`
* dont use get_array_module (#11056) (#11293)
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)
* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`
* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`
* `mypy` fix
* Update comment
* `conftest`: Expose `register_cython_tests`
* Remove unused import
* Move `argmax` impls to new `_parser_utils` Cython module (#11410)
* Parser does not have to be a cdef class anymore
This also fixes validation of the initialization schema.
* Add back spacy.TransitionBasedParser.v2
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Remove module from setup.py that got added during the merge
* Bring back support for `update_with_oracle_cut_size` (#12086)
* Bring back support for `update_with_oracle_cut_size`
This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut_size` into
separate sequences that have at most `update_with_oracle_cut_size`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.
Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.
Training time 1000 steps on de_core_news_lg:
- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s
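The cutting itself can be sketched roughly as below. This is a simplified illustration, not the parser implementation: `cut_oracle_sequence` is a hypothetical helper, actions are plain integers, and treating a cut size of 0 as "no cutting" is an assumption. The real code also has to replay the oracle transitions to obtain the initial state for each additional sequence, which is not shown here.

```python
from typing import List, Sequence


def cut_oracle_sequence(actions: Sequence[int], cut_size: int) -> List[List[int]]:
    """Split one oracle transition sequence into chunks of at most
    `cut_size` transitions (hypothetical helper for illustration)."""
    # Assumption: a non-positive cut size disables cutting.
    if cut_size <= 0 or len(actions) <= cut_size:
        return [list(actions)]
    return [
        list(actions[start : start + cut_size])
        for start in range(0, len(actions), cut_size)
    ]


# Seven oracle transitions cut with a cut size of 3.
chunks = cut_oracle_sequence(list(range(7)), 3)
```

Since no chunk is longer than the cut size, the sequences in a batch have similar lengths and fewer forward steps are spent on a handful of long sequences.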
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Apply suggestions from @shadeMe
* Use chained conditional
* Test with update_with_oracle_cut_size={0, 1, 5, 100}
And fix a bug that occurs with a cut size of 1.
* Fix up some merge fall out
* Update parser distillation for the refactor
In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.
* Add back spacy.TransitionBasedParser.v1 references
- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Add `TrainablePipe.{distill,get_teacher_student_loss}`
This change adds two methods:
- `TrainablePipe::distill` which performs a training step of a
student pipe on a teacher pipe, given a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
of a student relative to the teacher.
The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.
* Fix stray `Beam` import
* Fix incorrect import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TrainablePipe.distill: use `Iterable[Example]`
* Add Pipe.is_distillable method
* Add `validate_distillation_examples`
This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.
* Update distill documentation
* Add distill documentation for all pipes that support distillation
* Fix incorrect identifier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add comment to explain `is_distillable`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* enable fuzzy matching
* add fuzzy param to EntityMatcher
* include rapidfuzz_capi
not yet used
* fix type
* add FUZZY predicate
* add fuzzy attribute list
* fix type properly
* tidying
* remove unnecessary dependency
* handle fuzzy sets
* simplify fuzzy sets
* case fix
* switch to FUZZYn predicates
use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.
* revert changes added for fuzzy param
* switch to polyleven
(Python package)
* fuzzy match only on oov tokens
* remove polyleven
* exclude whitespace tokens
* don't allow more edits than characters
* fix min distance
* reinstate FUZZY operator
with length-based distance function
* handle sets inside regex operator
* remove is_oov check
* attempt build fix
no mypy failure locally
* re-attempt build fix
* don't overwrite fuzzy param value
* move fuzzy_match
to its own Python module to allow patching
* move fuzzy_match back inside Matcher
simplify logic and add tests
* Format tests
* Parametrize fuzzyn tests
* Parametrize and merge fuzzy+set tests
* Format
* Move fuzzy_match to a standalone method
* Change regex kwarg type to bool
* Add types for fuzzy_match
- Refactor variable names
- Add test for symmetrical behavior
* Parametrize fuzzyn+set tests
* Minor refactoring for fuzz/fuzzy
* Make fuzzy_match a Matcher kwarg
* Update type for _default_fuzzy_match
* don't overwrite function param
* Rename to fuzzy_compare
* Update fuzzy_compare default argument declarations
* allow fuzzy_compare override from EntityRuler
* define new Matcher keyword arg
* fix type definition
* Implement fuzzy_compare config option for EntityRuler and SpanRuler
* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects
* Use simpler fuzzy_compare algorithm
* Update types
* Increase minimum to 2 in fuzzy_compare to allow one transposition
* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX
* Add FUZZY6..9
* Add initial docs
* Increase default fuzzy to rounded 30% of pattern length
* Update docs for fuzzy_compare in components
* Update EntityRuler and SpanRuler API docs
* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare
To have naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.
* Fix schema aliases
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix typo
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add FUZZY6-9 operators and update tests
* Parameterize test over greedy
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type for fuzzy_compare to remove Optional
* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein
* Update docs following levenshtein_compare renaming
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
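A rough pure-Python sketch of the comparison described in the commits above: a default edit budget of a rounded 30% of the pattern length, with a minimum of 2. This is not the code in `spacy.matcher.levenshtein`. The names `levenshtein` and `fuzzy_compare_sketch` and the `-1` sentinel for "use the default" are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def fuzzy_compare_sketch(text: str, pattern: str, max_edits: int = -1) -> bool:
    """Return True if `text` is within the edit budget of `pattern`."""
    if max_edits < 0:
        # Default budget: rounded 30% of the pattern length, but at least 2,
        # because one transposition costs two plain Levenshtein edits.
        max_edits = max(2, round(0.3 * len(pattern)))
    return levenshtein(text, pattern) <= max_edits
```

With these defaults, `fuzzy_compare_sketch("teh", "the")` accepts the transposition, while an explicit `max_edits=1` rejects it.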
* add test for running evaluate on an nlp pipeline with two distinct textcat components
* cleanup
* merge dicts instead of overwrite
* don't add more labels to the given set
* Revert "merge dicts instead of overwrite"
This reverts commit 89bee0ed77.
* Switch tests to separate scorer keys rather than merged dicts
* Revert unrelated edits
* Switch textcat scorers to v2
* formatting
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Remove experimental multi-task components
These are incomplete implementations and are not usable in their current state.
* Remove orphaned error message
* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests
* Only use 20.04 for 3.6
* Revert "Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)"
This reverts commit 77c0fd7b17.
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Strings in replacement nodes were not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.
This change fixes both the bug in the initialization and the 'shadowing'
of the bug in the test.
* Check textcat values for validity
* Fix error numbers
* Clean up vals reference
* Check category value validity through training
The _validate_categories is called in update, which for multilabel is
inherited from the single label component.
* Formatting
* Update textcat scorer threshold behavior
For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.
* Rename to test_textcat_multilabel_threshold
* Remove all uses of threshold for multi_label=False
* Update Scorer.score_cats API docs
* Add tests for score_cats with thresholds
* Update textcat API docs
* Fix types
* Convert threshold back to float
* Fix threshold type in docstring
* Improve formatting in Scorer API docs
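The threshold behavior for exclusive versus multi-label classification can be illustrated with a small sketch. This is not the `Scorer.score_cats` implementation: `predicted_labels` is a hypothetical helper, and the `>=` comparison and the default threshold of 0.5 are assumptions.

```python
from typing import Dict, Set


def predicted_labels(
    cats: Dict[str, float], multi_label: bool, threshold: float = 0.5
) -> Set[str]:
    """Pick the predicted labels from a doc's category scores."""
    if multi_label:
        # Multi-label: every label at or above the threshold is predicted.
        return {label for label, score in cats.items() if score >= threshold}
    # Exclusive classes: exactly one label per doc, the highest-scoring one,
    # regardless of how high or low its score is.
    return {max(cats, key=cats.get)}


cats = {"POSITIVE": 0.3, "NEGATIVE": 0.2}
exclusive = predicted_labels(cats, multi_label=False)
multi = predicted_labels(cats, multi_label=True)
```

Here the exclusive case still predicts `POSITIVE` even though its score is below 0.5, while the multi-label case predicts nothing.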
* Replace EntityRuler with SpanRuler implementation
Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.
Main changes:
* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).
* Add migration guide to EntityRuler API docs
* docs update
* Minor edit
Co-authored-by: svlandeg <svlandeg@github.com>