* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Refactor _scores2guesses
* Handle arrays on GPU
* Convert argmax result to raw integer
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use NumpyOps() to copy data to CPU
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Changes based on review comments
* Use different _scores2guesses depending on tree_k
* Add tests for corner cases
* Add empty line for consistency
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Fix batching regression
Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handed.
In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.
This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.
* Update batcher docs
* Docstring fixes
* Make minibatch take iterables again as well
* Bump thinc requirement to 9.0.0.dev2
* Use type declaration
* Convert another comment into a proper type declaration
* Try to fix doc.copy
* Set dev version
* Make vocab always own lexemes
* Change version
* Add SpanGroups.copy method
* Fix set_annotations during Parser.update
* Fix dict proxy copy
* Upd version
* Fix copying SpanGroups
* Fix set_annotations in parser.update
* Fix parser set_annotations during update
* Revert "Fix parser set_annotations during update"
This reverts commit eb138c89ed.
* Revert "Fix set_annotations in parser.update"
This reverts commit c6df0eafd0.
* Fix set_annotations during parser update
* Inc version
* Handle final states in get_oracle_sequence
* Inc version
* Try to fix parser training
* Inc version
* Fix
* Inc version
* Fix parser oracle
* Inc version
* Inc version
* Fix transition has_gold
* Inc version
* Try to use real histories, not oracle
* Inc version
* Upd parser
* Inc version
* WIP on rewrite parser
* WIP refactor parser
* New progress on parser model refactor
* Prepare to remove parser_model.pyx
* Convert parser from cdef class
* Delete spacy.ml.parser_model
* Delete _precomputable_affine module
* Wire up tb_framework to new parser model
* Wire up parser model
* Uncython ner.pyx and dep_parser.pyx
* Uncython
* Work on parser model
* Support unseen_classes in parser model
* Support unseen classes in parser
* Cleaner handling of unseen classes
* Work through tests
* Keep working through errors
* Keep working through errors
* Work on parser. 15 tests failing
* Xfail beam stuff. 9 failures
* More xfail. 7 failures
* Xfail. 6 failures
* cleanup
* formatting
* fixes
* pass nO through
* Fix empty doc in update
* Hackishly fix resizing. 3 failures
* Fix redundant test. 2 failures
* Add reference version
* black formatting
* Get tests passing with reference implementation
* Fix missing prints
* Add missing file
* Improve indexing on reference implementation
* Get non-reference forward func working
* Start rigging beam back up
* removing redundant tests, cf #8106
* black formatting
* temporarily xfailing issue 4314
* make flake8 happy again
* mypy fixes
* ensure labels are added upon predict
* cleanup remnants from merge conflicts
* Improve unseen label masking
Two changes to speed up masking by ~10%:
- Use a bool array rather than an array of float32.
- Let the mask indicate whether a label was seen, rather than
unseen. The mask is most frequently used to index scores for
seen labels. However, since the mask marked unseen labels,
this required computing an intermittent flipped mask.
* Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.
Gives a ~15% speed improvement when using batch_by_sequence with size
32.
* Temporarily disable ner and rehearse tests
Until rehearse is implemented again in the refactored parser.
* Fix loss serialization issue (#10600)
* Fix loss serialization issue
Serialization of a model fails with:
TypeError: array(738.3855, dtype=float32) is not JSON serializable
Fix this using float conversion.
* Disable CI steps that require spacy.TransitionBasedParser.v2
After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.
* Add back support for beam parsing to the refactored parser (#10633)
* Add back support for beam parsing
Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.
* Use kwarg for beam width
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Avoid implicit default for beam_width and beam_density
* Parser.{beam,greedy}_parse: ensure labels are added
* Remove 'deprecated' comments
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parser `StateC` optimizations (#10746)
* `StateC`: Optimizations
Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead
* Fix typo
* Replace `set::count` with `set::find`
* Add exception attribute to c'tor
* Remove unused import
* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`
* Merge `cdef` variable declarations and assignments
* Vectorize `example.get_aligned_parses` (#10789)
* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import
* Convert aligned array to lists before returning
* Revert import renaming
* Elide slice arguments when selecting the entire range
* Tagger/morphologizer alignment performance optimizations (#10798)
* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`
* `AlignmentArray`: Use native list as staging buffer for offset calculation
* `example`: Vectorize `get_aligned`
* Hoist inner functions out of `get_aligned`
* Replace inline `if..else` clause in assignment statement
* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays
* `example`: Replace array unique value check with `groupby`
* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`
* `util`: Update `all_equal` docstring
* Explicitly use `int32_t*`
* Restore C CPU inference in the refactored parser (#10747)
* Bring back the C parsing model
The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.
* Use C sgemm provided by the Ops implementation
* Make tb_framework module Cython, merge in C forward implementation
* TransitionModel: raise in backprop returned from forward_cpu
* Re-enable greedy parse test
* Return transition scores when forward_cpu is used
* Apply suggestions from code review
Import `Model` from `thinc.api`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use relative imports in tb_framework
* Don't assume a default for beam_width
* We don't have a direct dependency on BLIS anymore
* Rename forwards to _forward_{fallback,greedy_cpu}
* Require thinc >=8.1.0,<8.2.0
* tb_framework: clean up imports
* Fix return type of _get_seen_mask
* Move up _forward_greedy_cpu
* Style fixes.
* Lower thinc lowerbound to 8.1.0.dev0
* Formatting fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function
Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:
1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
the student.
In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:
1. Get the predictions of all parsing steps from the student, along
with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.
To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.
* tb_framework: wrap upper_W and upper_b in Linear
Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.
* Add test for TransitionSystem.apply_actions
* Better FIXME marker
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fixes from Madeesh
* Apply suggestions from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove useless assignment
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch
* tb_framework: prefix many auxiliary functions with underscore
To clearly state the intent that they are private.
* Rename `lower` to `hidden`, `upper` to `output`
* Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
* Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
* ner: re-enable sentence boundary checks
* Re-enable test that works now.
* test_ner: make loss test more strict again
* Remove commented line
* Re-enable some more beam parser tests
* Remove unused _forward_reference function
* Update for CBlas changes in Thinc 8.1.0.dev2
Bump thinc dependency to 8.1.0.dev3.
* Remove references to spacy.TransitionBasedParser.{v1,v2}
Since they will not be offered starting with spaCy v4.
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`
* dont use get_array_module (#11056) (#11293)
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)
* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`
* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`
* `mypy` fix
* Update comment
* `conftest`: Expose `register_cython_tests`
* Remove unused import
* Move `argmax` impls to new `_parser_utils` Cython module (#11410)
* Parser does not have to be a cdef class anymore
This also fixes validation of the initialization schema.
* Add back spacy.TransitionBasedParser.v2
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Remove module from setup.py that got added during the merge
* Bring back support for `update_with_oracle_cut_size` (#12086)
* Bring back support for `update_with_oracle_cut_size`
This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.
Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.
Training time 1000 steps on de_core_news_lg:
- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Apply suggestions from @shadeMe
* Use chained conditional
* Test with update_with_oracle_cut_size={0, 1, 5, 100}
And fix a git that occurs with a cut size of 1.
* Fix up some merge fall out
* Update parser distillation for the refactor
In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.
* Add back spacy.TransitionBasedParser.v1 references
- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Add `TrainablePipe.{distill,get_teacher_student_loss}`
This change adds two methods:
- `TrainablePipe::distill` which performs a training step of a
student pipe on a teacher pipe, giving a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
of a student relative to the teacher.
The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.
* Fix stray `Beam` import
* Fix incorrect import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TrainablePipe.distill: use `Iterable[Example]`
* Add Pipe.is_distillable method
* Add `validate_distillation_examples`
This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.
* Update distill documentation
* Add distill documentation for all pipes that support distillation
* Fix incorrect identifier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add comment to explain `is_distillable`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* enable fuzzy matching
* add fuzzy param to EntityMatcher
* include rapidfuzz_capi
not yet used
* fix type
* add FUZZY predicate
* add fuzzy attribute list
* fix type properly
* tidying
* remove unnecessary dependency
* handle fuzzy sets
* simplify fuzzy sets
* case fix
* switch to FUZZYn predicates
use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.
* revert changes added for fuzzy param
* switch to polyleven
(Python package)
* enable fuzzy matching
* add fuzzy param to EntityMatcher
* include rapidfuzz_capi
not yet used
* fix type
* add FUZZY predicate
* add fuzzy attribute list
* fix type properly
* tidying
* remove unnecessary dependency
* handle fuzzy sets
* simplify fuzzy sets
* case fix
* switch to FUZZYn predicates
use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.
* revert changes added for fuzzy param
* switch to polyleven
(Python package)
* fuzzy match only on oov tokens
* remove polyleven
* exclude whitespace tokens
* don't allow more edits than characters
* fix min distance
* reinstate FUZZY operator
with length-based distance function
* handle sets inside regex operator
* remove is_oov check
* attempt build fix
no mypy failure locally
* re-attempt build fix
* don't overwrite fuzzy param value
* move fuzzy_match
to its own Python module to allow patching
* move fuzzy_match back inside Matcher
simplify logic and add tests
* Format tests
* Parametrize fuzzyn tests
* Parametrize and merge fuzzy+set tests
* Format
* Move fuzzy_match to a standalone method
* Change regex kwarg type to bool
* Add types for fuzzy_match
- Refactor variable names
- Add test for symmetrical behavior
* Parametrize fuzzyn+set tests
* Minor refactoring for fuzz/fuzzy
* Make fuzzy_match a Matcher kwarg
* Update type for _default_fuzzy_match
* don't overwrite function param
* Rename to fuzzy_compare
* Update fuzzy_compare default argument declarations
* allow fuzzy_compare override from EntityRuler
* define new Matcher keyword arg
* fix type definition
* Implement fuzzy_compare config option for EntityRuler and SpanRuler
* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects
* Use simpler fuzzy_compare algorithm
* Update types
* Increase minimum to 2 in fuzzy_compare to allow one transposition
* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX
* Add FUZZY6..9
* Add initial docs
* Increase default fuzzy to rounded 30% of pattern length
* Update docs for fuzzy_compare in components
* Update EntityRuler and SpanRuler API docs
* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare
To having naming similar to `phrase_matcher_attr`, rename
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare. Organize next to `phrase_matcher_attr` in docs.
* Fix schema aliases
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix typo
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add FUZZY6-9 operators and update tests
* Parameterize test over greedy
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type for fuzzy_compare to remove Optional
* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein
* Update docs following levenshtein_compare renaming
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add test for running evaluate on an nlp pipeline with two distinct textcat components
* cleanup
* merge dicts instead of overwrite
* don't add more labels to the given set
* Revert "merge dicts instead of overwrite"
This reverts commit 89bee0ed77.
* Switch tests to separate scorer keys rather than merged dicts
* Revert unrelated edits
* Switch textcat scorers to v2
* formatting
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Strings in replacement nodes where not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.
This change fixes both this bug in the initialization as the 'shadowing'
of the bug in the test.
* Check textcat values for validity
* Fix error numbers
* Clean up vals reference
* Check category value validity through training
The _validate_categories is called in update, which for multilabel is
inherited from the single label component.
* Formatting
* Update textcat scorer threshold behavior
For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.
* Rename to test_textcat_multilabel_threshold
* Remove all uses of threshold for multi_label=False
* Update Scorer.score_cats API docs
* Add tests for score_cats with thresholds
* Update textcat API docs
* Fix types
* Convert threshold back to float
* Fix threshold type in docstring
* Improve formatting in Scorer API docs
* Handle docs with no entities
If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.
It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.
This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.
This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.
* Add a blank instance to default training data in tests
Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.
* Fix without modifying architecture
If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
* Replace EntityRuler with SpanRuler implementation
Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.
Main changes:
* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).
* Add migration guide to EntityRuler API docs
* docs update
* Minor edit
Co-authored-by: svlandeg <svlandeg@github.com>
* Remove imports marked as v2 leftovers
There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.
It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.
* Import fix_random_seed correctly
This seems to be the only place in spaCy that was using the old import.
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.
* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.
* Fix type issue.
* Move comment.
* Move comment.
* Issue UserWarning instead of printing wasabi message. Adjust test.
* Added pytest.warns(UserWarning) for expected warning to fix tests.
* Update warning message.
* Move type handling out of fetch_pipes_status().
* Add global variable for default value. Use id() to determine whether used values are default value.
* Fix default value for disable.
* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
* Store activations in Doc when `store_activations` is enabled
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.
As an example, this change modifies the `tagger` and `senter` pipes to
add an `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
* Change type of `store_activations` to `Union[bool, List[str]]`
When the value is:
- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored
* Formatting fixes in Tagger
* Support store_activations in spancat and morphologizer
* Make Doc.activations type visible to MyPy
* textcat/textcat_multilabel: add store_activations option
* trainable_lemmatizer/entity_linker: add store_activations option
* parser/ner: do not currently support returning activations
* Extend tagger and senter tests
So that they, like the other tests, also check that we get no
activations if no activations were requested.
* Document `Doc.activations` and `store_activations` in the relevant pipes
* Start errors/warnings at higher numbers to avoid merge conflicts
Between the master and v4 branches.
* Add `store_activations` to docstrings.
* Replace store_activations setter by set_store_activations method
Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.
* Use dict comprehension suggested by @svlandeg
* Revert "Use dict comprehension suggested by @svlandeg"
This reverts commit 6e7b958f70.
* EntityLinker: add type annotations to _add_activations
* _store_activations: make kwarg-only, remove doc_scores_lens arg
* set_annotations: add type annotations
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TextCat.predict: return dict
* Make the `TrainablePipe.store_activations` property a bool
This means that we can also bring back `store_activations` setter.
* Remove `TrainablePipe.activations`
We do not need to enumerate the activations anymore since `store_activations` is
`bool`.
* Add type annotations for activations in predict/set_annotations
* Rename `TrainablePipe.store_activations` to `save_activations`
* Error E1400 is not used anymore
This error was used when activations were still `Union[bool, List[str]]`.
* Change wording in API docs after store -> save change
* docs: tag (save_)activations as new in spaCy 4.0
* Fix copied line in morphologizer activations test
* Don't train in any test_save_activations test
* Rename activations
- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
"guesses" -> "tree_ids".
* Remove unused W400 warning.
This warning was used when we still allowed the user to specify
which activations to save.
* Formatting fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Replace "kb_ids" by a constant
* spancat: replace a cast by an assertion
* Fix EOF spacing
* Fix comments in test_save_activations tests
* Do not set RNG seed in activation saving tests
* Revert "spancat: replace a cast by an assertion"
This reverts commit 0bd5730d16.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* replicate bug with tok2vec in annotating components
* add overfitting test with a frozen tok2vec
* remove broadcast from predict and check doc.tensor instead
* remove broadcast
* proper error
* slight rephrase of documentation
* adding unit test for spacy.load with disable/exclude string arg
* allow pure strings in from_config
* update docs
* upstream type adjustements
* docs update
* make docstring more consistent
* Update spacy/language.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* two more cleanups
* fix type in internal method
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: foundation for include, enable arguments.
* Enable flag on spacy.load: fixed tests.
* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.
* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config.
* Enable flag on spacy.load: added support for fields not in pipeline.
* Enable flag on spacy.load: removed serialization fields from supported fields.
* Enable flag on spacy.load: removed 'enable' from config again.
* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.
* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.
* Enable flag on spacy.load: comments w.r.t. resolution workarounds.
* Enable flag on spacy.load: remove include fields. Update website docs.
* Enable flag on spacy.load: updates w.r.t. changes in master.
* Implement Doc.from_json(): update docstrings.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): remove newline.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): change error message for E1038.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.
* Enable flag on spacy.load: changed exmples for enable flag.
* Remove newline.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix docstring for Language._resolve_component_status().
* Rename E1038 to E1042.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add failing test
* Partial fix for issue
This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...
* Short circuit on empties
* Remove spurious check
The check here isn't needed now the the short circuit is fixed.
* Update spacy/tests/pipeline/test_entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use "eg", not "example"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.
* Pipoe name override in config: added pytest UserWarning.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
* Add edit tree lemmatizer
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Hide edit tree lemmatizer labels
* Use relative imports
* Switch to single quotes in error message
* Type annotation fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reformat edit_tree_lemmatizer with black
* EditTreeLemmatizer.predict: take Iterable
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Validate edit trees during deserialization
This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.
* Move edit_trees to _edit_tree_internals
* Fix invalid edit tree format error message
* edit_tree_lemmatizer: remove outdated TODO comment
* Rename factory name to trainable_lemmatizer
* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14
* Switch to Tagger.v2
* Add documentation for EditTreeLemmatizer
* docs: Fix 3.2 -> 3.3 somewhere
* trainable_lemmatizer documentation fixes
* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Add save_candidates attribute
* Change spancat api
* Add unit test
* reimplement method to produce a list of doc
* Add method to docs
* Add new version tag
* Add intended use to docstring
* prettier formatting
* Fix get_matching_ents
Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.
* Fix test
Test was indented inside another test and so doesn't seem to have been
running properly.
* Partial fix of entity linker batching
* Add import
* Better name
* Add `use_gold_ents` option, docs
* Change to v2, create stub v1, update docs etc.
* Fix error type
Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?
* Make mypy happy
* Add hacky fix for init issue
* Add legacy pipeline entity linker
* Fix references to class name
* Add __init__.py for legacy
* Attempted fix for loss issue
* Remove placeholder V1
* formatting
* slightly more interesting train data
* Handle batches with no usable examples
This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.
* Remove todo about data verification
Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.
* Fix gradient calculation
The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.
However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.
This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.
This commit changes the loss to give a zero gradient for entities not in
the kb.
* add failing test for v1 EL legacy architecture
* Add nasty but simple working check for legacy arch
* Clarify why init hack works the way it does
* Clarify use_gold_ents use case
* Fix use gold ents related handling
* Add tests for no gold ents and fix other tests
* Use aligned ents function (not working)
This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
* Use proper matching ent check
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
* Move get_matching_ents to Example
* Use model attribute to check for legacy arch
* Rename flag
* bump spacy-legacy to lower 3.0.9
Co-authored-by: svlandeg <svlandeg@github.com>
* Fix Scorer.score_cats for missing labels
* Add test case for Scorer.score_cats missing labels
* semantic nitpick
* black formatting
* adjust test to give different results depending on multi_label setting
* fix loss function according to whether or not missing values are supported
* add note to docs
* small fixes
* make mypy happy
* Update spacy/pipeline/textcat.py
Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* added ruler coe
* added error for none existing pattern
* changed error to warning
* changed error to warning
* added basic tests
* fixed place
* added test files
* went back to error
* went back to pattern error
* minor change to docs
* changed style
* changed doc
* changed error slightly
* added remove to phrasem api
* error key already existed
* phrase matcher match code to api
* blacked tests
* moved comments before expr
* corrected error no
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* added error string
* added serialization test
* added more to if statements
* wrote file to tempdir
* added tempdir
* changed parameter a bit
* Update spacy/tests/pipeline/test_entity_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper
The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.
* 🔧 Allow variable type redefinition in limited contexts
These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:
```python
def process(items: List[str]) -> None:
# 'items' has type List[str]
items = [item.split() for item in items]
# 'items' now has type List[List[str]]
...
```
This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`
These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+
The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.
Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.
These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.
* 🏷 Improve type annotation for `Language.pipe`
These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.
Fixes#8772
* ➖ Don't install `typing-extensions` in Python 3.8+
After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉
These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confustion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefault in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`
`get_string_id` accepts either a string (in which case it returns its
ID) or an ID (in which case it immediately returns the ID). These
changes extend the type annotation of `get_string_id` to indicate that
it can accept either strings or IDs.
* 🏷 Handle override types in `combine_score_weights`
The `combine_score_weights` function allows users to pass an `overrides`
mapping to override data extracted from the `weights` argument. Since it
allows `Optional` dictionary values, the return value may also include
`Optional` dictionary values.
These changes update the type annotations for `combine_score_weights` to
reflect this fact.
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`
These changes fix the redefinition of `wandb_logger` by giving a
separate name to each `WandbLogger` version. For
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3`
as `wandb_logger` for now.
* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6
Also remove an import that I forgot to move to the top of the module 😅
* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* Add overwrite settings for more components
For pipeline components where it's relevant and not already implemented,
add an explicit `overwrite` setting that controls whether
`set_annotations` overwrites existing annotation.
For the `morphologizer`, add an additional setting `extend`, which
controls whether the existing features are preserved.
* +overwrite, +extend: overwrite values of existing features, add any new
features
* +overwrite, -extend: overwrite completely, removing any existing
features
* -overwrite, +extend: keep values of existing features, add any new
features
* -overwrite, -extend: do not modify the existing value if set
In all cases an unset value will be set by `set_annotations`.
Preserve current overwrite defaults:
* True: morphologizer, entity linker
* False: tagger, sentencizer, senter
* Add backwards compat overwrite settings
* Put empty line back
Removed by accident in last commit
* Set backwards-compatible defaults in __init__
Because the `TrainablePipe` serialization methods update `cfg`, there's
no straightforward way to detect whether models serialized with a
previous version are missing the overwrite settings.
It would be possible in the sentencizer due to its separate
serialization methods, however to keep the changes parallel, this also
sets the default in `__init__`.
* Remove traces
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* overfitting test on non-overlapping entities
* add failing overfitting test for overlapping entities
* failing test for list comprehension
* remove test that was put in separate PR
* bugfix
* cleanup
* test for error after Doc has been garbage collected
* warn about using a SpanGroup when the Doc has been garbage collected
* add warning to the docs
* rephrase slightly
* raise error instead of warning
* update
* move warning to doc property
* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add scores to output in spancat
This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.
* Add basic doc note
* Vectorize score calcs
* Add "annotation format" section
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up doc section
* Ran prettier on docs
* Get arrays off the gpu before iterating over them
* Remove int() calls
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Raise an error for textcat with <2 labels
Raise an error if initializing a `textcat` component without at least
two labels.
* Add similar note to docs
* Update positive_label description in API docs
* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Better defense
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix labels list
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/extract_spans.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default span_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.
* Use the (intentionally very short) default spans key `sc` in the
`SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
the key in the `getter` instead.
Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
* implement textcat resizing for TextCatCNN
* resizing textcat in-place
* simplify code
* ensure predictions for old textcat labels remain the same after resizing (WIP)
* fix for softmax
* store softmax as attr
* fix ensemble weight copy and cleanup
* restructure slightly
* adjust documentation, update tests and quickstart templates to use latest versions
* extend unit test slightly
* revert unnecessary edits
* fix typo
* ensemble architecture won't be resizable for now
* use resizable layer (WIP)
* revert using resizable layer
* resizable container while avoid shape inference trouble
* cleanup
* ensure model continues training after resizing
* use fill_b parameter
* use fill_defaults
* resize_layer callback
* format
* bump thinc to 8.0.4
* bump spacy-legacy to 3.0.6