* Add TextCatReduce.v1
This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.
This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer.
* Doc fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence
* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy
* Add back a test for TextCatCNN.v2
* Replace TextCatCNN in pipe configurations and templates
* Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor
* Add last reduction (`use_reduce_last`)
* Remove non-working TextCatCNN Netlify redirect
* Revert layer changes for the quickstart
* Revert one more quickstart change
* Remove unused import
* Fix docstring
* Fix setting name in error message
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update `TextCatBOW` to use the fixed `SparseLinear` layer
A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754
This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.
While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.
* Replace TexCatBOW `length_exponent` parameter by `length`
We now round up the length to the next power of two if it isn't
a power of two.
* Remove some tests for TextCatBOW.v2
* Fix missing import
* Update the "Missing factory" error message
This accounts for model installations that took place during the current Python session.
* Add a note about Jupyter notebooks
* Move error to `spacy.cli.download`
Add extra message for Jupyter sessions
* Add additional note for interactive sessions
* Remove note about `spacy-transformers` from error message
* `isort`
* Improve checks for colab (also helps displacy)
* Update warning messages
* Improve flow for multiple checks
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Support registered vectors
* Format
* Auto-fill [nlp] on load from config and from bytes/disk
* Only auto-fill [nlp]
* Undo all changes to Language.from_disk
* Expand BaseVectors
These methods are needed in various places for training and vector
similarity.
* isort
* More linting
* Only fill [nlp.vectors]
* Update spacy/vocab.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to test related to auto-filling [nlp]
* Add vectors registry
* Rephrase error about vocab methods for vectors
* Switch to dummy implementation for BaseVectors.to_ops
* Add initial draft of docs
* Remove example from BaseVectors docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/basevectors.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type and lint bpemb example
* Update website/docs/api/basevectors.mdx
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Setting up weasel branch (#12456)
* remove project-specific functionality
* remove project-specific tests
* remove project-specific schemas
* remove project-specific information in about
* remove project-specific functions in util.py
* remove project-specific error strings
* remove project-specific CLI commands
* black formatting
* restore some functions that are used beyond projects
* remove project imports
* remove imports
* remove remote_storage tests
* remove one more project unit test
* update for PR 12394
* remove get_hash and get_checksum
* remove upload_ and download_file methods
* remove ensure_pathy
* revert clumsy fingers
* reinstate E970
* feat: use weasel as spacy project command (#12473)
* feat: use weasel as spacy project command
* build: use constrained requirement for weasel
* feat: add weasel to the library requirements
* build: update weasel to new version
* build: use specific weasel tag
* build: use weasel-0.1.0rc1 from PyPI
* fix: remove weasel from requirements.txt
* fix: requirements.txt and setup.cfg need to reflect each other
* feat: remove legacy spacy project code
* bump version
* further merge fixes
* isort
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback
* Update developer docs
* `isort` fixes
* Add error message to assertion
* Add clarification to dev docs
* Replace assertion with exception
* Doc fixes
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
* span finder integrated into spacy from experimental
* black
* isort
* black
* default spankey constant
* black
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* rename
* rename
* max_length and min_length as Optional[int] and strict checking
* black
* mypy fix for integer type infinity
* revert line order
* implement all comparison operators for inf int
* avoid two for loops over all docs by not precomputing
* interleave thresholding with span creation
* black
* revert to not interleaving (relized its faster)
* black
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update dosctring
* enforce that the gold and predicted documents have the same text
* new error for ensuring reference and predicted texts are the same
* remove todo
* adjust test
* black
* handle misaligned tokenization
* return correct variable
* failing overfit test
* only use a single spans_key like in spancat
* black
* remove debug lines
* typo
* remove comment
* remove near duplicate reduntant method
* use the 'spans_key' variable name everywhere
* Update spacy/pipeline/span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* flaky test fix suggestion, hand set bias terms
* only test suggester and test result exhaustively
* make it clear that the span_finder_suggester is more general (not specific to span_finder)
* Update spacy/tests/pipeline/test_span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Apply suggestions from code review
* remove question comment
* move preset_spans_suggester test to spancat tests
* Add docs and unify default configs for spancat and span finder
* Add `allow_overlap=True` to span finder scorer
* Fix offset bug in set_annotations
* Ignore labels in span finder scorer
* Format
* Add span_finder to quickstart template
* Move settings to self.cfg, store min/max unset as None
* Remove debugging
* Update docstrings and docs
* Update spacy/pipeline/span_finder.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix imports
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Require that all SpanGroup spans are from the current doc
The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.
Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.
* Undo change in test_beam_overfitting_IO
* [wip] Update
* [wip] Update
* Add initial port
* [wip] Update
* Fix all imports
* Add spancat_exclusive to pipeline
* [WIP] Update
* [ci skip] Add breakpoint for debugging
* Use spacy.SpanCategorizer.v1 as default archi
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* [ci skip] Small updates
* Use Softmax v2 directly from thinc
* Cache the label map
* Fix mypy errors
However, I ignored line 370 because it opened up a bunch of type errors
that might be trickier to solve and might lead to a more complicated
codebase.
* avoid multiplication with 1.0
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update component versions to v2
* Add scorer to docstring
* Add _n_labels property to SpanCategorizer
Instead of using len(self.labels) in initialize() I am using a private
property self._n_labels. This achieves implementation parity and allows
me to delete the whole initialize() method for spancat_exclusive (since
it's now the same with spancat).
* Inherit from SpanCat instead of TrainablePipe
This commit changes the inheritance structure of Exclusive_Spancat,
now it's inheriting from SpanCategorizer than TrainablePipe. This
allows me to remove duplicate methods that are already present in
the parent function.
* Revert documentation link to spancat
* Fix init call for exclusive spancat
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Import Suggester from spancat
* Include zero_init.v1 for spancat
* Implement _allow_extra_label to use _n_labels
To ensure that spancat / spancat_exclusive cannot be resized after
initialization, I inherited the _allow_extra_label() method from
spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead
of len(self.labels) for checking.
I think that changing it locally is a better solution rather than
forcing each class that inherits TrainablePipe to use the self._n_labels
attribute.
Also note that I turned-off black formatting in this block of code
because it reads better without the overhang.
* Extend existing tests to spancat_exclusive
In this commit, I extended the existing tests for spancat to include
spancat_exclusive. I parametrized the test functions with 'name'
(similar var name with textcat and textcat_multilabel) for each
applicable test.
TODO: Add overfitting tests for spancat_exclusive
* Update documentation for spancat
* Turn on formatting for allow_extra_label
* Remove initializers in default config
* Use DEFAULT_EXCL_SPANCAT_MODEL
I also renamed spancat_exclusive_default_config into
spancat_excl_default_config because black does some not pretty
formatting changes.
* Update documentation
Update grammar and usage
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clarify docstring for Exclusive_SpanCategorizer
* Remove mypy ignore and typecast labels to list
* Fix documentation API
* Use a single variable for tests
* Update defaults for number of rows
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Put back initializers in spancat config
Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:
SystemError: <method '__getitem__' of 'dict' objects> returned a result
with an error set.
My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.
* Update spancat_exclusive docstring
* Remove init_W and init_B parameters
This commit is expected to fail until the new Thinc release.
* Require thinc>=8.1.6 for serializable Softmax defaults
* Handle zero suggestions to make tests pass
I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.
The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.
* Better approach for handling zero suggestions
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spancategorizer headers
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add default value in negative_weight in docs
* Add default value in allow_overlap in docs
* Update how spancat_exclusive is constructed
In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat
* Run prettier on spancategorizer.mdx
* Change exactly one -> at most one
* Add suggester documentation in Exclusive_SpanCategorizer
* Add suggester to spancat docstrings
* merge multilabel and singlelabel spancat
* rename spancat_exclusive to singlelable
* wire up different make_spangroups for single and multilabel
* black
* black
* add docstrings
* more docstring and fix negative_label
* don't rely on default arguments
* black
* remove spancat exclusive
* replace single_label with add_negative_label and adjust inference
* mypy
* logical bug in configuration check
* add spans.attrs[scores]
* single label make_spangroup test
* bugfix
* black
* tests for make_span_group with negative labels
* refactor make_span_group
* black
* Update spacy/tests/pipeline/test_spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* remove duplicate declaration
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* raise error instead of just print
* make label mapper private
* update docs
* run prettier
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* don't keep recomputing self._label_map for each span
* typo in docs
* Intervals to private and document 'name' param
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add Tag to new features
* replace tags
* revert
* revert
* revert
* revert
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* prettier
* Fix merge
* Update website/docs/api/spancategorizer.mdx
* remove references to 'single_label'
* remove old paragraph
* Add spancat_singlelabel to config template
* Format
* Extend init config tests
---------
Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up displacy port-related error messages, docs
There were some issues in the error messages and docs in #11948.
1. the error messages didn't specify the port argument to displacy.serve correctly
2. the docs didn't mark the auto select argument as new
This addresses those issues.
* Update website/docs/api/top-level.md
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Apply prettier
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* check port in use and add itself
* check port in use and add itself
* Auto switch to nearest available port.
* Use bind to check port instead of connect_ex.
* Reformat.
* Add auto_select_port argument.
* update docs for displacy.serve
* Update spacy/errors.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update website/docs/api/top-level.md
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/errors.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Add test using multiprocessing
* fix argument name
* Increase sleep times
Want to rule this out as a cause of test failure
* Don't terminate a process that isn't alive
* Refactor port finding logic
This moves all the port logic into its own util function, which can be
tested without having to background a server directly.
* Use with for the server
This ensures the server is closed correctly.
* Pass in the host when checking port availability
* Shorten argument name
* Update error codes following merge
* Add types for arguments, specify docstrings.
* Add typing for arguments with default value.
* Update docstring to match spaCy format.
* Update docstring to match spaCy format.
* Fix docs
Arg name changed from `auto_select_port` to just `auto_select`.
* Revert "Fix docs"
This reverts commit 356966fe84.
Co-authored-by: zhiiw <1302593554@qq.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Add `ConsoleLogger.v3`
This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps.
* Rename progress bar types
* Add defaults to docs
Minor fixes
* Move comment
* Minor punctuation fixes
* Explicitly check for `None` when validating progress bar type
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Check textcat values for validity
* Fix error numbers
* Clean up vals reference
* Check category value validity through training
The _validate_categories is called in update, which for multilabel is
inherited from the single label component.
* Formatting
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.
* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.
* Fix type issue.
* Move comment.
* Move comment.
* Issue UserWarning instead of printing wasabi message. Adjust test.
* Added pytest.warns(UserWarning) for expected warning to fix tests.
* Update warning message.
* Move type handling out of fetch_pipes_status().
* Add global variable for default value. Use id() to determine whether used values are default value.
* Fix default value for disable.
* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
* replicate bug with tok2vec in annotating components
* add overfitting test with a frozen tok2vec
* remove broadcast from predict and check doc.tensor instead
* remove broadcast
* proper error
* slight rephrase of documentation
* Add token and span custom attributes to to_json()
* Change logic for to_json
* Add functionality to from_json
* Small adjustments
* Move token/span attributes to new dict key
* Fix test
* Fix the same test but much better
* Add backwards compatibility tests and adjust logic
* Add test to check if attributes not set in underscore are not saved in the json
* Add tests for json compatibility
* Adjust test names
* Fix tests and clean up code
* Fix assert json tests
* small adjustment
* adjust naming and code readability
* Adjust naming, added more tests and changed logic
* Fix typo
* Adjust errors, naming, and small test optimization
* Fix byte tests
* Fix bytes tests
* Change naming and json structure
* update schema
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update schema for underscore attributes
* Adjust underscore schema
* adjust schema tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* `TrainablePipe`: Add NVTX range decorator
* Annotate `TrainablePipe` subclasses with NVTX ranges
* Export function signature to allow introspection of args in tests
* Revert "Annotate `TrainablePipe` subclasses with NVTX ranges"
This reverts commit d8684f7372.
* Revert "Export function signature to allow introspection of args in tests"
This reverts commit f4405ca3ad.
* Revert "`TrainablePipe`: Add NVTX range decorator"
This reverts commit 26536eb6b8.
* Add `spacy.pipes_with_nvtx_range` pipeline callback
* Show warnings for all missing user-defined pipe functions that need to be annotated
Fix imports, typos
* Rename `DEFAULT_ANNOTATABLE_PIPE_METHODS` to `DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS`
Reorder import
* Walk model nodes directly whilst applying NVTX ranges
Ignore pipe method wrapper when applying range
* Enable flag on spacy.load: foundation for include, enable arguments.
* Enable flag on spacy.load: fixed tests.
* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.
* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config.
* Enable flag on spacy.load: added support for fields not in pipeline.
* Enable flag on spacy.load: removed serialization fields from supported fields.
* Enable flag on spacy.load: removed 'enable' from config again.
* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.
* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.
* Enable flag on spacy.load: comments w.r.t. resolution workarounds.
* Enable flag on spacy.load: remove include fields. Update website docs.
* Enable flag on spacy.load: updates w.r.t. changes in master.
* Implement Doc.from_json(): update docstrings.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): remove newline.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): change error message for E1038.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.
* Enable flag on spacy.load: changed exmples for enable flag.
* Remove newline.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix docstring for Language._resolve_component_status().
* Rename E1038 to E1042.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix: De/Serialize `SpanGroups` including the SpanGroup keys
This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).
Fixes#10685
* Maintain backwards compatibility for serialized `SpanGroups`
(serialized as: a list of `SpanGroup`s, or b'')
* Add tests for `SpanGroups` deserialization backwards-compatibility
* Move a `SpanGroups` de/serialization test (test_issue10685)
to tests/serialize/test_serialize_spangroups.py
* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s
* Minor refactor
* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests
* Update to preserve list serialization of SpanGroups
To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.
* Preserve object identity and current key overwrite
* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
format without SpanGroups keys
* Update inline comments
* Fix types
* Add type info for SpanGroup.copy
* Deserialize `SpanGroup`s as copies
when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.
* Update spacy/tokens/_dict_proxies.py
* Add more SpanGroups serialization tests
Test that serialized SpanGroups maintain their Span order
* small clarification on older spaCy version
* Update spacy/tests/serialize/test_serialize_span_groups.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Make changes to typing
* Correction
* Format with black
* Corrections based on review
* Bumped Thinc dependency version
* Bumped blis requirement
* Correction for older Python versions
* Update spacy/ml/models/textcat.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Corrections based on review feedback
* Readd deleted docstring line
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.
* Pipoe name override in config: added pytest UserWarning.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* Alignment: use a simplified ragged type for performance
This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.
* AlignmentArray: raise an error when using unsupported index
* AlignmentArray: move error messages to Errors
* AlignmentArray: remove simlified ... with simplifications
* AlignmentArray: fix typo that broke a[n:n] indexing