Distinguish between vectors that are 0 vs. missing vectors when warning
about missing vectors.
Update `Doc.has_vector` to match `Span.has_vector` and
`Token.has_vector` for cases where the vocab has vectors but none of the
tokens in the container have vectors.
* Handle Russian, Ukrainian and Bulgarian
* Corrections
* Correction
* Correction to comment
* Changes based on review
* Correction
* Reverted irrelevant change in punctuation.py
* Remove unnecessary group
* Reverted accidental change
* Try cloning repo from main & master
* fixup! Try cloning repo from main & master
* fixup! fixup! Try cloning repo from main & master
* refactor clone and check for repo:branch existence
* spacing fix
* make mypy happy
* type util function
* Update spacy/cli/project/clone.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Enable flag on spacy.load: foundation for include, enable arguments.
* Enable flag on spacy.load: fixed tests.
* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.
* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config.
* Enable flag on spacy.load: added support for fields not in pipeline.
* Enable flag on spacy.load: removed serialization fields from supported fields.
* Enable flag on spacy.load: removed 'enable' from config again.
* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.
* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.
* Enable flag on spacy.load: comments w.r.t. resolution workarounds.
* Enable flag on spacy.load: remove include fields. Update website docs.
* Enable flag on spacy.load: updates w.r.t. changes in master.
* Implement Doc.from_json(): update docstrings.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): remove newline.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): change error message for E1038.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.
* Enable flag on spacy.load: changed exmples for enable flag.
* Remove newline.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix docstring for Language._resolve_component_status().
* Rename E1038 to E1042.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* account for NER labels with a hyphen in the name
* cleanup
* fix docstring
* add return type to helper method
* shorter method and few more occurrences
* user helper method across repo
* fix circular import
* partial revert to avoid circular import
The float -1 was returned rather than the integer -1 as the row for
unknown keys. This doesn't introduce a realy bug, since such floats
cast (without issues) to int in the conversion to NumPy arrays. Still,
it's nice to to do the correct thing :).
The `forward` of `precomputable_biaffine` performs matrix multiplication
and then `vstack`s the result with padding. This creates a temporary
array used for the output of matrix concatenation.
This change avoids the temporary by pre-allocating an array that is
large enough for the output of matrix multiplication plus padding and
fills the array in-place.
This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on
M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS
does).
* detect cycle during projectivize
* not complete test to detect cycle in projectivize
* boolean to int type to propagate error
* use unordered_set instead of set
* moved error message to errors
* removed cycle from test case
* use find instead of count
* cycle check: only perform one lookup
* Return bool again from _has_head_as_ancestor
Communicate presence of cycles through an output argument.
* Switch to returning std::pair to encode presence of a cycle
The has_cycle pointer is too easy to misuse. Ideally, we would have a
sum type like Rust's `Result` here, but C++ is not there yet.
* _is_non_proj_arc: clarify what we are returning
* _has_head_as_ancestor: remove count
We are now explicitly checking for cycles, so the algorithm must always
terminate. Either we encounter the head, we find a root, or a cycle.
* _is_nonproj_arc: simplify condition
* Another refactor using C++ exceptions
* Remove unused error code
* Print graph with cycle on exception
* Include .hh files in source package
* Add FIXME comment
* cycle detection test
* find cycle when starting from problematic vertex
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* fix: De/Serialize `SpanGroups` including the SpanGroup keys
This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).
Fixes#10685
* Maintain backwards compatibility for serialized `SpanGroups`
(serialized as: a list of `SpanGroup`s, or b'')
* Add tests for `SpanGroups` deserialization backwards-compatibility
* Move a `SpanGroups` de/serialization test (test_issue10685)
to tests/serialize/test_serialize_spangroups.py
* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s
* Minor refactor
* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests
* Update to preserve list serialization of SpanGroups
To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.
* Preserve object identity and current key overwrite
* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
format without SpanGroups keys
* Update inline comments
* Fix types
* Add type info for SpanGroup.copy
* Deserialize `SpanGroup`s as copies
when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.
* Update spacy/tokens/_dict_proxies.py
* Add more SpanGroups serialization tests
Test that serialized SpanGroups maintain their Span order
* small clarification on older spaCy version
* Update spacy/tests/serialize/test_serialize_span_groups.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix TODO about typing
Fix was simple: just request an array2f.
* Add type ignore
Maxout has a more restrictive type than the residual layer expects (only
Floats2d vs any Floats).
* Various cleanup
This moves a lot of lines around but doesn't change any functionality.
Details:
1. use `continue` to reduce indentation
2. move sentence doc building inside conditional since it's otherwise
unused
3. reduces some temporary assignments
* Parser: use C saxpy/sgemm provided by the Ops implementation
This is a backport of https://github.com/explosion/spaCy/pull/10747
from the parser refactor branch. It eliminates the explicit calls
to BLIS, instead using the saxpy/sgemm provided by the Ops
implementation.
This allows us to use Accelerate in the parser on M1 Macs (with
an updated thinc-apple-ops).
Performance of the de_core_news_lg pipe:
BLIS 0.7.0, no thinc-apple-ops: 6385 WPS
BLIS 0.7.0, thinc-apple-ops: 36455 WPS
BLIS 0.9.0, no thinc-apple-ops: 19188 WPS
BLIS 0.9.0, thinc-apple-ops: 36682 WPS
This PR, thinc-apple-ops: 38726 WPS
Performance of the de_core_news_lg pipe (only tok2vec -> parser):
BLIS 0.7.0, no thinc-apple-ops: 13907 WPS
BLIS 0.7.0, thinc-apple-ops: 73172 WPS
BLIS 0.9.0, no thinc-apple-ops: 41576 WPS
BLIS 0.9.0, thinc-apple-ops: 72569 WPS
This PR, thinc-apple-ops: 87061 WPS
* Require thinc >=8.1.0,<8.2.0
* Lower thinc lowerbound to 8.1.0.dev0
* Use best CPU ops for CBLAS when the parser model is on the GPU
* Fix another unguarded cblas() call
* Fix: use ops as a shorthand for self.model.ops
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
This change removes `thinc.util.has_cupy` from the GPU presence check.
Currently `gpu_is_available` already implies `has_cupy`. We also want
to show this warning in the future when a machine has a non-CuPy GPU.
* Make changes to typing
* Correction
* Format with black
* Corrections based on review
* Bumped Thinc dependency version
* Bumped blis requirement
* Correction for older Python versions
* Update spacy/ml/models/textcat.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Corrections based on review feedback
* Readd deleted docstring line
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Add failing test
* Partial fix for issue
This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...
* Short circuit on empties
* Remove spurious check
The check here isn't needed now the the short circuit is fixed.
* Update spacy/tests/pipeline/test_entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use "eg", not "example"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.
* Remove unnecessary arguments and make fxs more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show all frequency distribution when -V
In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.
Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).
* Update logic on how total is computed
The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.
* Fix display when showing the threshold percentage
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add glossary entry for root
There was already one but it was lower case, maybe that should be
removed?
* remove lowercase root
On reflection, that was probably just a mistake.
* Add lowercase root back
It's harmless to leave it there.
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.
* Pipoe name override in config: added pytest UserWarning.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.
* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.
* Allow assets to be optional in spacy project: renamed --all to --extra.
* Allow assets to be optional in spacy project: included optional flag in project config test.
* Allow assets to be optional in spacy project: added documentation.
* Allow assets to be optional in spacy project: fixing deprecated --all reference.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: fixed project_assets() docstring.
* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs.
* Allow assets to be optional in spacy project: updated comment.
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring..
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test..
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
* Fix StringStore.__getitem__ return type depending on parameter types
Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`.
* Update spacy/strings.pyi
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>