* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update docs for displacy style kwargs
Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of SpaCy 3.3, so I added the "new" tag for that option only
* restored alpha ordering
* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.
* Remove unnecessary arguments and make fxs more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show all frequency distribution when -V
In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.
Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).
* Update logic on how total is computed
The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.
* Fix display when showing the threshold percentage
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.
* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.
* Allow assets to be optional in spacy project: renamed --all to --extra.
* Allow assets to be optional in spacy project: included optional flag in project config test.
* Allow assets to be optional in spacy project: added documentation.
* Allow assets to be optional in spacy project: fixing deprecated --all reference.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: fixed project_assets() docstring.
* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs.
* Allow assets to be optional in spacy project: updated comment.
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring..
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test..
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.
* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.
* Document different ways to create a pipeline: added explanation of blank pipeline.
* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
* Add initial design for diff command
For now, the diffing process looks like this:
- The default config is created based from some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must supply manually if it was optimized for acc/efficiency
and if pretraining was involved.
* Make diff command structure similar to siblings
* Include gpu as a user option for CLI
* Make variables more explicit
* Fix type declaration for optimize enum
* Improve docstrings for diff CLI
* Add debug-diff to website API docs
* Switch position of configs so that user config is modded
* Add markdown flag for debug diff
This commit adds a --markdown (--md) flag that allows easier
copy-pasting to Github issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).
For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20
* Bump version of wasabi to 0.9.1
So that we can use the add_symbols parameter.
* Apply suggestions from code review
Co-authored-by: Ines Montani <ines@ines.io>
* Update docs based on code review suggestions
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Change command name from diff -> diff-config
* Clarify when options are relevant or not
* Rerun prettier on cli.md
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* Add vector deduplication
* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples
* Always unique with numpy
* Fix types
* Add edit tree lemmatizer
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Hide edit tree lemmatizer labels
* Use relative imports
* Switch to single quotes in error message
* Type annotation fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reformat edit_tree_lemmatizer with black
* EditTreeLemmatizer.predict: take Iterable
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Validate edit trees during deserialization
This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.
* Move edit_trees to _edit_tree_internals
* Fix invalid edit tree format error message
* edit_tree_lemmatizer: remove outdated TODO comment
* Rename factory name to trainable_lemmatizer
* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14
* Switch to Tagger.v2
* Add documentation for EditTreeLemmatizer
* docs: Fix 3.2 -> 3.3 somewhere
* trainable_lemmatizer documentation fixes
* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add tokenizer option to allow Matcher handling for all rules
Add tokenizer option `with_faster_rules_heuristics` that determines
whether the special cases applied by the internal `Matcher` are filtered
by whether they contain affixes or space. If `True` (default), the rules
are filtered to prioritize speed over rare edge cases. If `False`, all
rules are included in the final `Matcher`-based pass over the doc.
* Reset all caches when reloading special cases
* Revert "Reset all caches when reloading special cases"
This reverts commit 4ef6bd171d.
* Initialize max_length properly
* Add new tag to API docs
* Rename to faster heuristics
* Fix docstring for EntityRenderer
* Add warning in displacy if doc.spans are empty
* Implement parse_spans converter
One notable change here is that the default spans_key is sc, and
it's set by the user through the options.
* Implement SpanRenderer
Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates. The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.
* Fix mypy issues on typing
* Add tests for displacy spans support
* Update colors from RGB to hex
Co-authored-by: Ines Montani <ines@ines.io>
* Remove unnecessary CSS properties
* Add documentation for website
* Remove unnecesasry scripts
* Update wording on the documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Put typing dependency on top of file
* Put back z-index so that spans overlap properly
* Make warning more explicit for spans_key
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Add save_candidates attribute
* Change spancat api
* Add unit test
* reimplement method to produce a list of doc
* Add method to docs
* Add new version tag
* Add intended use to docstring
* prettier formatting
* Partial fix of entity linker batching
* Add import
* Better name
* Add `use_gold_ents` option, docs
* Change to v2, create stub v1, update docs etc.
* Fix error type
Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?
* Make mypy happy
* Add hacky fix for init issue
* Add legacy pipeline entity linker
* Fix references to class name
* Add __init__.py for legacy
* Attempted fix for loss issue
* Remove placeholder V1
* formatting
* slightly more interesting train data
* Handle batches with no usable examples
This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.
* Remove todo about data verification
Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.
* Fix gradient calculation
The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.
However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.
This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.
This commit changes the loss to give a zero gradient for entities not in
the kb.
* add failing test for v1 EL legacy architecture
* Add nasty but simple working check for legacy arch
* Clarify why init hack works the way it does
* Clarify use_gold_ents use case
* Fix use gold ents related handling
* Add tests for no gold ents and fix other tests
* Use aligned ents function (not working)
This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
* Use proper matching ent check
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
* Move get_matching_ents to Example
* Use model attribute to check for legacy arch
* Rename flag
* bump spacy-legacy to lower 3.0.9
Co-authored-by: svlandeg <svlandeg@github.com>
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Clarify Span.ents documentation
Ref: #10135
Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.
* Reword docstrings
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update API docs in the website
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix infix as prefix in Tokenizer.explain
Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:
* skip infix matches that are prefixes in the current substring
* Update tokenizer pseudocode in docs