* Add failing test
* Partial fix for issue
This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...
* Short circuit on empties
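A minimal sketch of the short circuit, assuming the comparison works on array shapes. `shapes_compatible` is a hypothetical helper for illustration, not the code in the entity linker:

```python
from typing import Tuple


def shapes_compatible(shape_a: Tuple[int, ...], shape_b: Tuple[int, ...]) -> bool:
    """Compare two batch shapes, treating two empty batches as equal."""
    # Short circuit: if both batches have zero rows, there is nothing to
    # compare, so trailing dimensions such as (0, 3) vs (0,) are ignored.
    if shape_a[0] == 0 and shape_b[0] == 0:
        return True
    # Otherwise the shapes have to match exactly.
    return shape_a == shape_b
```

With this in place, (0, 3) and (0,) count as matching, while (2, 3) and (2,) still do not.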
* Remove spurious check
The check here isn't needed now that the short circuit is fixed.
* Update spacy/tests/pipeline/test_entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use "eg", not "example"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.
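A rough sketch of the distinctiveness computation, assuming it is the KL divergence between the token distribution inside the spans (or their boundaries) and the distribution over the whole corpus. The function names and the toy data are illustrative assumptions, not the code in debug_data.py:

```python
import math
from collections import Counter
from typing import Dict, Iterable


def distribution(tokens: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q); assumes every token with mass in p also has mass in q."""
    return sum(p_x * math.log(p_x / q[x]) for x, p_x in p.items() if p_x > 0)


# Toy data: tokens of the whole corpus and tokens that occur inside spans.
corpus_tokens = ["the", "cat", "sat", "on", "the", "mat"]
span_tokens = ["cat", "mat"]

p_corpus = distribution(corpus_tokens)
p_spans = distribution(span_tokens)
span_distinctiveness = kl_divergence(p_spans, p_corpus)
```

A higher value means the tokens inside the spans differ more from the corpus as a whole; the same two functions would apply to the boundary tokens.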
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.
* Remove unnecessary arguments and make functions more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
Removed '' to indicate that it's not a string, then
sorted the n-grams by their length, not by their frequency.
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show the full frequency distribution when -V
In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.
Another notable change is that instead of showing percentages as
integers, I showed them as floats (at most two decimal places). I did this
because it looks weird when it displays (0%).
* Update logic on how total is computed
The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.
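The thresholding described above can be sketched roughly as follows. The function name is hypothetical, and the sketch assumes the caller passes span lengths in the order they should be accumulated:

```python
from typing import Dict, Tuple


def filter_by_cumulative_percentage(
    freqs: Dict[int, float], threshold: float = 90.0
) -> Tuple[Dict[int, float], float]:
    """Keep span lengths until their percentages add up to >= threshold."""
    kept: Dict[int, float] = {}
    total = 0.0
    for span_length, percentage in freqs.items():
        # Stop once the running total has reached the threshold.
        if total >= threshold:
            break
        kept[span_length] = percentage
        total += percentage
    return kept, total


kept, total = filter_by_cumulative_percentage({1: 50.0, 2: 30.0, 3: 15.0, 4: 5.0})
```

Because the running total usually overshoots the threshold, the report says "At least" rather than claiming exactly 90%.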
* Fix display when showing the threshold percentage
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add glossary entry for root
There was already one but it was lower case, maybe that should be
removed?
* remove lowercase root
On reflection, that was probably just a mistake.
* Add lowercase root back
It's harmless to leave it there.
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.
* Pipe name override in config: added pytest UserWarning.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.
* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.
* Allow assets to be optional in spacy project: renamed --all to --extra.
* Allow assets to be optional in spacy project: included optional flag in project config test.
* Allow assets to be optional in spacy project: added documentation.
* Allow assets to be optional in spacy project: fixing deprecated --all reference.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: fixed project_assets() docstring.
* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs.
* Allow assets to be optional in spacy project: updated comment.
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.
* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.
* Document different ways to create a pipeline: added explanation of blank pipeline.
* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
* Fix StringStore.__getitem__ return type depending on parameter types
Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`.
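For illustration, the `@overload` pattern looks roughly like this. `TinyStringStore` is a made-up stand-in that uses list indices rather than hashes; it is not the actual stub in spacy/strings.pyi:

```python
from typing import Dict, List, Union, overload


class TinyStringStore:
    """Toy two-way mapping between strings and integer IDs."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def add(self, string: str) -> int:
        # Assign the next free index the first time a string is seen.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    @overload
    def __getitem__(self, key: Union[str, bytes]) -> int: ...

    @overload
    def __getitem__(self, key: int) -> str: ...

    def __getitem__(self, key: Union[str, bytes, int]) -> Union[int, str]:
        # The overloads above tell the type checker which return type
        # goes with which argument type; this body handles all of them.
        if isinstance(key, int):
            return self._strings[key]
        if isinstance(key, bytes):
            key = key.decode("utf8")
        return self._ids[key]
```

A type checker can then infer `int` for `store["apple"]` and `str` for `store[0]` instead of a union of both.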
* Update spacy/strings.pyi
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
The list of stop words for Spanish contained many inadequate words, see:
https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100
Removed words:
- verb forms of 'trabajar' (work) and 'intentar' (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words that are too significant or too infrequent:
actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
pais, principalmente, raras
Added other stop words for completeness:
- Spanish one-letter words
- numbers up to twelve
Some reformatting to 79 columns.
When in doubt, the English and German lists were consulted as good
examples.
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`
This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code.
* `Matcher`: Inline `PatternStateC` accessors
* signing contributor agreement
* adding new content to the spaCy universe
* updating outdated code examples
* resolving issues for the PR
* resolve review for klayers
* remove contributor-agreement file from the PR
* Update code example of spaCySentiWS
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy-sentiws code example
Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
The returned match offsets were only adjusted for `as_spans`, not
generally. Because the `on_match` callbacks are always applied to the
doc, the `Matcher` matches on spans should consistently use the doc
offsets.
* Test for arc levels for identical arcs
Also moves the test so that it is in order with the other numbered tests.
* displaCy: filter identical arcs
Avoid increased levels due to identical arcs by first
filtering any identical arcs.
* Sort keys before filtering
Manually entered arcs with keys out of order previously became
different tuples and were therefore not filtered correctly.
Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>
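The arc filtering from the last two commits can be sketched like this, assuming each arc is a dict. The function name and the sample keys are illustrative, not taken from the displaCy source:

```python
from typing import Any, Dict, List


def filter_identical_arcs(arcs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop arcs that are identical to an earlier arc, keeping the first."""
    seen = set()
    unique = []
    for arc in arcs:
        # Sort the items by key so that two dicts with the same content
        # but a different key order produce the same tuple.
        key = tuple(sorted(arc.items()))
        if key not in seen:
            seen.add(key)
            unique.append(arc)
    return unique


arcs = [
    {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
    {"label": "nsubj", "dir": "left", "end": 1, "start": 0},  # same arc, keys reordered
    {"start": 1, "end": 2, "label": "dobj", "dir": "right"},
]
unique_arcs = filter_identical_arcs(arcs)
```

Without the sort, the first two arcs would become different tuples and both would survive the filter.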