The list of stop words for Spanish contained many inadequate words, see:
https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100
Removed words:
- verb forms of 'trabajar' (work) and 'intentar' (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words that are either too significant or too infrequent:
actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
pais, principalmente, raras
Added other stop words for completeness:
- Spanish one-letter words
- numbers up to twelve
Some reformatting to 79 columns.
When in doubt, the English and German lists have been consulted as good
examples.
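
The revision above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of the Spanish list: the starting words are a tiny made-up subset, and the exact entries chosen for the one-letter words and the numbers are assumptions based on the description.

```python
# Illustrative subset only; the real Spanish stop list is much longer.
STOP_WORDS = set(
    """
actualmente cosas de el empleo intentar la paìs que trabajar trabajo
""".split()
)

# Words of the kinds named in the change description: verb forms of
# 'trabajar'/'intentar', employment-related words, misspellings, and
# words that are too significant or too infrequent.
REMOVED = {
    "actualmente", "cosas", "empleo", "intentar", "paìs", "trabajar", "trabajo",
}

# Assumed additions: Spanish one-letter words and the numbers up to twelve.
ONE_LETTER = set("a e o u y".split())
NUMBERS = set(
    "uno dos tres cuatro cinco seis siete ocho nueve diez once doce".split()
)


def revise(stop_words):
    """Return a new set with the removals and additions applied."""
    return (set(stop_words) - REMOVED) | ONE_LETTER | NUMBERS


REVISED = revise(STOP_WORDS)
```

Keeping removals and additions as separate sets makes it easy to review each group against the linked issue.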
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`
This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code.
* `Matcher`: Inline `PatternStateC` accessors
* signing contributor agreement
* adding new content to the spaCy universe
* updating outdated example code
* resolving issues for the PR
* resolve review for klayers
* remove contributor-agreement file from the PR
* Update code example of spaCySentiWS
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy-sentiws code example
Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Test for arc levels for identical arcs
Also moves the test in order with the other numbered tests.
* displaCy: filter identical arcs
Avoid increased levels due to identical arcs by first
filtering any identical arcs.
* Sort keys before filtering
Manual entry with keys out of order would previously become
different tuples and therefore not filtered correctly.
Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>
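
A minimal sketch of the filtering idea described above, assuming arcs are plain dicts with hashable values. The function name `filter_identical_arcs` is hypothetical and this is not necessarily how displaCy implements it.

```python
def filter_identical_arcs(arcs):
    """Drop arcs that are equal as dicts, regardless of key order."""
    seen = set()
    unique = []
    for arc in arcs:
        # Sorting the items first means two dicts with the same content
        # but different key order produce the same tuple.
        key = tuple(sorted(arc.items()))
        if key not in seen:
            seen.add(key)
            unique.append(arc)
    return unique


arcs = [
    {"start": 0, "end": 2, "label": "nsubj", "dir": "left"},
    # Same arc, entered manually with keys in a different order.
    {"label": "nsubj", "dir": "left", "end": 2, "start": 0},
    {"start": 2, "end": 3, "label": "dobj", "dir": "right"},
]
unique_arcs = filter_identical_arcs(arcs)
```

Without the `sorted` call, the first two arcs would become different tuples and both would survive the filter.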
This is necessary because one of the three old methods relied on scipy
for some complex problem solving. LEA is generally better for
evaluations.
The downside is that this means evaluations aren't comparable with many
papers, but canonical scoring can be supported using external eval
scripts or other methods.
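
For orientation, here is a small pure-Python sketch of LEA (the link-based entity-aware metric) as commonly defined: each entity is weighted by its size and scored by the fraction of its coreference links that are recovered. It needs no scipy. The function names are hypothetical, singletons are simply skipped, and this is not the scorer code from the library.

```python
def link_count(n):
    """Number of coreference links among n mentions."""
    return n * (n - 1) // 2


def lea_side(entities, others):
    """Size-weighted fraction of links in `entities` found in `others`."""
    numerator = 0.0
    denominator = 0.0
    for entity in entities:
        links = link_count(len(entity))
        if links == 0:
            continue  # assumption: singletons are ignored in this sketch
        resolved = sum(link_count(len(entity & other)) for other in others)
        numerator += len(entity) * resolved / links
        denominator += len(entity)
    return numerator / denominator if denominator else 0.0


def lea(key, response):
    """Return (precision, recall, f1) for two lists of mention clusters."""
    key = [set(e) for e in key]
    response = [set(e) for e in response]
    recall = lea_side(key, response)
    precision = lea_side(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


key = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
response = [{"a", "b"}, {"c", "d"}, {"f", "g", "h", "i"}]
precision, recall, f1 = lea(key, response)
```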
* added crosslingual coreference to spacy universe
* Updated example to introduce batching example.
Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>
* Add basic tests for Tamil (ta)
* Add comment
Remove superfluous condition
* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
* Add initial design for diff command
For now, the diffing process looks like this:
- The default config is created based on some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must manually specify whether it was optimized for
accuracy or efficiency and whether pretraining was involved.
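
The process above could be sketched like this, using only the standard library. Everything here is illustrative: `default_config_text` is a hypothetical stand-in for generating the default config, and the section and setting names are made up rather than real config keys.

```python
import difflib


def default_config_text(lang, pipeline, optimize, pretraining):
    """Hypothetical: render a default config from user-derived values."""
    lines = [
        "[nlp]",
        f"lang = {lang!r}",
        f"pipeline = {pipeline!r}",
        "[training]",
        "dropout = 0.1",
        f"optimize = {optimize!r}",
        f"pretraining = {pretraining!r}",
    ]
    return "\n".join(lines)


def diff_configs(default_text, user_text):
    """Return unified-diff lines between the default and the user config."""
    return list(
        difflib.unified_diff(
            default_text.splitlines(),
            user_text.splitlines(),
            fromfile="default",
            tofile="user",
            lineterm="",
        )
    )


# lang and pipeline would be read from the user config; optimize and
# pretraining have to be supplied by the user.
default_text = default_config_text("en", ["tagger", "ner"], "efficiency", False)
user_text = default_text.replace("dropout = 0.1", "dropout = 0.3")
diff_lines = diff_configs(default_text, user_text)
```

Deriving the default from the user's own values keeps the diff focused on settings the user actually changed.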
* Make diff command structure similar to siblings
* Include gpu as a user option for CLI
* Make variables more explicit
* Fix type declaration for optimize enum
* Improve docstrings for diff CLI
* Add debug-diff to website API docs
* Switch position of configs so that user config is modded
* Add markdown flag for debug diff
This commit adds a --markdown (--md) flag that allows easier
copy-pasting to GitHub issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).
For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20
* Bump version of wasabi to 0.9.1
So that we can use the add_symbols parameter.
* Apply suggestions from code review
Co-authored-by: Ines Montani <ines@ines.io>
* Update docs based on code review suggestions
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Change command name from diff -> diff-config
* Clarify when options are relevant or not
* Rerun prettier on cli.md
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added test for overlapping arcs
* Provide distinct levels to overlapping arcs
* Update return type hint for get_levels
* Improved formatting spacy/displacy/render.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local>
Co-authored-by: Ines Montani <ines@ines.io>
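
One way to give overlapping arcs distinct levels is sketched below. It assumes arcs are `(start, end, label)` tuples with `start <= end` and that identical arcs were already filtered out; `assign_levels` is a hypothetical name and the code is not necessarily what `spacy/displacy/render.py` does.

```python
def assign_levels(arcs):
    """Map each (start, end, label) arc to a level, shortest arcs first."""
    length = max((end for _, end, _ in arcs), default=0)
    # Highest level used so far at each token position.
    max_level = [0] * length
    levels = {}
    for start, end, label in sorted(arcs, key=lambda a: a[1] - a[0]):
        # Sit one level above anything already drawn over this span.
        level = max(max_level[start:end], default=0) + 1
        for i in range(start, end):
            max_level[i] = level
        levels[(start, end, label)] = level
    return levels


overlapping = [(0, 2, "nsubj"), (1, 3, "dobj")]
levels = assign_levels(overlapping)
```

Because each arc records its level over every position it covers, a later arc that shares any position with it is pushed one level higher.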
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also allowed passing span lists to the merge function. Replaced extend() body with a call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
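
To illustrate the semantics listed above, here is a toy pure-Python container. `ToySpanGroup` is a hypothetical stand-in that stores `(start, end)` tuples; it is not the Cython `SpanGroup` and only mirrors the behaviour described in this message.

```python
class ToySpanGroup:
    """Toy stand-in holding (start, end) tuples instead of Span objects."""

    def __init__(self, spans=(), name=""):
        self.name = name
        self._spans = list(spans)

    def _normalize(self, i):
        # Shared index check/adjustment, kept in one place.
        size = len(self._spans)
        if i < 0:
            i += size
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for {size} spans")
        return i

    def __len__(self):
        return len(self._spans)

    def __getitem__(self, i):
        return self._spans[self._normalize(i)]

    def __setitem__(self, i, span):
        self._spans[self._normalize(i)] = span

    def __delitem__(self, i):
        del self._spans[self._normalize(i)]

    def __iadd__(self, other):
        # In-place add accepts another group or any iterable of spans.
        if isinstance(other, ToySpanGroup):
            other = other._spans
        self._spans.extend(other)
        return self

    def __add__(self, other):
        # Plain add is only defined between two groups.
        if not isinstance(other, ToySpanGroup):
            return NotImplemented
        return ToySpanGroup(self._spans + other._spans, name=self.name)


group = ToySpanGroup([(0, 1), (2, 3)], name="demo")
group[-1] = (2, 4)
del group[0]
group += [(5, 6)]
combined = group + ToySpanGroup([(7, 8)])
```

Restricting `__add__` to groups while letting `__iadd__` take any iterable matches the asymmetry in the list above.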
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* Alignment: use a simplified ragged type for performance
This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.
* AlignmentArray: raise an error when using unsupported index
* AlignmentArray: move error messages to Errors
* AlignmentArray: remove simplified ... with simplifications
* AlignmentArray: fix typo that broke a[n:n] indexing
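
The idea of a simplified ragged type can be sketched in plain Python: one flat data list plus row boundaries, supporting only integer and contiguous-slice indexing. `ToyAlignment` is a hypothetical illustration; the actual `AlignmentArray` differs in its storage and error handling.

```python
class ToyAlignment:
    """Ragged rows stored as one flat list plus cumulative row boundaries."""

    def __init__(self, rows):
        self.data = [x for row in rows for x in row]
        self.lengths = [len(row) for row in rows]
        # bounds[i] and bounds[i + 1] delimit row i inside self.data.
        self.bounds = [0]
        for n in self.lengths:
            self.bounds.append(self.bounds[-1] + n)

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, index):
        if isinstance(index, int):
            if index < 0:
                index += len(self)
            if not 0 <= index < len(self):
                raise IndexError("row index out of range")
            start, end = self.bounds[index], self.bounds[index + 1]
        elif isinstance(index, slice):
            if index.step not in (None, 1):
                raise ValueError("only contiguous slices are supported")
            lo, hi, _ = index.indices(len(self))
            hi = max(lo, hi)  # a[n:n] and reversed bounds give an empty result
            start, end = self.bounds[lo], self.bounds[hi]
        else:
            raise ValueError(f"unsupported index type: {type(index).__name__}")
        return self.data[start:end]


align = ToyAlignment([[0], [1, 2], [], [3]])
```

Supporting only these two index forms is what keeps the lookup to a pair of boundary reads and one slice of the flat data.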
* added failing test case for the issue.
* Fixed typo.
* fixed typo in test.
* added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo.
Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>
* Add vector deduplication
* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples
* Always unique with numpy
* Fix types
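
Conceptually, deduplication keeps one copy of each distinct vector and remaps every key to the surviving row. The sketch below uses plain Python to stay dependency-free (the list above mentions numpy for the uniqueness step); the function signature is hypothetical and is not the `Vocab.deduplicate_vectors()` API.

```python
def deduplicate_rows(rows, key2row):
    """Return (unique_rows, new_key2row) with duplicate vectors merged.

    rows: list of vectors (lists of floats)
    key2row: dict mapping each key to its row index in `rows`
    """
    unique_rows = []
    row_index = {}  # vector contents -> index in unique_rows
    old_to_new = []
    for row in rows:
        signature = tuple(row)
        if signature not in row_index:
            row_index[signature] = len(unique_rows)
            unique_rows.append(list(row))
        old_to_new.append(row_index[signature])
    new_key2row = {key: old_to_new[i] for key, i in key2row.items()}
    return unique_rows, new_key2row


rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
key2row = {"cat": 0, "dog": 1, "feline": 2}
unique_rows, new_key2row = deduplicate_rows(rows, key2row)
```

Keys that previously pointed at identical vectors now share a row, so the table shrinks without changing any lookup result.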