Commit Graph

15679 Commits

Author SHA1 Message Date
Richard Hudson
4b227f4861
Merge pull request #10669 from mgrojo/develop
Fix some issues in Spanish stop-word list and examples
2022-04-19 09:37:34 +02:00
mgr
3d50b1a989 Fix some issues in Spanish examples
- Spelling: nationalities in lowercase, accent.
- Incorrect verb composition
- Untranslated word
2022-04-18 22:12:57 +02:00
mgr
2a2654c756 Remove significant or not very frequent words from stop word list [es]
The list of stop words for Spanish contained many inadequate words, see:

https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100

Removed words:
- verb forms of 'trabajar' (work) and intentar (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words due to being too significant of too infrequent:
  actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
  pais, principalmente, raras

Added other stop words for completion:
- Spanish one-letter words
- numbers up to twelve

Some reformatting to 79 columns.

When in doubt, the English and German lists have been consulted as good
examples.
2022-04-18 22:04:02 +02:00
Madeesh Kannan
aa6780eb27
Matcher: Remove superfluous GIL-acquiring check in get_is_final (#10659)
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`

This check incurred a significant performance penalty due to  implict interactions between the GIL and Cython ref-counting code.

* `Matcher`: Inline `PatternStateC` accessors
2022-04-18 12:59:34 +02:00
Duy Ngo
229ecaf0ea
Add numbers and definitions (#10665) 2022-04-18 12:58:32 +02:00
Paul O'Leary McCann
683f470852 Merge branch 'master' into feature/coref 2022-04-18 18:39:08 +09:00
Schero1994
d622883a42
Adding and updating content in the spacy universe (#10493)
* signing contributor agreement

* adding new content to the spaCy universe

* updating outdated example codes

* resolving issues for the PR

* resolve review for klayers

* remove contributor-agreement file from the PR

* Update code example of spaCySentiWS

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy-sentiws code example

Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-04-15 15:36:54 +02:00
Joachim Fainberg
4e1716223c
displaCy: Avoid increasing levels for identical arcs (#10639)
* Test for arc levels for identical arcs

Also moves the test in order with the other numbered tests.

* displaCy: filter identical arcs

Avoid increased levels due to identical arcs by first
filtering any identical arcs.

* Sort keys before filtering

Manual entry with keys out of order would previously become
different tuples and therefore not filtered correctly.

Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>
2022-04-14 16:48:00 +02:00
Philip Vollet
e63a5d4888
Update newsletter id (#10655) 2022-04-14 13:34:01 +02:00
Paul O'Leary McCann
afd255c0ed Undo multiply by 100
This was mistaken, not sure why my score seemed to be off before.
2022-04-14 18:42:09 +09:00
Paul O'Leary McCann
08729e0fbd Remove end adjustment
The difference in environments was due to a change in Thinc, the code
here is fine.
2022-04-14 18:31:30 +09:00
fonfonx
028cbad05e
Add feminine form of word "one" in French (#10653)
* Add French number

* Add fonfonx.md

* Add feminine ordinal words for French
2022-04-14 10:21:27 +02:00
Schero1994
caf8528af7
Batch #1 | spaCy universe cleanup (#10642)
* delete universe object: wmd-relax

* delete universe object: spaCy.jl

* delete universe object: saber

* delete universe object: languagecrunch

* delete universe object: gracyql

* delete universe object: ExcelCy

* delete universe object: EpiTator

Co-authored-by: schaeran <schaeran1994@gmail.com>
2022-04-14 10:08:19 +02:00
single-fingal
4228f3c757
Fix a few minor bugs in the SpanGroup API web docs (#10650)
* Fix a few minor bugs in the SpanGroup API web docs

* Update SpanGroup docs examples to have Spans reflect intended "errors"
2022-04-14 09:59:48 +02:00
Paul O'Leary McCann
8181d4570c Multiply accuracy by 100
This seems to match with the scorer expectations better
2022-04-14 15:56:38 +09:00
Paul O'Leary McCann
e8af02700f Remove all coref scoring exept LEA
This is necessary because one of the three old methods relied on scipy
for some complex problem solving. LEA is generally better for
evaluations.

The downside is that this means evaluations aren't comparable with many
papers, but canonical scoring can be supported using external eval
scripts or other methods.
2022-04-13 21:02:18 +09:00
Paul O'Leary McCann
2300f4df3d Fix span score logging 2022-04-13 20:37:06 +09:00
Paul O'Leary McCann
d470fa03c1 Adjust end indices
It's not clear if this is technically correct or not but it won't run
without it for me.
2022-04-13 20:19:21 +09:00
kadarakos
b53113e3b8
Preparing span predictor for predicting from gold (#10547)
Note this is squashed because rebasing had conflicts.

* remove unnecessary .device

* span predictor debug start

* gearing up SpanPredictor for gold-heads

* merge SpanPredictor attributes

* remove useless extra prefix and device from spanpredictor

* make sure predicted and reference keeps aligned

* handle empty head_ids

* handle empty clusters

* addressing suggestions by @polm

* nicer restore

* fix score overwriting bug

* prepare for aligned heads-spans training

* span accuracy score

* update with eg.predited as other components

* add backprop callback to spanpredictor

* report start- and end-accuracies separately

* fixing scorer

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
2022-04-13 19:42:49 +09:00
Adriane Boyd
64602d997d
Require srsly v2.4.3+ due to buffer overflow vulnerability (#10651) 2022-04-13 11:41:40 +02:00
Richard Hudson
75fbbcdc18
Display warning when spacy.explain() finds no term (#10645)
* Display warning when spacy.explain() finds no term

* Updated warning message text
2022-04-12 10:48:28 +02:00
Kádár Ákos
6aedd98d02 fixing scorer 2022-04-11 16:10:14 +02:00
Kádár Ákos
7a239f2ec7 report start- and end-accuracies separately 2022-04-08 14:57:19 +02:00
Kádár Ákos
2a1ad4c5d2 add backprop callback to spanpredictor 2022-04-08 14:56:44 +02:00
David Berenstein
d4196a62f1
added crosslingual coreference to spacy universe without additional commits (#10580)
* added crosslingual coreference to spacy universe

* Updated example to introduce batching example.

Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>
2022-04-08 08:23:58 +02:00
Madeesh Kannan
9ba3e1cb2f
Basic tests for the Tamil language (#10629)
* Add basic tests for Tamil (ta)

* Add comment
Remove superfluous condition

* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
2022-04-07 14:47:37 +02:00
Kádár Ákos
3ba913109d update with eg.predited as other components 2022-04-07 13:20:12 +02:00
Lj Miranda
02dafa3a84
Add debug diff command in spaCy CLI (#10502)
* Add initial design for diff command

For now, the diffing process looks like this:
- The default config is created based from some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must supply manually if it was optimized for acc/efficiency
and if pretraining was involved.

* Make diff command structure similar to siblings

* Include gpu as a user option for CLI

* Make variables more explicit

* Fix type declaration for optimize enum

* Improve docstrings for diff CLI

* Add debug-diff to website API docs

* Switch position of configs so that user config is modded

* Add markdown flag for debug diff

This commit adds a --markdown (--md) flag that allows easier
copy-pasting to Github issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).

For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20

* Bump version of wasabi to 0.9.1

So that we can use the add_symbols parameter.

* Apply suggestions from code review

Co-authored-by: Ines Montani <ines@ines.io>

* Update docs based on code review suggestions

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Change command name from diff -> diff-config

* Clarify when options are relevant or not

* Rerun prettier on cli.md

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-07 10:48:45 +02:00
Joachim Fainberg
b91255a454
displacy: avoid overlapping arcs in manual mode (#10534)
* Added test for overlapping arcs

* Provide distinct levels to overlapping arcs

* Update return type hint for get_levels

* Improved formatting spacy/displacy/render.py

Co-authored-by: Ines Montani <ines@ines.io>

Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local>
Co-authored-by: Ines Montani <ines@ines.io>
2022-04-05 09:08:02 +02:00
Kádár Ákos
ef141ad399 span accuracy score 2022-04-04 18:10:09 +02:00
Adriane Boyd
0d0153db63
Update default spans_key to sc in API docs (#10616) 2022-04-04 18:09:15 +02:00
Bram Vanroy
f966bf6a15
Update to spacy_conll in universe (#10617)
* update to spacy_conll

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-04 17:57:52 +02:00
Madeesh Kannan
cfd9217bae
Update link to flake8 config (#10620)
* Update link to flake8 config

* Run prettier
2022-04-04 17:35:37 +02:00
Kádár Ákos
a1d0219903 prepare for aligned heads-spans training 2022-04-04 15:26:15 +02:00
Adriane Boyd
849bef2de6
Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0
Set version to v3.3.0.dev0
2022-04-04 09:18:07 +02:00
Adriane Boyd
e422101e00 Temporarily skip tests that require models/compat 2022-04-01 11:09:28 +02:00
Adriane Boyd
ca54de27bb
Support more internal methods for SpanGroup (#10476)
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects

* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap

* Added a method to efficiently merge SpanGroups

* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge

* Renamed merge to concat and added missing things to documentation

* Added operator+ and operator += in the documentation

* Added a test for Doc deallocation

* Update spacy/tokens/span_group.pyx

* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction

* Fixed typos in SpanGroup documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation

* SpanGroup: moved repetitive list index check/adjustment in a separate function

* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Support more internal methods for SpanGroup

Add support for:

* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only

Adapted from #9698 with the scope limited to the magic methods.

* Use v3.3 as new version in docs

* Add new tag to SpanGroup.copy in API docs

* Remove duplicate import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remaining suggestions and formatting

Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
2022-04-01 09:56:26 +02:00
Adriane Boyd
d56b1400d2 Set version to v3.3.0.dev0 2022-04-01 09:54:52 +02:00
Daniël de Kok
c90dd6f265
Alignment: use a simplified ragged type for performance (#10319)
* Alignment: use a simplified ragged type for performance

This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.

* AlignmentArray: raise an error when using unsupported index

* AlignmentArray: move error messages to Errors

* AlignmentArray: remove simlified ... with simplifications

* AlignmentArray: fix typo that broke a[n:n] indexing
2022-04-01 09:02:06 +02:00
Adriane Boyd
03762b4b92
Add spancat, trainable_lemmatizer to quickstart (#10524)
* Add `SPACY` and `IS_SPACE` as default `tok2vec` features
2022-04-01 09:01:04 +02:00
Adriane Boyd
7d1edc0c25
Merge pull request #10593 from adrianeboyd/chore/undo-click-pin
Revert "Add click pin to avoid typer issues (#10573)"
2022-04-01 09:00:38 +02:00
Adriane Boyd
e3ccc1973b
Provide debug data info for floret vectors (#10592) 2022-03-31 15:11:32 +02:00
Adriane Boyd
88933ca878 Revert "Add click pin to avoid typer issues (#10573)"
This reverts commit 9966e08f32.
2022-03-31 14:16:21 +02:00
Kádár Ákos
63a41ba50a fix score overwriting bug 2022-03-30 17:28:20 +02:00
Yunus Atahan
36d3af3013
Fixed typo in Turkish lang. (#10582)
* added failing test case for the issue.

* Fixed typo.

* fixed typo in test.

* added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo.

Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>
2022-03-30 13:16:08 +02:00
Adriane Boyd
f98b41c390
Add vector deduplication (#10551)
* Add vector deduplication

* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples

* Always unique with numpy

* Fix types
2022-03-30 08:54:23 +02:00
Adriane Boyd
9966e08f32
Add click pin to avoid typer issues (#10573) 2022-03-29 11:15:24 +02:00
Kádár Ákos
7ff99a3acc nicer restore 2022-03-28 18:16:41 +02:00
Kádár Ákos
06d680b269 addressing suggestions by @polm 2022-03-28 14:31:51 +02:00
Kádár Ákos
e4b4b67ef6 handle empty clusters 2022-03-28 11:29:00 +02:00