Commit Graph

9337 Commits

Author SHA1 Message Date
kadarakos
97fd9741c6 don't keep recomputing self._label_map for each span 2023-03-03 15:41:41 +00:00
kadarakos
c7e7343999
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-03 15:56:00 +01:00
kadarakos
fded200128
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-03 15:55:42 +01:00
kadarakos
b972328337
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-03 15:55:15 +01:00
kadarakos
0a74e8c260
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-03 15:54:54 +01:00
Adriane Boyd
6182213fef
Merge branch 'master' into add/exclusive-spancat 2023-03-01 15:51:16 +01:00
Sofie Van Landeghem
74cae47bf6
rely on is_empty property instead of __len__ (#12347) 2023-03-01 12:06:07 +01:00
Adriane Boyd
8f058e39bd
Fix error message for displacy auto_select_port (#12343) 2023-02-28 16:36:03 +01:00
TAN Long
071667376a
Add new REL_OPs: >+, >-, <+, and <- (#12334)
* Add immediate left/right child/parent dependency relations

* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.

---------

Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-02-28 14:36:33 +01:00
lise-brinck
e2de188cf1
Bugfix/swedish tokenizer (#12315)
* add unittest for explosion#12311

* create punctuation.py for swedish

* removed : from infixes in swedish punctuation.py

* allow : as infix if succeeding char is uppercase
2023-02-27 10:53:45 +01:00
Kevin Humphreys
acdd993071
Matcher performance fix for extension predicates: use shared key function (#12272)
* standardize predicate key format

* single key function

* Make optional args in key function keyword-only

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-27 08:35:08 +01:00
Paul O'Leary McCann
1e8bac99f3
Add tests for projects to master (#12303)
* Add tests for projects to master

* Fix git clone related issues on Windows

* Add stat import
2023-02-23 10:22:57 +01:00
kadarakos
86d3e78c64 make label mapper private 2023-02-20 17:02:27 +00:00
kadarakos
813b3551ed Merge branch 'add/exclusive-spancat' of github.com:ljvmiranda921/spaCy into spancat-exclusive 2023-02-20 10:52:34 +00:00
kadarakos
6f3b257cf4 raise error instead of just print 2023-02-20 10:48:41 +00:00
kadarakos
43d5cab2c2
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-20 11:37:51 +01:00
kadarakos
e847487ebb remove duplicate declaration 2023-02-20 10:36:54 +00:00
kadarakos
af3fa670d4
Update spacy/tests/pipeline/test_spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-20 11:36:32 +01:00
Adriane Boyd
80bc140533
Add grc to langs with lexeme norms in spacy-lookups-data (#12287) 2023-02-16 17:57:02 +01:00
Edward
61b8454137
Adjust return type of registry.find (#12227)
* Fix registry find return type

* add dot

* Add type ignore for mypy

* update black formatting version

* add mypy ignore to package cli

* mypy type fix (for real)

* Update find description in spacy/util.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* adjust mypy directive

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-02-15 12:32:53 +01:00
kadarakos
afc3a5a4af black 2023-02-10 14:07:39 +00:00
kadarakos
a07aafc28e refactor make_span_group 2023-02-10 14:06:56 +00:00
kadarakos
a281a7c9a1 tests for make_span_group with negative labels 2023-02-10 14:06:07 +00:00
kadarakos
b98cba2bef black 2023-02-08 19:45:01 +00:00
kadarakos
43162029bc bugfix 2023-02-08 19:43:51 +00:00
kadarakos
ec941a128d single label make_spangroup test 2023-02-08 19:43:33 +00:00
kadarakos
6fc25f64dd add spans.attrs[scores] 2023-02-07 18:12:32 +00:00
kadarakos
afc3ce1c7e logical bug in configuration check 2023-02-06 19:05:35 +00:00
kadarakos
5c927effde mypy 2023-02-06 19:03:33 +00:00
kadarakos
c24b3785a6 replace single_label with add_negative_label and adjust inference 2023-02-06 18:54:30 +00:00
kadarakos
c864f12e28 remove spancat exclusive 2023-02-06 10:15:53 +00:00
kadarakos
b8cdcfb2f5 black 2023-02-02 15:23:05 +00:00
kadarakos
d13e494abd don't rely on default arguments 2023-02-02 10:36:36 +00:00
Sofie Van Landeghem
79ef6cf0f9
Have logging calls use string formatting types (#12215)
* change logging call for spacy.LookupsDataLoader.v1

* substitutions in language and _util

* various more substitutions

* add string formatting guidelines to contribution guidelines
2023-02-02 11:15:22 +01:00
kadarakos
5ccb154972 more docstring and fix negative_label 2023-02-01 11:16:34 +00:00
kadarakos
edf9134e45 add docstrings 2023-01-31 17:06:20 +00:00
kadarakos
079f09b97c black 2023-01-31 16:33:06 +00:00
kadarakos
8a807ef1dd black 2023-01-31 16:30:12 +00:00
kadarakos
dceeb02b94 wire up different make_spangroups for single and multilabel 2023-01-31 16:27:26 +00:00
kadarakos
52e7324df4 Merge branch 'master' into spancat-exclusive 2023-01-31 16:05:08 +00:00
kadarakos
f1e091a31f rename spancat_exclusive to singlelable 2023-01-31 16:04:35 +00:00
kadarakos
3f6fd410cc merge multilabel and singlelabel spancat 2023-01-31 16:04:11 +00:00
kadarakos
330a452f5e Merge branch 'master' into spancat-exclusive 2023-01-31 16:03:35 +00:00
Raphael Mitsch
02af17a5c8
Remove flaky assertions. (#12210) 2023-01-31 16:52:06 +01:00
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
Raphael Mitsch
950fceceb6
Make test_cli_find_threshold() more robust. (#12148) 2023-01-23 14:42:33 +01:00