Commit Graph

9322 Commits

Author SHA1 Message Date
kadarakos
86d3e78c64 make label mapper private 2023-02-20 17:02:27 +00:00
kadarakos
813b3551ed Merge branch 'add/exclusive-spancat' of github.com:ljvmiranda921/spaCy into spancat-exclusive 2023-02-20 10:52:34 +00:00
kadarakos
6f3b257cf4 raise error instead of just print 2023-02-20 10:48:41 +00:00
kadarakos
43d5cab2c2
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-20 11:37:51 +01:00
kadarakos
e847487ebb remove duplicate declaration 2023-02-20 10:36:54 +00:00
kadarakos
af3fa670d4
Update spacy/tests/pipeline/test_spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-20 11:36:32 +01:00
kadarakos
afc3a5a4af black 2023-02-10 14:07:39 +00:00
kadarakos
a07aafc28e refactor make_span_group 2023-02-10 14:06:56 +00:00
kadarakos
a281a7c9a1 tests for make_span_group with negative labels 2023-02-10 14:06:07 +00:00
kadarakos
b98cba2bef black 2023-02-08 19:45:01 +00:00
kadarakos
43162029bc bugfix 2023-02-08 19:43:51 +00:00
kadarakos
ec941a128d single label make_spangroup test 2023-02-08 19:43:33 +00:00
kadarakos
6fc25f64dd add spans.attrs[scores] 2023-02-07 18:12:32 +00:00
kadarakos
afc3ce1c7e logical bug in configuration check 2023-02-06 19:05:35 +00:00
kadarakos
5c927effde mypy 2023-02-06 19:03:33 +00:00
kadarakos
c24b3785a6 replace single_label with add_negative_label and adjust inference 2023-02-06 18:54:30 +00:00
kadarakos
c864f12e28 remove spancat exclusive 2023-02-06 10:15:53 +00:00
kadarakos
b8cdcfb2f5 black 2023-02-02 15:23:05 +00:00
kadarakos
d13e494abd don't rely on default arguments 2023-02-02 10:36:36 +00:00
kadarakos
5ccb154972 more docstring and fix negative_label 2023-02-01 11:16:34 +00:00
kadarakos
edf9134e45 add docstrings 2023-01-31 17:06:20 +00:00
kadarakos
079f09b97c black 2023-01-31 16:33:06 +00:00
kadarakos
8a807ef1dd black 2023-01-31 16:30:12 +00:00
kadarakos
dceeb02b94 wire up different make_spangroups for single and multilabel 2023-01-31 16:27:26 +00:00
kadarakos
52e7324df4 Merge branch 'master' into spancat-exclusive 2023-01-31 16:05:08 +00:00
kadarakos
f1e091a31f rename spancat_exclusive to singlelabel 2023-01-31 16:04:35 +00:00
kadarakos
3f6fd410cc merge multilabel and singlelabel spancat 2023-01-31 16:04:11 +00:00
kadarakos
330a452f5e Merge branch 'master' into spancat-exclusive 2023-01-31 16:03:35 +00:00
Raphael Mitsch
02af17a5c8
Remove flaky assertions. (#12210) 2023-01-31 16:52:06 +01:00
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
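Note on 8d69874afb: the plain text format described in that commit message (UTF-8, one document per line, blank lines ignored, only newline/carriage return stripped) can be sketched roughly as follows. This is a minimal illustration and not spaCy's reader: `read_plain_text_corpus` is a hypothetical helper, and treating a line as blank only when nothing remains after removing the line ending is an assumption.

```python
from pathlib import Path
from typing import Iterator


def read_plain_text_corpus(path: Path) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file.

    Hypothetical helper for illustration; not part of spaCy.
    """
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            # Strip only the line ending, so leading and trailing
            # spaces inside a document are preserved.
            text = line.rstrip("\r\n")
            # Assumption: a line is "blank" if it is empty after stripping.
            if text:
                yield text
```

Because the function is a generator, a very large corpus is streamed line by line instead of being loaded into memory, which matches the motivation given in the commit message.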
Raphael Mitsch
950fceceb6
Make test_cli_find_threshold() more robust. (#12148) 2023-01-23 14:42:33 +01:00
Richard Hudson
f9e020dd67
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer (#12017)
* Refactor _scores2guesses

* Handle arrays on GPU

* Convert argmax result to raw integer

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use NumpyOps() to copy data to CPU

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Changes based on review comments

* Use different _scores2guesses depending on tree_k

* Add tests for corner cases

* Add empty line for consistency

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2023-01-20 19:34:11 +01:00
Adriane Boyd
1e993d3b03
Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2
Revert "Temporarily skip tests that require models/compat"
2023-01-19 15:59:30 +01:00
Adriane Boyd
3b8918e166
API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128)
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar

* adjust to mdx

* linkout to InMemoryLookupKB at first occurrence in kb.mdx

* fix links to docs

* revert Azure trigger setting (I'll make a separate PR)

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-19 13:29:17 +01:00
Adriane Boyd
dc0f527039 Revert "Temporarily skip tests that require models/compat"
This reverts commit 378db0eb1e.
2023-01-18 12:54:56 +01:00
Adriane Boyd
794cea6907
Fix comments and examples for levenshtein_compare (#12113) 2023-01-18 08:02:33 +01:00
Lj Miranda
a722bd8fba Add suggester to spancat docstrings 2023-01-17 20:38:35 +08:00
Lj Miranda
26d5d637e3 Add suggester documentation in Exclusive_SpanCategorizer 2023-01-17 10:34:21 +08:00
Lj Miranda
e61f0a4035 Update how spancat_exclusive is constructed
In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
    in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat
2023-01-17 10:17:29 +08:00
Lj Miranda
65ce4347ef
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-17 09:38:47 +08:00
Lj Miranda
bf2f0173d2 Merge branch 'master' into add/exclusive-spancat 2023-01-13 17:30:29 +08:00
github-actions[bot]
9ef7d26032
Auto-format code with black (#12100)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2023-01-13 10:12:10 +01:00
Daniël de Kok
dda7331da3
Handle missing annotations in the edit tree lemmatizer (#12098)
The losses/gradients of missing annotations were not correctly masked
out. Fix this and check the masking in the partial data test.
2023-01-12 12:13:55 +01:00
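Note on dda7331da3: masking out the gradients of missing annotations can be sketched as below. This is an illustration under stated assumptions, not the lemmatizer's code: it assumes a softmax output with a cross-entropy loss (so the gradient is `probs - one_hot`), represents a missing annotation as `None`, and `masked_gradient` is a hypothetical name.

```python
from typing import List, Optional


def masked_gradient(
    probs: List[List[float]], labels: List[Optional[int]]
) -> List[List[float]]:
    """Return d(loss)/d(scores) per token, zeroed where the label is missing.

    Hypothetical helper; assumes softmax + cross-entropy.
    """
    grads = []
    for row, label in zip(probs, labels):
        if label is None:
            # Missing annotation: contribute no gradient for this token.
            grads.append([0.0] * len(row))
        else:
            # Standard softmax/cross-entropy gradient: probs minus one-hot.
            grads.append(
                [p - (1.0 if i == label else 0.0) for i, p in enumerate(row)]
            )
    return grads
```

Without the `None` branch, an unannotated token would be treated as if some class were the gold label and would push the model in an arbitrary direction.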
Daniël de Kok
319eb508b5
Add a spacy benchmark speed subcommand (#11902)
* Add a `spacy evaluate speed` subcommand

This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.

To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.

* Fix import cycle

* Restore `spacy evaluate`, make `spacy benchmark speed` an alias

* Add documentation for `spacy benchmark`

* CREATES -> PRINTS

* WPS -> words/s

* Disable formatting of benchmark speed arguments

* Fail with an error message when trying to speed bench empty corpus

* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`

* Fix docstring webpage reference

* tests: check `evaluate` output against `benchmark accuracy`
2023-01-12 11:55:21 +01:00
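Note on 319eb508b5: the measurement procedure described in that commit message (warmup rounds, then timing batches of randomly shuffled documents, then reporting a mean with a 95% confidence interval) might look roughly like the sketch below. It is not the `spacy benchmark speed` implementation. The names `mean_with_ci` and `benchmark_speed`, the whitespace-based word count, and the normal-approximation interval (z = 1.96) are all assumptions made for illustration.

```python
import math
import random
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def mean_with_ci(samples: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    mean = statistics.mean(samples)
    if len(samples) < 2:
        return mean, 0.0
    # Normal approximation: z * standard error of the mean.
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


def benchmark_speed(
    process: Callable[[List[str]], object],
    docs: List[str],
    batch_size: int = 2,
    warmup: int = 1,
    n_batches: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Measure words/s of `process` over randomly drawn batches."""
    if not docs:
        # Mirrors the commit's intent of failing on an empty corpus.
        raise ValueError("Cannot benchmark speed on an empty corpus.")
    rng = random.Random(seed)
    for _ in range(warmup):
        process(docs)  # warmup rounds are not timed
    rates = []
    for _ in range(n_batches):
        batch = rng.sample(docs, k=min(batch_size, len(docs)))
        n_words = sum(len(doc.split()) for doc in batch)
        start = time.perf_counter()
        process(batch)
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(n_words / elapsed)
    return mean_with_ci(rates)
```

A call such as `benchmark_speed(lambda batch: [d.upper() for d in batch], ["a b", "c d e"])` returns the mean words/s and the interval half-width; the actual numbers depend on the machine.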
Paul O'Leary McCann
8e558095a1
Clean up displacy port-related error messages, docs (#12089)
* Clean up displacy port-related error messages, docs

There were some issues in the error messages and docs in #11948.

1. the error messages didn't specify the port argument to displacy.serve correctly
2. the docs didn't mark the auto select argument as new

This addresses those issues.

* Update website/docs/api/top-level.md

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Apply prettier

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-01-12 14:54:09 +09:00
Adriane Boyd
9e0322de1a
Restore v2 token_acc score implementation (#12073)
In the v3 scorer refactoring, `token_acc` was implemented incorrectly.
It should use `precision` instead of `fscore` for the measure of
correctly aligned tokens / number of predicted tokens.

Fix the docs to reflect that the measure uses the number of predicted
tokens rather than the number of gold tokens.
2023-01-11 08:01:47 +01:00
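Note on 9e0322de1a: the difference between precision and F-score for token accuracy can be shown with a small worked example. This is a sketch, not spaCy's `Scorer`: `token_scores` is a hypothetical function, and tokens are represented as `(start, end)` character offsets, with a predicted token counting as aligned only if the same offsets appear in the gold tokenization.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, int]  # (start_char, end_char)


def token_scores(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Compute precision, recall and F-score over exactly matching tokens."""
    aligned = len(set(gold) & set(pred))
    # Precision: correctly aligned tokens / number of predicted tokens.
    precision = aligned / len(pred) if pred else 0.0
    # Recall: correctly aligned tokens / number of gold tokens.
    recall = aligned / len(gold) if gold else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}
```

For gold tokens `[(0, 3), (4, 9)]` and predicted tokens `[(0, 3), (4, 6), (6, 9)]`, one token is aligned, so precision is 1/3, recall is 1/2 and the F-score is 0.4. Reporting the F-score as `token_acc` would therefore overstate the value that the commit message defines as aligned tokens over predicted tokens.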