616 Commits

Author SHA1 Message Date
kadarakos
0a74e8c260
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-03 15:54:54 +01:00
Adriane Boyd
6182213fef
Merge branch 'master' into add/exclusive-spancat 2023-03-01 15:51:16 +01:00
Sofie Van Landeghem
74cae47bf6
rely on is_empty property instead of __len__ (#12347) 2023-03-01 12:06:07 +01:00
kadarakos
86d3e78c64 make label mapper private 2023-02-20 17:02:27 +00:00
kadarakos
813b3551ed Merge branch 'add/exclusive-spancat' of github.com:ljvmiranda921/spaCy into spancat-exclusive 2023-02-20 10:52:34 +00:00
kadarakos
6f3b257cf4 raise error instead of just print 2023-02-20 10:48:41 +00:00
kadarakos
43d5cab2c2
Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-20 11:37:51 +01:00
kadarakos
e847487ebb remove duplicate declaration 2023-02-20 10:36:54 +00:00
kadarakos
a07aafc28e refactor make_span_group 2023-02-10 14:06:56 +00:00
kadarakos
43162029bc bugfix 2023-02-08 19:43:51 +00:00
kadarakos
6fc25f64dd add spans.attrs[scores] 2023-02-07 18:12:32 +00:00
kadarakos
afc3ce1c7e logical bug in configuration check 2023-02-06 19:05:35 +00:00
kadarakos
5c927effde mypy 2023-02-06 19:03:33 +00:00
kadarakos
c24b3785a6 replace single_label with add_negative_label and adjust inference 2023-02-06 18:54:30 +00:00
kadarakos
c864f12e28 remove spancat exclusive 2023-02-06 10:15:53 +00:00
kadarakos
b8cdcfb2f5 black 2023-02-02 15:23:05 +00:00
kadarakos
d13e494abd don't rely on default arguments 2023-02-02 10:36:36 +00:00
kadarakos
5ccb154972 more docstring and fix negative_label 2023-02-01 11:16:34 +00:00
kadarakos
edf9134e45 add docstrings 2023-01-31 17:06:20 +00:00
kadarakos
8a807ef1dd black 2023-01-31 16:30:12 +00:00
kadarakos
dceeb02b94 wire up different make_spangroups for single and multilabel 2023-01-31 16:27:26 +00:00
kadarakos
3f6fd410cc merge multilabel and singlelabel spancat 2023-01-31 16:04:11 +00:00
kadarakos
330a452f5e Merge branch 'master' into spancat-exclusive 2023-01-31 16:03:35 +00:00
Richard Hudson
f9e020dd67
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer (#12017)
* Refactor _scores2guesses

* Handle arrays on GPU

* Convert argmax result to raw integer

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use NumpyOps() to copy data to CPU

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Changes based on review comments

* Use different _scores2guesses depending on tree_k

* Add tests for corner cases

* Add empty line for consistency

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2023-01-20 19:34:11 +01:00
Lj Miranda
a722bd8fba Add suggester to spancat docstrings 2023-01-17 20:38:35 +08:00
Lj Miranda
26d5d637e3 Add suggester documentation in Exclusive_SpanCategorizer 2023-01-17 10:34:21 +08:00
Lj Miranda
e61f0a4035 Update how spancat_exclusive is constructed
In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
    in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat
2023-01-17 10:17:29 +08:00
Lj Miranda
65ce4347ef
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-17 09:38:47 +08:00
Lj Miranda
bf2f0173d2 Merge branch 'master' into add/exclusive-spancat 2023-01-13 17:30:29 +08:00
Daniël de Kok
dda7331da3
Handle missing annotations in the edit tree lemmatizer (#12098)
The losses/gradients of missing annotations were not correctly masked
out. Fix this and check the masking in the partial data test.
2023-01-12 12:13:55 +01:00
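The masking described in the commit above can be illustrated with a minimal sketch. This is not the spaCy implementation: the function name `masked_gradient`, the use of `None` for a missing annotation, and the squared-gradient loss are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def masked_gradient(
    scores: List[List[float]], gold: List[Optional[int]]
) -> Tuple[List[List[float]], float]:
    """Return (gradient, loss) where rows without a gold label contribute nothing.

    `scores` holds one row of class probabilities per token; `gold` holds the
    index of the correct class, or None when the annotation is missing
    (hypothetical encoding, chosen for this sketch).
    """
    d_scores: List[List[float]] = []
    for row, label in zip(scores, gold):
        if label is None:
            # Missing annotation: mask the whole row so it yields no gradient.
            d_scores.append([0.0] * len(row))
        else:
            # Annotated row: gradient is prediction minus one-hot target.
            d_scores.append(
                [p - (1.0 if k == label else 0.0) for k, p in enumerate(row)]
            )
    # Loss is the sum of squared gradients (an assumption of this sketch).
    loss = sum(g * g for row in d_scores for g in row)
    return d_scores, loss


grad, loss = masked_gradient([[0.7, 0.3], [0.2, 0.8]], [0, None])
```

Only the first row affects `grad` and `loss` here; the second row is zeroed because its annotation is missing.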
Kevin Humphreys
19650ebb52
Enable fuzzy text matching in Matcher (#11359)
* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* fuzzy match only on oov tokens

* remove polyleven

* exclude whitespace tokens

* don't allow more edits than characters

* fix min distance

* reinstate FUZZY operator

with length-based distance function

* handle sets inside regex operator

* remove is_oov check

* attempt build fix

no mypy failure locally

* re-attempt build fix

* don't overwrite fuzzy param value

* move fuzzy_match

to its own Python module to allow patching

* move fuzzy_match back inside Matcher

simplify logic and add tests

* Format tests

* Parametrize fuzzyn tests

* Parametrize and merge fuzzy+set tests

* Format

* Move fuzzy_match to a standalone method

* Change regex kwarg type to bool

* Add types for fuzzy_match

- Refactor variable names
- Add test for symmetrical behavior

* Parametrize fuzzyn+set tests

* Minor refactoring for fuzz/fuzzy

* Make fuzzy_match a Matcher kwarg

* Update type for _default_fuzzy_match

* don't overwrite function param

* Rename to fuzzy_compare

* Update fuzzy_compare default argument declarations

* allow fuzzy_compare override from EntityRuler

* define new Matcher keyword arg

* fix type definition

* Implement fuzzy_compare config option for EntityRuler and SpanRuler

* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects

* Use simpler fuzzy_compare algorithm

* Update types

* Increase minimum to 2 in fuzzy_compare to allow one transposition

* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

* Add FUZZY6..9

* Add initial docs

* Increase default fuzzy to rounded 30% of pattern length

* Update docs for fuzzy_compare in components

* Update EntityRuler and SpanRuler API docs

* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

To have naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.

* Fix schema aliases

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typo

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add FUZZY6-9 operators and update tests

* Parameterize test over greedy

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type for fuzzy_compare to remove Optional

* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

* Update docs following levenshtein_compare renaming

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-10 10:36:17 +01:00
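The comparison rule that the commit above converges on (Levenshtein distance, a default budget of a rounded 30% of the pattern length, and a minimum of 2 so that one transposition is allowed) can be sketched in plain Python. This is a sketch under those stated assumptions, not the code in `spacy.matcher.levenshtein`; the names `levenshtein` and `fuzzy_compare` and the convention that a negative `fuzzy` means "use the default" are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_compare(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    """Return True if input_text is within the allowed edits of pattern_text.

    A negative `fuzzy` selects the default budget (assumption of this sketch):
    a rounded 30% of the pattern length, but never less than 2.
    """
    if fuzzy >= 0:
        max_edits = fuzzy
    else:
        max_edits = max(2, round(0.3 * len(pattern_text)))
    return levenshtein(input_text, pattern_text) <= max_edits


print(fuzzy_compare("definately", "definitely"))  # one substitution
print(fuzzy_compare("teh", "the"))  # a transposition costs two plain edits
print(fuzzy_compare("Gogle", "Google", 0))  # explicit budget of zero edits
```

Because plain Levenshtein distance has no transposition operation, swapping two adjacent characters costs two edits, which is why a floor of 2 is needed for short patterns.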
Sofie Van Landeghem
6d03b04901
Improve score_cats for use with multiple textcat components (#11820)
* add test for running evaluate on an nlp pipeline with two distinct textcat components

* cleanup

* merge dicts instead of overwrite

* don't add more labels to the given set

* Revert "merge dicts instead of overwrite"

This reverts commit 89bee0ed77.

* Switch tests to separate scorer keys rather than merged dicts

* Revert unrelated edits

* Switch textcat scorers to v2

* formatting

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-09 11:43:48 +01:00
Adriane Boyd
ef9e504eac
Rename modified textcat scorer to v2 (#11971)
As a follow-up to #11696, rename the modified scorer to v2 and move the
v1 scorer to `spacy-legacy`.
2022-12-29 14:01:08 +01:00
kadarakos
933b54ac79
typo fix (#11995) 2022-12-26 13:26:35 +01:00
Lj Miranda
8c4eee28bc Better approach for handling zero suggestions 2022-12-21 20:01:02 +08:00
Lj Miranda
a3fad0b983 Handle zero suggestions to make tests pass
I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.

The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.
2022-12-21 10:36:01 +08:00
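The early-return behaviour described in the commit above can be sketched as follows. This is a simplified stand-in, not the spancat code: spans are plain `(start, end, label, score)` tuples instead of a `SpanGroup`, and the function name, signature, and threshold are hypothetical.

```python
from typing import List, Sequence, Tuple

SpanTuple = Tuple[int, int, str, float]


def make_span_group(
    candidates: Sequence[Tuple[int, int]],
    scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.5,
) -> List[SpanTuple]:
    """Build a list of labelled spans from suggested candidates and their scores."""
    # Guard: with zero suggestions there are no score rows to index into,
    # so return an empty group instead of computing downstream values.
    if len(candidates) == 0 or len(scores) == 0:
        return []
    spans: List[SpanTuple] = []
    for (start, end), row in zip(candidates, scores):
        for label, score in zip(labels, row):
            # Multilabel behaviour: keep every label that clears the threshold.
            if score >= threshold:
                spans.append((start, end, label, score))
    return spans


empty = make_span_group([], [], ["PER", "ORG"])
filled = make_span_group([(0, 2), (3, 4)], [[0.9, 0.1], [0.2, 0.6]], ["PER", "ORG"])
```

Handling the empty case up front keeps the rest of the function free of special cases for empty score arrays.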
Lj Miranda
0336618eff
Merge branch 'master' into add/exclusive-spancat 2022-12-12 16:26:48 +08:00
Daniël de Kok
27fac7df2e
EditTreeLemmatizer: correctly add strings when initializing from labels (#11934)
Strings in replacement nodes were not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.

This change fixes both the bug in the initialization and the 'shadowing'
of the bug in the test.
2022-12-07 13:53:41 +09:00
Lj Miranda
9e88108298 Remove init_W and init_B parameters
This commit is expected to fail until the new Thinc release.
2022-12-05 08:13:59 +08:00
Adriane Boyd
445c670a2d
Fix spancat for zero suggestions (#11860)
* Add test for spancat predict with zero suggestions

* Fix spancat for zero suggestions

* Undo changes to extract_spans

* Use .sum() as in update
2022-12-02 09:33:52 +01:00
Lj Miranda
6a10d56caf Update spancat_exclusive docstring 2022-11-30 15:43:49 +08:00
Paul O'Leary McCann
f1e0243450
Remove macro auc per type from textcat defaults (#11887)
This appears to have been added by mistake and never used. Removing it
does not break validation.
2022-11-29 11:50:23 +01:00
Lj Miranda
14bf26d3e6 Merge branch 'add/exclusive-spancat' of github.com:ljvmiranda921/spaCy into add/exclusive-spancat 2022-11-29 11:37:16 +08:00
Lj Miranda
a1be07e2da Put back initializers in spancat config
Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:

    SystemError: <method '__getitem__' of 'dict' objects> returned a result
    with an error set.

My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.
2022-11-29 11:32:38 +08:00
Lj Miranda
8138e49764
Update defaults for number of rows
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-29 11:26:04 +08:00
Lj Miranda
616723e902 Merge branch 'add/exclusive-spancat' of github.com:ljvmiranda921/spaCy into add/exclusive-spancat 2022-11-29 11:15:15 +08:00
Lj Miranda
0b32a949f1 Remove mypy ignore and typecast labels to list 2022-11-29 11:14:43 +08:00
Lj Miranda
14ae4a52c0 Clarify docstring for Exclusive_SpanCategorizer 2022-11-29 11:11:26 +08:00
Lj Miranda
29f156aa1a
Update documentation
Update grammar and usage

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-29 11:06:35 +08:00
Lj Miranda
bd0562e609 Use DEFAULT_EXCL_SPANCAT_MODEL
I also renamed spancat_exclusive_default_config to
spancat_excl_default_config because black otherwise reformats the
longer name awkwardly.
2022-11-29 11:01:18 +08:00