Commit Graph

610 Commits

Author SHA1 Message Date
Paul O'Leary McCann
f9c82e249c Update error number
This was changed by merge
2022-07-11 20:14:36 +09:00
Paul O'Leary McCann
5969634e92 Merge branch 'master' into coref/dimension-inference 2022-07-11 20:11:51 +09:00
Paul O'Leary McCann
4d032396b8 Merge branch 'feature/coref' into coref/dimension-inference 2022-07-11 19:18:46 +09:00
Paul O'Leary McCann
6d9eafeb37
Merge branch 'feature/coref' into fix/coref-alignment 2022-07-11 19:14:37 +09:00
Paul O'Leary McCann
da81a90d64 Span predictor leftovers 2022-07-06 19:29:27 +09:00
Paul O'Leary McCann
b0800ea855 Do dimension inference in span predictor 2022-07-06 19:22:37 +09:00
Paul O'Leary McCann
f67c1735c5 Remove tok2vec_size from coref 2022-07-06 18:58:57 +09:00
Paul O'Leary McCann
bd17c38b74 It works!
Was missing the serialization-related code from biaffine.
2022-07-06 18:58:22 +09:00
Paul O'Leary McCann
ce49136458 Update NotImplementedError for coref component 2022-07-06 17:28:15 +09:00
Paul O'Leary McCann
5e405738d2 Update span predictor docstrings 2022-07-06 17:28:05 +09:00
Paul O'Leary McCann
c4de3e51a2 Remove old TODOs 2022-07-06 17:23:41 +09:00
Paul O'Leary McCann
da9c379355 Update docs
Parameter names in architecture docs were not updated after parameters
were renamed.
2022-07-06 17:13:31 +09:00
Daniël de Kok
a06cbae70d
precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069)
* precompute_hiddens/Parser: do not look up CPU ops

`get_ops("cpu")` is quite expensive. To avoid this, we want to cache the
result as in #11068. However, for 3.x we do not want to change the ABI.
So we avoid the expensive lookup by using NumpyOps. This should have a
minimal impact, since `get_ops("cpu")` was only used when the model ops
were `CupyOps`. If the ops are `AppleOps`, we are still passing through
the correct BLAS implementation.

* _NUMPY_OPS -> NUMPY_OPS
2022-07-05 10:53:42 +02:00
kadarakos
5240baccfe
dont use get_array_module (#11056) 2022-07-04 17:15:33 +02:00
Raphael Mitsch
e9eb59699f
NEL confidence threshold (#11016)
* Add base for NEL abstention threshold mechanism.

* Add abstention threshold to entity linker. Add test.

* Fix entity linking tests.

* Changed abstention default threshold from 0 to None.

* Fix default values for abstention thresholds.

* Fix mypy errors.

* Replace assertion with raise of proper error code.

* Simplify threshold check. Remove thresholding from EntityLinker_v1.

* Rename test.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Make E1043 configurable.

* Update docs.

* Rephrase description in docs. Adjusting error code message.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-04 17:05:21 +02:00
Paul O'Leary McCann
178feae00a Add tests to give up with whitespace differences
Docs in Examples are allowed to have arbitrarily different whitespace.
Handling that properly would be nice but isn't required, but for now
check for it and blow up.
2022-07-04 19:37:42 +09:00
Paul O'Leary McCann
b09bbc7f5e Fix alignment issues
I believe this resolves issues with tokenization mismatches.
2022-07-03 20:11:03 +09:00
Paul O'Leary McCann
79720886fa Merge branch 'feature/coref' into fix/coref-alignment
Had to renumber error message.
2022-07-01 19:09:29 +09:00
kadarakos
9f9453865a Merge branch 'master' into feature/coref 2022-06-28 10:27:35 +00:00
Paul O'Leary McCann
d1ff933e9b Test works
This may not be done yet, as the test is just for consistency, and not
overfitting correctly yet.
2022-06-28 19:15:33 +09:00
Paul O'Leary McCann
16894e665d
Refactor Coval Scoring code (#10875)
* Move coref scoring code to scorer.py

Includes some renames to make names less generic.

* Refactor coval code to remove ternary expressions

* Black formatting

* Add header

* Make scorers into registered scorers

* Small test fixes

* Skip coref tests when torch not present

Coref can't be loaded without Torch, so nothing works.

* Fix remaining type issues

Some of this just involves ignoring types in thorny areas. Two main
issues:

1. Some things have weird types due to indirection/ argskwargs
2. xp2torch return type seems to have changed at some point

* Update spacy/scorer.py

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Small changes from review

* Be specific about the ValueError

* Type fix

Co-authored-by: kadarakos <kadar.akos@gmail.com>
2022-06-22 16:05:52 +09:00
Sofie Van Landeghem
eaeca5eb6a
account for NER labels with a hyphen in the name (#10960)
* account for NER labels with a hyphen in the name

* cleanup

* fix docstring

* add return type to helper method

* shorter method and few more occurrences

* user helper method across repo

* fix circular import

* partial revert to avoid circular import
2022-06-17 20:02:37 +01:00
kadarakos
1bb87f35bc
Detect cycle during projectivize (#10877)
* detect cycle during projectivize

* not complete test to detect cycle in projectivize

* boolean to int type to propagate error

* use unordered_set instead of set

* moved error message to errors

* removed cycle from test case

* use find instead of count

* cycle check: only perform one lookup

* Return bool again from _has_head_as_ancestor

Communicate presence of cycles through an output argument.

* Switch to returning std::pair to encode presence of a cycle

The has_cycle pointer is too easy to misuse. Ideally, we would have a
sum type like Rust's `Result` here, but C++ is not there yet.

* _is_non_proj_arc: clarify what we are returning

* _has_head_as_ancestor: remove count

We are now explicitly checking for cycles, so the algorithm must always
terminate. Either we encounter the head, we find a root, or a cycle.

* _is_nonproj_arc: simplify condition

* Another refactor using C++ exceptions

* Remove unused error code

* Print graph with cycle on exception

* Include .hh files in source package

* Add FIXME comment

* cycle detection test

* find cycle when starting from problematic vertex

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2022-06-08 19:34:11 +02:00
Paul O'Leary McCann
196886bbca
Fix coref size inference (#10916)
* Add explicit tok2vec_size parameter in clusterer

* Add tok2vec size to span predictor config

* Minor fixes
2022-06-08 20:03:41 +09:00
Adriane Boyd
a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
Sofie Van Landeghem
f7507c2327
fix typo + CI slow testing (#10835)
* fix typo

* one more typo
2022-06-02 00:10:16 +02:00
Paul O'Leary McCann
dca2e8c644
Minor NEL type fixes (#10860)
* Fix TODO about typing

Fix was simple: just request an array2f.

* Add type ignore

Maxout has a more restrictive type than the residual layer expects (only
Floats2d vs any Floats).

* Various cleanup

This moves a lot of lines around but doesn't change any functionality.
Details:

1. use `continue` to reduce indentation
2. move sentence doc building inside conditional since it's otherwise
   unused
3. reduces some temporary assignments
2022-06-01 00:41:28 +02:00
Daniël de Kok
85dd2b6c04
Parser: use C saxpy/sgemm provided by the Ops implementation (#10773)
* Parser: use C saxpy/sgemm provided by the Ops implementation

This is a backport of https://github.com/explosion/spaCy/pull/10747
from the parser refactor branch. It eliminates the explicit calls
to BLIS, instead using the saxpy/sgemm provided by the Ops
implementation.

This allows us to use Accelerate in the parser on M1 Macs (with
an updated thinc-apple-ops).

Performance of the de_core_news_lg pipe:

BLIS 0.7.0, no thinc-apple-ops:  6385 WPS
BLIS 0.7.0, thinc-apple-ops:    36455 WPS
BLIS 0.9.0, no thinc-apple-ops: 19188 WPS
BLIS 0.9.0, thinc-apple-ops:    36682 WPS
This PR, thinc-apple-ops:       38726 WPS

Performance of the de_core_news_lg pipe (only tok2vec -> parser):

BLIS 0.7.0, no thinc-apple-ops: 13907 WPS
BLIS 0.7.0, thinc-apple-ops:    73172 WPS
BLIS 0.9.0, no thinc-apple-ops: 41576 WPS
BLIS 0.9.0, thinc-apple-ops:    72569 WPS
This PR, thinc-apple-ops:       87061 WPS

* Require thinc >=8.1.0,<8.2.0

* Lower thinc lowerbound to 8.1.0.dev0

* Use best CPU ops for CBLAS when the parser model is on the GPU

* Fix another unguarded cblas() call

* Fix: use ops as a shorthand for self.model.ops

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-05-27 11:20:52 +02:00
svlandeg
aa2eb2789c small type fixes 2022-05-25 13:50:54 +02:00
svlandeg
cea40c9d7b fix types + black formatting 2022-05-25 13:34:09 +02:00
svlandeg
015050f42c Merge branch 'master' into feature/coref 2022-05-25 13:01:56 +02:00
Paul O'Leary McCann
6087da9675 Suggestions from code review, cleanup, typing 2022-05-25 19:11:48 +09:00
Richard Hudson
32954c3bcb
Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786)
* Make changes to typing

* Correction

* Format with black

* Corrections based on review

* Bumped Thinc dependency version

* Bumped blis requirement

* Correction for older Python versions

* Update spacy/ml/models/textcat.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Corrections based on review feedback

* Readd deleted docstring line

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2022-05-25 09:33:54 +02:00
Paul O'Leary McCann
6be09bbd07
Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now the the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-23 20:42:26 +02:00
kadarakos
1dc3894447 new parameters 2022-05-17 15:36:32 +00:00
kadarakos
403fb95d56 merge 2022-05-17 06:56:34 +00:00
Paul O'Leary McCann
2e8f0e9168 Rename coref params 2022-05-16 16:50:10 +09:00
kadarakos
b7ac4b33e2 fixing arguments 2022-05-11 14:59:59 +00:00
kadarakos
7cf6bcca0e merge misery 2022-05-10 17:19:16 +00:00
Paul O'Leary McCann
33f4f90ff0 Formatting 2022-05-10 19:09:52 +09:00
Paul O'Leary McCann
f852c5cea4 Split span predictor component into its own file
This runs. The imports in both of the split files could probably use a
close check to remove extras.
2022-05-10 18:53:45 +09:00
Raphael Mitsch
f5390e278a
Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-05-02 13:38:46 +02:00
Paul O'Leary McCann
683f470852 Merge branch 'master' into feature/coref 2022-04-18 18:39:08 +09:00
Paul O'Leary McCann
afd255c0ed Undo multiply by 100
This was mistaken, not sure why my score seemed to be off before.
2022-04-14 18:42:09 +09:00
Paul O'Leary McCann
08729e0fbd Remove end adjustment
The difference in environments was due to a change in Thinc, the code
here is fine.
2022-04-14 18:31:30 +09:00
Paul O'Leary McCann
8181d4570c Multiply accuracy by 100
This seems to match with the scorer expectations better
2022-04-14 15:56:38 +09:00
Paul O'Leary McCann
e8af02700f Remove all coref scoring exept LEA
This is necessary because one of the three old methods relied on scipy
for some complex problem solving. LEA is generally better for
evaluations.

The downside is that this means evaluations aren't comparable with many
papers, but canonical scoring can be supported using external eval
scripts or other methods.
2022-04-13 21:02:18 +09:00
Paul O'Leary McCann
2300f4df3d Fix span score logging 2022-04-13 20:37:06 +09:00
Paul O'Leary McCann
d470fa03c1 Adjust end indices
It's not clear if this is technically correct or not but it won't run
without it for me.
2022-04-13 20:19:21 +09:00
kadarakos
b53113e3b8
Preparing span predictor for predicting from gold (#10547)
Note this is squashed because rebasing had conflicts.

* remove unnecessary .device

* span predictor debug start

* gearing up SpanPredictor for gold-heads

* merge SpanPredictor attributes

* remove useless extra prefix and device from spanpredictor

* make sure predicted and reference keeps aligned

* handle empty head_ids

* handle empty clusters

* addressing suggestions by @polm

* nicer restore

* fix score overwriting bug

* prepare for aligned heads-spans training

* span accuracy score

* update with eg.predited as other components

* add backprop callback to spanpredictor

* report start- and end-accuracies separately

* fixing scorer

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
2022-04-13 19:42:49 +09:00