Commit Graph

580 Commits

Author SHA1 Message Date
svlandeg
015050f42c Merge branch 'master' into feature/coref 2022-05-25 13:01:56 +02:00
Paul O'Leary McCann
6087da9675 Suggestions from code review, cleanup, typing 2022-05-25 19:11:48 +09:00
Richard Hudson
32954c3bcb
Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786)
* Make changes to typing

* Correction

* Format with black

* Corrections based on review

* Bumped Thinc dependency version

* Bumped blis requirement

* Correction for older Python versions

* Update spacy/ml/models/textcat.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Corrections based on review feedback

* Readd deleted docstring line

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2022-05-25 09:33:54 +02:00
Paul O'Leary McCann
6be09bbd07
Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now that the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-23 20:42:26 +02:00
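
The empty-shape mismatch described in 6be09bbd07 can be illustrated without any array library. The following is only a sketch: `n_elements` and `can_compare` are hypothetical helpers, not functions from spaCy or Thinc, and shapes are modeled as plain tuples. It shows why `(0, 3)` and `(0,)` fail a strict equality check even though both hold zero elements, and how a short circuit on empties sidesteps the comparison.

```python
from math import prod
from typing import Tuple

Shape = Tuple[int, ...]


def n_elements(shape: Shape) -> int:
    """Total number of elements in an array with the given shape."""
    return prod(shape)


def can_compare(shape_a: Shape, shape_b: Shape) -> bool:
    """Hypothetical guard: two empty arrays are comparable whatever their rank."""
    if n_elements(shape_a) == 0 and n_elements(shape_b) == 0:
        # Short circuit: there is nothing to compare, so skip the shape check.
        return True
    return shape_a == shape_b


# The shapes from the commit message differ as tuples, although both are empty.
print((0, 3) == (0,))
print(can_compare((0, 3), (0,)))
```

The guard only applies when both sides are empty; any non-empty pair still has to match exactly.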
kadarakos
1dc3894447 new parameters 2022-05-17 15:36:32 +00:00
kadarakos
403fb95d56 merge 2022-05-17 06:56:34 +00:00
Paul O'Leary McCann
2e8f0e9168 Rename coref params 2022-05-16 16:50:10 +09:00
kadarakos
b7ac4b33e2 fixing arguments 2022-05-11 14:59:59 +00:00
kadarakos
7cf6bcca0e merge misery 2022-05-10 17:19:16 +00:00
Paul O'Leary McCann
33f4f90ff0 Formatting 2022-05-10 19:09:52 +09:00
Paul O'Leary McCann
f852c5cea4 Split span predictor component into its own file
This runs. The imports in both of the split files could probably use a
close check to remove extras.
2022-05-10 18:53:45 +09:00
Raphael Mitsch
f5390e278a
Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-05-02 13:38:46 +02:00
Paul O'Leary McCann
683f470852 Merge branch 'master' into feature/coref 2022-04-18 18:39:08 +09:00
Paul O'Leary McCann
afd255c0ed Undo multiply by 100
This was mistaken, not sure why my score seemed to be off before.
2022-04-14 18:42:09 +09:00
Paul O'Leary McCann
08729e0fbd Remove end adjustment
The difference in environments was due to a change in Thinc; the code
here is fine.
2022-04-14 18:31:30 +09:00
Paul O'Leary McCann
8181d4570c Multiply accuracy by 100
This seems to match with the scorer expectations better
2022-04-14 15:56:38 +09:00
Paul O'Leary McCann
e8af02700f Remove all coref scoring except LEA
This is necessary because one of the three old methods relied on scipy
for some complex problem solving. LEA is generally better for
evaluations.

The downside is that this means evaluations aren't comparable with many
papers, but canonical scoring can be supported using external eval
scripts or other methods.
2022-04-13 21:02:18 +09:00
Paul O'Leary McCann
2300f4df3d Fix span score logging 2022-04-13 20:37:06 +09:00
Paul O'Leary McCann
d470fa03c1 Adjust end indices
It's not clear if this is technically correct or not but it won't run
without it for me.
2022-04-13 20:19:21 +09:00
kadarakos
b53113e3b8
Preparing span predictor for predicting from gold (#10547)
Note this is squashed because rebasing had conflicts.

* remove unnecessary .device

* span predictor debug start

* gearing up SpanPredictor for gold-heads

* merge SpanPredictor attributes

* remove useless extra prefix and device from spanpredictor

* make sure predicted and reference stay aligned

* handle empty head_ids

* handle empty clusters

* addressing suggestions by @polm

* nicer restore

* fix score overwriting bug

* prepare for aligned heads-spans training

* span accuracy score

* update with eg.predicted as other components

* add backprop callback to spanpredictor

* report start- and end-accuracies separately

* fixing scorer

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
2022-04-13 19:42:49 +09:00
Kádár Ákos
6aedd98d02 fixing scorer 2022-04-11 16:10:14 +02:00
Kádár Ákos
7a239f2ec7 report start- and end-accuracies separately 2022-04-08 14:57:19 +02:00
Kádár Ákos
3ba913109d update with eg.predicted as other components 2022-04-07 13:20:12 +02:00
Kádár Ákos
ef141ad399 span accuracy score 2022-04-04 18:10:09 +02:00
Kádár Ákos
a1d0219903 prepare for aligned heads-spans training 2022-04-04 15:26:15 +02:00
Daniël de Kok
c90dd6f265
Alignment: use a simplified ragged type for performance (#10319)
* Alignment: use a simplified ragged type for performance

This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.

* AlignmentArray: raise an error when using unsupported index

* AlignmentArray: move error messages to Errors

* AlignmentArray: remove simplified ... with simplifications

* AlignmentArray: fix typo that broke a[n:n] indexing
2022-04-01 09:02:06 +02:00
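
For readers unfamiliar with ragged arrays, the idea behind c90dd6f265 can be sketched as a flat data list plus row offsets. This is an illustration only: `FlatRagged` is a hypothetical class, not spaCy's `AlignmentArray` or Thinc's `Ragged`, and it supports integer indexing alone, rejecting anything else in the spirit of the "raise an error when using unsupported index" bullet.

```python
from typing import List, Sequence


class FlatRagged:
    """Hypothetical ragged container: rows of varying length stored flat."""

    def __init__(self, rows: Sequence[Sequence[int]]) -> None:
        # All values in one flat list, row after row.
        self.data: List[int] = [value for row in rows for value in row]
        # offsets[i] is where row i starts; offsets[i + 1] is where it ends.
        self.offsets: List[int] = [0]
        for row in rows:
            self.offsets.append(self.offsets[-1] + len(row))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, index: int) -> List[int]:
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported in this sketch")
        if not 0 <= index < len(self):
            raise IndexError("row index out of range")
        return self.data[self.offsets[index]:self.offsets[index + 1]]


# Token 0 aligns to [0], token 1 to nothing, token 2 to [1, 2].
alignment = FlatRagged([[0], [], [1, 2]])
print(alignment[2])
```

Looking up a row costs two offset reads and one slice, which is the kind of simple indexing the commit message says alignment needs.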
Kádár Ákos
63a41ba50a fix score overwriting bug 2022-03-30 17:28:20 +02:00
Kádár Ákos
7ff99a3acc nicer restore 2022-03-28 18:16:41 +02:00
Kádár Ákos
06d680b269 addressing suggestions by @polm 2022-03-28 14:31:51 +02:00
Kádár Ákos
e4b4b67ef6 handle empty clusters 2022-03-28 11:29:00 +02:00
Adriane Boyd
85778dfcf4
Add edit tree lemmatizer (#10231)
* Add edit tree lemmatizer

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Hide edit tree lemmatizer labels

* Use relative imports

* Switch to single quotes in error message

* Type annotation fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat edit_tree_lemmatizer with black

* EditTreeLemmatizer.predict: take Iterable

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Validate edit trees during deserialization

This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.

* Move edit_trees to _edit_tree_internals

* Fix invalid edit tree format error message

* edit_tree_lemmatizer: remove outdated TODO comment

* Rename factory name to trainable_lemmatizer

* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14

* Switch to Tagger.v2

* Add documentation for EditTreeLemmatizer

* docs: Fix 3.2 -> 3.3 somewhere

* trainable_lemmatizer documentation fixes

* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-28 11:13:50 +02:00
Kádár Ákos
7304604edd make sure predicted and reference stay aligned 2022-03-25 18:29:33 +01:00
Kádár Ákos
83ac0477c8 remove useless extra prefix and device from spanpredictor 2022-03-24 16:44:50 +01:00
Kádár Ákos
706b2e6f25 gearing up SpanPredictor for gold-heads 2022-03-24 16:06:20 +01:00
Kádár Ákos
1eaf8fb0cf span predictor debug start 2022-03-23 11:24:27 +01:00
Paul O'Leary McCann
2190cbc0e6 Add progress on SpanPredictor component
This isn't working. There is a CUDA error in the torch code during
initialization and it's not clear why.
2022-03-19 19:39:49 +09:00
Paul O'Leary McCann
a098849112 Add fake batching
The way fake batching works is that the pipeline component calls the
model repeatedly in a loop internally. It feels like this should break
something, but it worked in testing.

Another issue is that this changes the signature of some of the pipeline
functions, though I don't think that's an issue.

Tested with batch size of 2, so more testing is needed, but this is a
start.
2022-03-18 19:46:58 +09:00
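
The fake batching described in a098849112 amounts to calling a single-document model once per item and collecting the results. A minimal sketch, assuming the model is any callable that takes one document; `predict_fake_batched` is a hypothetical name and not part of the coref component:

```python
from typing import Callable, List, Sequence, TypeVar

DocT = TypeVar("DocT")
OutT = TypeVar("OutT")


def predict_fake_batched(
    model: Callable[[DocT], OutT], docs: Sequence[DocT]
) -> List[OutT]:
    """Run a one-document-at-a-time model over a batch by looping internally."""
    # The caller sees a batch interface; the model never sees more than one doc.
    return [model(doc) for doc in docs]


# Stand-in "model" that just counts whitespace-separated tokens.
print(predict_fake_batched(lambda text: len(text.split()), ["a b", "c d e"]))
```

The loop gives up any speed benefit of real batching, but it lets the component expose the same batch-level signature as other pipeline components.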
Paul O'Leary McCann
1a79d18796 Formatting 2022-03-16 20:10:47 +09:00
Paul O'Leary McCann
6855df0e66 Skeleton for span predictor component
This should be moved into its own file, but for now just stubbing out
the methods.
2022-03-16 20:09:33 +09:00
Paul O'Leary McCann
7811a1194b Change architecture 2022-03-16 14:57:15 +09:00
Daniël de Kok
e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
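
The reasoning in e5debc68e4 rests on the fact that softmax does not change which class scores highest: `exp` is strictly increasing and the normalizer is a positive constant, so the argmax of the raw scores equals the argmax of the probabilities. The sketch below checks this in plain Python. The helper names are illustrative and this is not Thinc's `Softmax_v2` implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


raw_scores = [1.2, -0.3, 3.4, 0.5]
# Same predicted class with or without normalization.
print(argmax(raw_scores), argmax(softmax(raw_scores)))
```

When only the predicted label is needed, the `exp` calls and the division can therefore be skipped at inference time. The probabilities themselves are lost, which is why the commit makes normalization configurable.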
Paul O'Leary McCann
55039a66ad Remove old default config 2022-03-15 19:53:09 +09:00
Paul O'Leary McCann
17d017a177 Remove span2head
This doesn't work as a component because it needs to modify gold data,
so instead it's a conversion script (in another repo).
2022-03-15 19:52:20 +09:00
Paul O'Leary McCann
0522a43116 Make span2head component 2022-03-15 19:19:15 +09:00
Edward
2eef47dd26
Save span candidates produced by spancat suggesters (#10413)
* Add save_candidates attribute

* Change spancat api

* Add unit test

* reimplement method to produce a list of doc

* Add method to docs

* Add new version tag

* Add intended use to docstring

* prettier formatting
2022-03-14 16:46:58 +01:00
Paul O'Leary McCann
dfec6993d6 Training works now 2022-03-14 19:27:23 +09:00
Paul O'Leary McCann
8eadf3781b Training runs now
Evaluation needs fixing, and code still needs cleanup.
2022-03-14 19:02:17 +09:00
Paul O'Leary McCann
d22a002641 Forward/backward pass works
Evaluate does not work - predict hasn't been updated
2022-03-14 17:26:27 +09:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as though the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
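
The "Fix gradient calculation" step in 91acc3ea75 can be sketched as follows: produce one gradient row per input embedding, and use a row of zeros for entities that are not in the kb, so the gradient keeps the same shape as the model output. This is an assumption-laden illustration. `masked_gradient` is a hypothetical helper, vectors are plain lists, and the gradient is a simple difference rather than the loss the entity linker actually uses.

```python
from typing import List

Vector = List[float]


def masked_gradient(
    predicted: List[Vector], gold: List[Vector], in_kb: List[bool]
) -> List[Vector]:
    """Per-entity gradient rows, zeroed where the entity is not in the kb."""
    gradients: List[Vector] = []
    for pred_row, gold_row, known in zip(predicted, gold, in_kb):
        if known:
            # Illustrative gradient: difference between prediction and target.
            gradients.append([p - g for p, g in zip(pred_row, gold_row)])
        else:
            # No sensible gradient, but keep a row so dimensions still match.
            gradients.append([0.0] * len(pred_row))
    return gradients


grads = masked_gradient(
    predicted=[[1.0, 2.0], [3.0, 4.0]],
    gold=[[0.5, 2.0], [0.0, 0.0]],
    in_kb=[True, False],
)
print(len(grads))
```

Dropping the unknown rows instead would return fewer gradient rows than input embeddings, which is the dimension mismatch in backprop that the commit message describes.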
kadarakos
249b97184d
Bugfixes and test for rehearse (#10347)
* fixing argument order for rehearse

* rehearse test for ner and tagger

* rehearse bugfix

* added test for parser

* test for multilabel textcat

* rehearse fix

* remove debug line

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-23 16:10:05 +01:00