Commit Graph

8791 Commits

Author SHA1 Message Date
Paul O'Leary McCann
9b63cbb775 Add extract spans import 2021-07-15 18:16:53 +09:00
Paul O'Leary McCann
e9626e38c1 Fix serialization test
This test was failing not because the thing it was testing wasn't
working, but because of the way span equality works. Span equality
relies on doc equality, and doc equality is object identity, so spans
from different docs will never be equal.
2021-07-14 18:37:34 +09:00
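The equality behaviour described in this message can be illustrated with a toy model. `ToyDoc` and `ToySpan` below are hypothetical stand-ins, not spaCy classes; the sketch only assumes what the message states, namely that span equality defers to doc equality and doc equality is object identity.

```python
class ToyDoc:
    """Hypothetical stand-in for a document; equality is plain object identity."""

    def __init__(self, words):
        self.words = list(words)


class ToySpan:
    """Hypothetical stand-in for a span whose equality defers to its doc."""

    def __init__(self, doc, start, end):
        self.doc = doc
        self.start = start
        self.end = end

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        # The doc comparison falls back to identity, so it dominates the result.
        return (
            self.doc == other.doc
            and self.start == other.start
            and self.end == other.end
        )

    def __hash__(self):
        return hash((id(self.doc), self.start, self.end))

    def key(self):
        """A doc-independent value that a test can compare instead."""
        return (self.start, self.end, tuple(self.doc.words[self.start:self.end]))


doc_a = ToyDoc(["I", "like", "cats"])
doc_b = ToyDoc(["I", "like", "cats"])  # e.g. a copy rebuilt after serialization

same_doc = ToySpan(doc_a, 1, 3) == ToySpan(doc_a, 1, 3)
cross_doc = ToySpan(doc_a, 1, 3) == ToySpan(doc_b, 1, 3)
by_key = ToySpan(doc_a, 1, 3).key() == ToySpan(doc_b, 1, 3).key()
```

A round-trip test in this toy setting would compare `key()` values (offsets and text) rather than the span objects themselves.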
Paul O'Leary McCann
4a9dc00d86 Use relative indices for mentions
Was using batch absolute indices to manage mentions, but extract_spans
expects doc-relative ones.
2021-07-14 18:36:18 +09:00
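A minimal sketch of the index conversion this message refers to, assuming mentions are `(start, end)` token offsets over the concatenated batch. The function name `to_doc_relative` and the data layout are assumptions for illustration, not code from the commit.

```python
from bisect import bisect_right


def to_doc_relative(mentions, doc_lengths):
    """Convert batch-absolute (start, end) offsets to (doc_index, start, end).

    Assumes each mention lies entirely inside one doc.
    """
    # Offset of the first token of each doc within the concatenated batch.
    offsets = []
    total = 0
    for length in doc_lengths:
        offsets.append(total)
        total += length

    relative = []
    for start, end in mentions:
        # The owning doc is the last one whose offset is <= start.
        doc_index = bisect_right(offsets, start) - 1
        offset = offsets[doc_index]
        relative.append((doc_index, start - offset, end - offset))
    return relative


# Two docs of 3 and 4 tokens; the second mention sits in the second doc.
converted = to_doc_relative([(0, 2), (4, 6), (3, 7)], [3, 4])
```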
Paul O'Leary McCann
3684f7fdfd Remove comment from fixed test 2021-07-14 18:22:14 +09:00
Paul O'Leary McCann
f1796e4af7 Fix mention list bug
There was an off-by-one error in how mentions are generated that would
affect mentions at the end of a sentence. This was pretty nasty.
2021-07-14 18:19:00 +09:00
Paul O'Leary McCann
80a17071d3 Remove unused code 2021-07-11 18:46:39 +09:00
Paul O'Leary McCann
447c7070e3 Fix loss
Accidentally deleted it
2021-07-10 22:45:25 +09:00
Paul O'Leary McCann
c25ec292a9 Cleanup 2021-07-10 22:42:55 +09:00
Paul O'Leary McCann
e00bd422d9 Fix span embeds
Some of the lengths and backprop weren't right.

Also various cleanup.
2021-07-10 21:38:53 +09:00
Paul O'Leary McCann
d7d317a1b5 Clean up span embedding code
This is now cleaner and significantly faster. There are still some messy
parts in the code (particularly variable names); will get to those later.
2021-07-10 19:59:08 +09:00
Paul O'Leary McCann
dc1f974d39 Merge branch 'master' into feature/coref 2021-07-10 18:10:40 +09:00
Paul O'Leary McCann
f34915c1e8 Use scatter_add to speed up span embed backprop
This was the slowest part of the code, and using scatter_add here
probably reduces the runtime by 50%.
2021-07-10 18:08:51 +09:00
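For readers unfamiliar with the operation, a scatter-add accumulates rows of one array into another at given indices, correctly handling repeated indices. The sketch below uses NumPy's `np.add.at` as a stand-in and compares it with an explicit Python loop; the function names and shapes are assumptions, and this is not the code from the commit.

```python
import numpy as np


def backprop_loop(d_spans, token_ids, n_tokens):
    """Accumulate one gradient row per entry into its token row, one at a time."""
    d_tokens = np.zeros((n_tokens, d_spans.shape[1]), dtype=d_spans.dtype)
    for row, token_id in zip(d_spans, token_ids):
        d_tokens[token_id] += row
    return d_tokens


def backprop_scatter(d_spans, token_ids, n_tokens):
    """Same accumulation as a single unbuffered scatter-add."""
    d_tokens = np.zeros((n_tokens, d_spans.shape[1]), dtype=d_spans.dtype)
    # np.add.at handles repeated indices, unlike d_tokens[token_ids] += d_spans.
    np.add.at(d_tokens, token_ids, d_spans)
    return d_tokens


d_spans = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
token_ids = np.array([0, 2, 0])  # token 0 receives two gradient rows
looped = backprop_loop(d_spans, token_ids, 3)
scattered = backprop_scatter(d_spans, token_ids, 3)
```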
Adriane Boyd
d8805a1073
Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead of a class variable for the morphological
analyzer so that multiprocessing with spawn is possible.
2021-07-09 15:36:56 +02:00
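One plausible reading of why this matters: with the spawn start method, objects reach the worker process through pickling, and by default only an instance's own `__dict__` is pickled. State assigned to the class at runtime in the parent is not part of it. The classes below are hypothetical and only illustrate that difference; they are not the lemmatizer code.

```python
class ClassLevel:
    """Hypothetical component that caches its analyzer on the class."""

    analyzer = None

    def __init__(self):
        if ClassLevel.analyzer is None:
            # Set once on the class in the parent process.
            ClassLevel.analyzer = "analyzer-loaded-in-parent"


class InstanceLevel:
    """Hypothetical component that keeps its analyzer on the instance."""

    def __init__(self):
        self.analyzer = "analyzer-loaded-in-parent"


# vars(obj) is the instance __dict__, which is what pickle sends by default.
class_state = vars(ClassLevel())
instance_state = vars(InstanceLevel())
```

Here `class_state` is empty, so a freshly spawned worker that re-imports the module would see `ClassLevel.analyzer` as `None` again, while `instance_state` carries the analyzer along with the object.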
Adriane Boyd
b8e720fdb9
Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-09 15:36:35 +02:00
explosion-bot
334f1f98d8 Auto-format code with black 2021-07-09 08:06:06 +00:00
Paul O'Leary McCann
d0b041aff4 Switch to using Thinc tuplify
The tuplify code here was added to Thinc proper and that's been
released, so no need to have it here any more.
2021-07-08 16:08:36 +09:00
Sofie Van Landeghem
64fac754fe
add spacy prefix to ngram_suggester.v1 (#8623) 2021-07-07 08:09:30 +02:00
Sofie Van Landeghem
733e8ceea9
fix spancat initialize with labels (#8620) 2021-07-06 19:08:25 +02:00
Sofie Van Landeghem
608fc1d623
avoid msg var impliciteness (#8619)
* avoid msg var impliciteness

* rename local msg

* Add CI tests for debug data and train

* Adjust debug data CLI test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-06 19:08:08 +02:00
Sofie Van Landeghem
e7d747e3ee
TransitionBasedParser.v1 to legacy (#8586)
* TransitionBasedParser.v1 to legacy

* register sublayers

* bump spacy-legacy to 3.0.7
2021-07-06 15:26:45 +02:00
Luca Dorigo
e8ef4a46d5
Add the right return type for Language.pipe and an overload for the as_tuples case (#8441)
* Add the right return type for Language.pipe and an overload for the as_tuples version

* Reformat, tidy up

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-06 14:18:40 +02:00
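The general typing pattern named in this message can be sketched with `typing.overload` and `Literal`. The `pipe` function below is a hypothetical toy that upper-cases strings in place of producing docs; it is not the `Language.pipe` signature.

```python
from typing import Any, Iterable, Iterator, Literal, Tuple, overload


@overload
def pipe(texts: Iterable[str], *, as_tuples: Literal[False] = ...) -> Iterator[str]: ...


@overload
def pipe(
    texts: Iterable[Tuple[str, Any]], *, as_tuples: Literal[True]
) -> Iterator[Tuple[str, Any]]: ...


def pipe(texts, *, as_tuples=False):
    """Toy pipeline: upper-casing stands in for real processing."""
    if as_tuples:
        # Each item is (text, context); the context is passed through untouched.
        for text, context in texts:
            yield text.upper(), context
    else:
        for text in texts:
            yield text.upper()


plain = list(pipe(["a", "b"]))
with_context = list(pipe([("a", 1), ("b", 2)], as_tuples=True))
```

With the two overloads, a type checker can infer a different element type for each value of `as_tuples` instead of a single union return type.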
Sofie Van Landeghem
b9f59118bf
Fix silent evaluation (#8581)
* fix silentness

* sneak in docs typo fix

* pass silent boolean instead
2021-07-06 14:16:19 +02:00
Sofie Van Landeghem
3daf57d70c
Small spancat fixes (#8614)
* two small fixes + additional tests

* rename
2021-07-06 14:15:41 +02:00
Ines Montani
327f83573a
Move scores per type handling into util function (#8590) 2021-07-06 13:02:37 +02:00
Adriane Boyd
5fd0b5207e
Fix vectors check for sourced components (#8559)
* Fix vectors check for sourced components

Since vectors are not loaded when components are sourced, store a hash
for the vectors of each sourced component and compare it to the loaded
vectors after the vectors are loaded from the `[initialize]` block.

* Pop temporary info

* Remove stored hash in remove_pipe

* Add default for pop

* Add additional convert/debug/assemble CLI tests
2021-07-06 12:43:17 +02:00
Adriane Boyd
29906884c5
Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-06 12:35:22 +02:00
Paul O'Leary McCann
eb5820b593 Improve take_vecs implementation
This pulls out references to needed bits so that other parts (the larger
embeddings) can be freed before backprop.
2021-07-05 21:08:42 +09:00
Paul O'Leary McCann
13bef2ddb6 Add width prior feature
Not necessary for convergence, but in coref-hoi this seems to add a few
f1 points.

Note that there are two width-related features in coref-hoi. This is a
"prior" that is added to mention scores. The other width related feature
is appended to the span embedding representation for other layers to
reference.
2021-07-05 21:06:28 +09:00
Paul O'Leary McCann
8f66176b2d Fix loss?
This rewrites the loss to not use the Thinc crossentropy code at all.
The main difference here is that the negative predictions are being
masked out (= marginalized over), but negative gradient is still being
reflected.

I'm still not sure this is exactly right but models seem to train
reliably now.
2021-07-05 18:17:10 +09:00
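For context, a common formulation of this kind of loss in the coreference literature is the marginal log-likelihood over all gold antecedents. The sketch below implements that formulation in NumPy as an assumption about what "masked out (= marginalized over)" refers to; the commit's actual code is not shown here, and the names `marginal_loss` and `gold_mask` are hypothetical.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)


def marginal_loss(scores, gold_mask):
    """Marginal log-likelihood loss and its gradient with respect to scores.

    scores: (n_mentions, n_candidates) floats.
    gold_mask: same shape, bool, with at least one True per row.
    """
    # Non-gold candidates are masked to -inf so they drop out of the gold sum.
    masked = np.where(gold_mask, scores, -np.inf)
    log_norm = logsumexp(scores, axis=1)   # over all candidates
    log_marg = logsumexp(masked, axis=1)   # over gold candidates only
    loss = float(np.sum(log_norm - log_marg))

    probs = np.exp(scores - log_norm[:, None])
    gold_probs = np.exp(masked - log_marg[:, None])
    # Non-gold candidates still get a positive gradient (probs), pushing them down.
    d_scores = probs - gold_probs
    return loss, d_scores


scores = np.zeros((1, 3))
gold_mask = np.array([[True, False, False]])
loss, d_scores = marginal_loss(scores, gold_mask)
```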
Paul O'Leary McCann
5db28ec2fd Tweak mention limit calculation
The calculation of this in the coref-hoi code is hard to follow. Based
on comments and variable names it sounds like it's using the doc length,
but it might actually be the number of mentions? Number of mentions
should be much larger and seems more correct, but might want to revisit
this.
2021-07-03 21:13:32 +09:00
Paul O'Leary McCann
2d3c559dc4 On initialize, use just two samples
Coref docs are kind of long, and using 10 samples on a smallish GPU can
cause OOMs.
2021-07-03 18:43:03 +09:00
Paul O'Leary McCann
251a5b43ac Minor fix in crossing spans code
I think this was technically incorrect but harmless. The reason the code
here is different than the reference in coref-hoi is that the indices
there are such that they get +1 at the end of processing, while the code
here handles indices directly.
2021-07-03 18:41:46 +09:00
Paul O'Leary McCann
865caedebd Remove XXX comment
Comment wondered if there should be some subtraction to avoid double
counting, but it probably doesn't matter because the diagonal is 0.
2021-07-03 18:40:38 +09:00
Paul O'Leary McCann
d74fa82c80 Fix axis handling in topk
In practice this is only ever used with axis=1, so it wasn't causing
issues, even though it was wrong.
2021-07-03 18:39:25 +09:00
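A minimal sketch of an axis-aware top-k in NumPy, to show what correct axis handling looks like. This is an illustrative implementation under assumed names, not the function touched by the commit.

```python
import numpy as np


def topk(scores, k, axis=-1):
    """Return the k largest values and their indices along the given axis."""
    # Sort in descending order along the requested axis, not a hard-coded one.
    order = np.argsort(-scores, axis=axis)
    indices = np.take(order, np.arange(k), axis=axis)
    values = np.take_along_axis(scores, indices, axis=axis)
    return values, indices


scores = np.array([[1.0, 5.0, 3.0], [4.0, 2.0, 6.0]])
row_values, row_indices = topk(scores, 2, axis=1)
col_values, col_indices = topk(scores, 1, axis=0)
```

Testing with both `axis=0` and `axis=1` on a non-square array is one way to catch the kind of error that stays hidden when only `axis=1` is used in practice.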
Paul O'Leary McCann
f2e0e9dc28 Move placeholder handling into model code 2021-07-03 18:38:48 +09:00
Paul O'Leary McCann
3f66e18592 Clean up pw_prod loss
This doesn't change the math but makes the transposes slightly easier to
understand (maybe?).
2021-07-03 18:33:17 +09:00
explosion-bot
ee37288a1f Auto-format code with black 2021-07-02 07:48:26 +00:00
Ines Montani
af9d984407
Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip] 2021-06-30 20:52:59 +10:00
Adriane Boyd
2b8c679a3d
Fix duplicate spacy package CLI opts (#8551)
Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.
2021-06-30 11:23:26 +02:00
Ines Montani
7f65902702
Merge pull request #8522 from adrianeboyd/chore/update-flake8
Update flake8 version in reqs and CI
2021-06-28 21:46:06 +10:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
4b0ed73ed4 Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
Paul O'Leary McCann
b02df61eb9 Add test for crossing spans
This should maybe go elsewhere?
2021-06-28 18:21:00 +09:00
Paul O'Leary McCann
4f377d8de8 Fix bug in crossing span detection 2021-06-28 18:20:33 +09:00
Paul O'Leary McCann
23344857b9 Remove unused function 2021-06-28 18:19:43 +09:00
Paul O'Leary McCann
f144888793
Merge pull request #8504 from bryant1410/patch-1
Fix typo in comment
2021-06-27 13:51:19 +09:00
Santiago Castro
ee63b2b199
Fix typo in train_cli docstring 2021-06-25 22:45:03 -07:00
Santiago Castro
a2bc743e47
Fix typo in comment 2021-06-25 18:58:38 -07:00
Adrian Zuber
f5aee0bbdf
Raise custom error in EntityLinker when KB is not set (#8442)
* Raise custom error in EntityLinker when KB is not set

* add contributor agreement

* Update E1018 error message
2021-06-25 23:04:00 +02:00