Commit Graph

14560 Commits

Author SHA1 Message Date
Paul O'Leary McCann
5c98c4c3b9 Probably fix pw prod backprop
I think this change is correct, but intuition doesn't really help
here...
2021-06-17 21:23:00 +09:00
Paul O'Leary McCann
ccf561112a Remove old comments 2021-06-17 21:22:17 +09:00
Paul O'Leary McCann
a62121e3b4 Expose more hyperparameters 2021-06-17 21:21:46 +09:00
Paul O'Leary McCann
848fd102e7 Small fix 2021-06-17 21:19:38 +09:00
Paul O'Leary McCann
fce804a79f Minor optimization 2021-06-17 21:10:46 +09:00
Paul O'Leary McCann
cb2364cf83 Fix type of mask
The call here was creating a float64 array, which was turning many
downstream scores into float64s. Later on these values were assigned to
a float32 array in backprop, and numerical underflow caused things to go
to zero.

That's almost certainly not the only reason things go to zero, but it is
incorrect.
2021-06-17 17:56:00 +09:00
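
The dtype problem described in cb2364cf83 can be reproduced in isolation. The sketch below is not the code from the commit; the array names and values are made up for illustration, and it assumes NumPy arrays. It shows that an array created without an explicit dtype is float64, that mixing it with float32 data promotes the result to float64, and that a very small float64 value becomes zero once it is stored in a float32 array.

```python
import numpy as np

# Illustrative values only, not taken from the coref model.
scores = np.array([1e-30, 0.5], dtype="float32")

# np.ones defaults to float64 unless a dtype is given.
mask64 = np.ones(2)
mask32 = np.ones(2, dtype="float32")

# Mixing float32 and float64 arrays promotes the result to float64.
print((scores * mask64).dtype)  # float64
print((scores * mask32).dtype)  # float32

# A value that fits in float64 but lies below the float32 range
# is flushed to zero when it is written into a float32 array.
tiny = np.array([1e-60])  # float64
out = np.zeros(1, dtype="float32")
out[:] = tiny
print(out[0])  # 0.0
```

Passing the dtype explicitly when creating the mask keeps the downstream values in float32.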
Paul O'Leary McCann
8452d117ef Fix typo, remove old comment 2021-06-13 19:42:55 +09:00
Paul O'Leary McCann
96be7e8858 Change topk to sort descending
Shouldn't change correctness but is a little clearer
2021-06-13 19:42:24 +09:00
Paul O'Leary McCann
d71198ed36 Replace squeeze with flatten
At a few points in the code it's normal to get a "2d" array where each
row is a single entry. Calling squeeze will make that a proper 1d
array... unless it's just one entry, in which case it turns into a 0d
scalar. That's not what we want; flatten() provides the desired
behavior.
2021-06-12 19:48:01 +09:00
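
The difference between squeeze and flatten described in d71198ed36 shows up in a short NumPy sketch. The arrays here are illustrative and are not taken from the coref code.

```python
import numpy as np

many = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): several single-entry rows
one = np.array([[1.0]])                 # shape (1, 1): just one entry

# squeeze drops every length-1 axis, so the single-entry case loses all axes.
print(many.squeeze().shape)  # (3,)
print(one.squeeze().shape)   # ()

# flatten always returns a 1d array, whatever the number of entries.
print(many.flatten().shape)  # (3,)
print(one.flatten().shape)   # (1,)
```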
Paul O'Leary McCann
e728b0e45d Silence warning 2021-06-12 19:31:35 +09:00
Paul O'Leary McCann
7efbc721a1 Don't use is_sentenced 2021-06-12 19:29:27 +09:00
Paul O'Leary McCann
67d9ebc922 Transpose before calculating loss 2021-06-04 17:56:08 +09:00
Paul O'Leary McCann
18444fccd9 Remove old comment 2021-06-04 17:56:08 +09:00
Paul O'Leary McCann
4a4ef72191 Clean up unused functions
`make_clean_doc` is not needed and was removed.

`logsumexp` may be needed if I misunderstood the loss calculation, so I
left it in for now with a note.
2021-06-02 21:42:23 +09:00
svlandeg
0aa1083ce8 avoid repetitive entities in the output 2021-05-28 16:52:51 +02:00
svlandeg
0d81bce9cc add failing test for too short a sentence 2021-05-28 15:10:35 +02:00
svlandeg
0f5c586e2f add basic tests for debugging 2021-05-28 14:19:55 +02:00
svlandeg
391b512afd fix types of fwd functions 2021-05-27 16:36:46 +02:00
svlandeg
04b55bf054 removing unused imports 2021-05-27 16:31:38 +02:00
svlandeg
910026582d set versions to v1 instead of v0 2021-05-27 16:17:20 +02:00
svlandeg
2e3c0e2256 delete outdated tests 2021-05-27 13:54:31 +02:00
svlandeg
ba2e491cc4 Merge remote-tracking branch 'upstream/master' into feature/coref 2021-05-27 13:50:32 +02:00
Sofie Van Landeghem
3c58c0323f fix docs (#8200) 2021-05-27 10:48:59 +02:00
Sofie Van Landeghem
290bd6ed39 ensure tolerance is properly passed on (#8158) 2021-05-27 18:10:28 +10:00
Paul O'Leary McCann
0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Paul O'Leary McCann
a484245f35 Remove references to coref_er 2021-05-24 19:08:45 +09:00
Paul O'Leary McCann
d6389b133d Don't use a generator for no reason 2021-05-24 19:06:15 +09:00
Paul O'Leary McCann
d6fd5fe1c0 Minor cleanup 2021-05-24 14:56:43 +09:00
Paul O'Leary McCann
0942a0b51b Remove coref_er.py
The intent of this was that it would be a component pipeline that used
entities as input, but that's now covered by the get_mentions function
as a pipeline arg.
2021-05-21 18:20:25 +09:00
Paul O'Leary McCann
f6652c9252 Add new coref scoring
This is closer to the traditional evaluation method. That uses an
average of three scores, this is just using the bcubed metric for now
(nothing special about bcubed, just picked one).

The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.

Besides being comparable with traditional evaluations, this scoring is
relatively fast.
2021-05-21 15:56:40 +09:00
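
For readers unfamiliar with the bcubed metric mentioned in f6652c9252, the following is a simplified sketch of how it is usually defined. It is not the coval implementation: the function name `b_cubed` is hypothetical, and it assumes the gold and predicted clusterings cover exactly the same mentions, which a real scorer cannot assume.

```python
from typing import Hashable, List, Set, Tuple


def b_cubed(
    key: List[Set[Hashable]], response: List[Set[Hashable]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for two clusterings of the same mentions."""
    # Look up the cluster that contains each mention.
    key_of = {m: cluster for cluster in key for m in cluster}
    resp_of = {m: cluster for cluster in response for m in cluster}
    mentions = list(key_of)

    # Per-mention overlap between its gold cluster and its predicted cluster.
    precision = sum(
        len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions
    ) / len(mentions)
    recall = sum(
        len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions
    ) / len(mentions)

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [{"a", "b", "c"}, {"d"}]
pred = [{"a", "b"}, {"c", "d"}]
print(b_cubed(gold, pred))
```

For this toy input the precision is 0.75 and the recall is 2/3.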
Paul O'Leary McCann
e1b4a85bb9 Fix loss
The loss was being returned as a single element array, which caused
training to die when it attempted to turn it into JSON.
2021-05-21 15:46:50 +09:00
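
The failure mode described in e1b4a85bb9 is easy to reproduce outside of training. This is a minimal sketch with a made-up loss value, assuming the loss is a NumPy array: the standard `json` module refuses to serialize an array, even one with a single element, while a plain Python float works.

```python
import json

import numpy as np

# Illustrative value; a single-element array standing in for the loss.
loss = np.array([0.25], dtype="float32")

try:
    json.dumps({"loss": loss})
except TypeError as err:
    # ndarray objects are not JSON serializable.
    print("cannot serialize:", err)

# Converting to a Python float first avoids the problem.
print(json.dumps({"loss": float(loss[0])}))  # {"loss": 0.25}
```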
Paul O'Leary McCann
ff3fed06cf Catch a stray reference 2021-05-20 21:30:46 +09:00
Sofie Van Landeghem
202943bc8c KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Paul O'Leary McCann
8c5df622d8 Help out python gc in coref backprop 2021-05-20 16:40:55 +09:00
Paul O'Leary McCann
fa92daf052 Break pairwise operations into pseudolayers
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
2021-05-20 15:59:51 +09:00
Adriane Boyd
f6128c06b0 Disable GPU CI tests (#8143) 2021-05-19 12:00:07 +02:00
Paul O'Leary McCann
d22acee4f7 Fix backprop
Training seems to actually run now!
2021-05-18 20:09:27 +09:00
Paul O'Leary McCann
2486b8ad4d Fix pipeline initialize 2021-05-18 19:56:27 +09:00
Paul O'Leary McCann
0620820857 Deal with generators in tuplify 2021-05-18 19:55:52 +09:00
Paul O'Leary McCann
a7d9c8156d Make get_sentence_map work with init
When sentences are not available, just treat the whole doc as one
sentence. A reasonable general fallback, but important due to the init
call, where upstream components aren't run.
2021-05-18 19:54:54 +09:00
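
The fallback described in a7d9c8156d can be sketched without spaCy. Everything below is hypothetical: the function `sentence_map` and its plain-list input are stand-ins for illustration, not the signature of the real `get_sentence_map`. It assumes the goal is to map each token to a sentence index, with `None` meaning that no sentence boundaries are available.

```python
from typing import List, Optional


def sentence_map(sent_starts: Optional[List[bool]], n_tokens: int) -> List[int]:
    """Map each token position to the index of the sentence it belongs to."""
    if sent_starts is None:
        # No sentence boundaries available (e.g. upstream components were
        # not run): treat the whole doc as a single sentence.
        return [0] * n_tokens

    out: List[int] = []
    sent_id = -1
    for is_start in sent_starts:
        # The first token always opens a sentence, flagged or not.
        if is_start or sent_id < 0:
            sent_id += 1
        out.append(sent_id)
    return out


print(sentence_map([True, False, True, False, False], 5))  # [0, 0, 1, 1, 1]
print(sentence_map(None, 3))  # [0, 0, 0]
```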
Paul O'Leary McCann
883c137b26 Add basic tuplify init 2021-05-18 19:53:59 +09:00
Paul O'Leary McCann
051715506e Fiddle with get_mentions definition
Ended up not making a difference, but oh well.
2021-05-18 19:53:33 +09:00
Adriane Boyd
06324e5a5e Update pydantic requirements (#8127)
Update pydantic requirements following
https://github.com/explosion/thinc/pull/499
2021-05-18 11:35:50 +02:00
Paul O'Leary McCann
a33d29441a Merge remote-tracking branch 'upstream/develop' into feature/coref 2021-05-18 17:00:17 +09:00
Adriane Boyd
6baab565eb Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Adriane Boyd
2c545c4c5b Fix offsets in Span.get_lca_matrix (#8116)
* Fix range in Span.get_lca_matrix

Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.

* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`

* Update test for v3.x
2021-05-17 16:54:23 +02:00
Sofie Van Landeghem
0dffc5d9e2 Custom warning if the doc_bin is too large (#8069)
* custom warning if the doc_bin is too large

* cleanup

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix numbering

* fixing numbering once more

* fixing this seems to be pretty hard

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-17 15:48:40 +02:00
Adriane Boyd
b120fb3511 Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
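
The remark in b120fb3511 about global functions being picklable reflects how the standard `pickle` module works: functions are stored by qualified name, so only functions that can be looked up at module level survive. The sketch below uses made-up function names and is not the test code from the commit.

```python
import pickle


def double(x):
    """Module-level function: pickle records it by its qualified name."""
    return 2 * x


def make_local():
    """Return a function defined inside another function."""

    def local_double(x):
        return 2 * x

    return local_double


def can_pickle(obj) -> bool:
    """Report whether pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# When run as a script, the module-level function pickles and the local one does not.
print("global:", can_pickle(double))
print("local:", can_pickle(make_local()))
```

Multiprocessing sends work to child processes through pickle, which is presumably why the test components are defined as global functions.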
Adriane Boyd
8a2602051c Update debug data for textcat (#8066)
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
2021-05-17 13:27:04 +02:00
Adriane Boyd
1d59fdbd39 Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00