Commit Graph

14517 Commits

Author SHA1 Message Date
Paul O'Leary McCann
e1b4a85bb9 Fix loss
The loss was being returned as a single element array, which caused
training to die when it attempted to turn it into JSON.
2021-05-21 15:46:50 +09:00
Paul O'Leary McCann
ff3fed06cf Catch a stray reference 2021-05-20 21:30:46 +09:00
Paul O'Leary McCann
8c5df622d8 Help out python gc in coref backprop 2021-05-20 16:40:55 +09:00
Paul O'Leary McCann
fa92daf052 Break pairwise operations into pseudolayers
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
2021-05-20 15:59:51 +09:00
Paul O'Leary McCann
d22acee4f7 Fix backprop
Training seems to actually run now!
2021-05-18 20:09:27 +09:00
Paul O'Leary McCann
2486b8ad4d Fix pipeline intialize 2021-05-18 19:56:27 +09:00
Paul O'Leary McCann
0620820857 Deal with generators in tuplify 2021-05-18 19:55:52 +09:00
Paul O'Leary McCann
a7d9c8156d Make get_sentence_map work with init
When sentences are not available, just treat the whole doc as one
sentence. A reasonable general fallback, but important due to the init
call, where upstream components aren't run.
2021-05-18 19:54:54 +09:00
Paul O'Leary McCann
883c137b26 Add basic tuplify init 2021-05-18 19:53:59 +09:00
Paul O'Leary McCann
051715506e Fiddle with get_mentions definition
Ended up not making a difference, but oh well.
2021-05-18 19:53:33 +09:00
Paul O'Leary McCann
a33d29441a Merge remote-tracking branch 'upstream/develop' into feature/coref 2021-05-18 17:00:17 +09:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Paul O'Leary McCann
e303628205 Attempt to use registry correctly 2021-05-17 14:52:48 +09:00
Paul O'Leary McCann
91b111467b Minor fixes 2021-05-17 14:52:30 +09:00
Paul O'Leary McCann
7c42a8c90a Migrate coref code
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.

In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
2021-05-15 21:36:10 +09:00
Paul O'Leary McCann
3608b7b3f9 Merge branch 'master' into feature/coref 2021-05-15 20:05:17 +09:00
Paul O'Leary McCann
2dc6db53fd
Merge pull request #8072 from medianeuroscience/master
Added eMFDscore to universe.json
2021-05-14 11:58:30 +09:00
Frederic R. Hopp
c5962b9fba
Update universe.json
fixed typo
2021-05-13 07:40:05 -07:00
Frederic R. Hopp
a9ca221e03
Update universe.json
Added more detailed description to eMFDscore project
2021-05-12 09:20:17 -07:00
Frederic R. Hopp
7bba9cdc14
Update universe.json 2021-05-11 19:18:19 -07:00
Adriane Boyd
d5bbd1f94f
Handle partial entities in Span.as_doc (#8055)
* Handle partial entities in Span.as_doc

In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.

Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.

* Check for empty span in ents conversion

Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
2021-05-11 17:10:16 +02:00
Ines Montani
3883d49446 Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:08 +10:00
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
Adriane Boyd
71c2a3ab47
Fix new version for match_alignments (#8021) 2021-05-07 09:55:20 +02:00
Jeno Pizarro
5cf76ab608
Update negspacy example code for spaCy 3.0 (#8022) 2021-05-07 09:33:21 +02:00
Adriane Boyd
6788d90f61
Preserve existing ENT_KB_ID annotation in NER (#7988)
* Preserve existing ENT_KB_ID annotation in NER

Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.

* Simplify kb_id assignment

* Simplify further
2021-05-06 18:49:55 +10:00
Sofie Van Landeghem
02a6a5fea0
Fix 'debug model' for transformers + generalize (#7973)
* add overrides to docs

* fix debug model with transformer

* assume training data is set in config
2021-05-06 18:43:32 +10:00
Adriane Boyd
cc5aeaed29
Add Chinese PTB tags to glossary (#7993) 2021-05-06 18:43:03 +10:00
Adriane Boyd
0a22fed634
Fix span offsets for Matcher(as_spans) on spans (#7992)
Fix returned span offsets for `Matcher(as_spans=True)(span)`.
2021-05-06 18:42:44 +10:00
Adriane Boyd
7d5db41ac3
Skip vector ngram backoff if minn is not set (#7925) 2021-05-06 18:34:35 +10:00
Sofie Van Landeghem
e9037d8fc0
make EntityLinker robust for nO=None (#7930) 2021-05-06 18:14:47 +10:00
Paul O'Leary McCann
66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
Adriane Boyd
a71194362f
Fix Docs.from_docs for all empty docs (#8009) 2021-05-05 18:44:14 +02:00
meghanabhange
debaab7021
Update details in universe denomme | Multilingual Name Detection (#7982)
* Add denomme

* spaCy contributor agreement

* Update install and thumb

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-05 17:12:13 +02:00
Adriane Boyd
31528f62ed
Add / to nb infixes (#7991) 2021-05-04 11:00:10 +02:00
Santiago Castro
e99ff6f255
Fix typo in Language docstrings (#7958) 2021-05-03 14:44:09 +02:00
Ines Montani
12d3d0fedd Fix quickstart default checked of conditional fields [ci skip] 2021-05-03 11:48:12 +10:00
Adriane Boyd
2320791f6d
Fix Transformer.initialize example (#7963) 2021-04-30 12:21:31 +02:00
Adriane Boyd
cf032ec31e
Update to catalogue>=2.0.4 (#7951) 2021-04-29 19:11:28 +02:00
Adriane Boyd
7cf5bd072f
Refactor util.to_ternary_int (#7944)
* Refactor to avoid literal comparison with `is`
* Extend tests
2021-04-29 16:58:54 +02:00
Sevdimali
49aed683cc
Azerbaijani language added (#7911) 2021-04-28 14:42:02 +02:00
Adriane Boyd
f4080983ea
Extend to cupy 9.0.0 (#7914) 2021-04-28 10:18:24 +02:00
Paul O'Leary McCann
8007d5c814
Check if the resume path points to a directory (#7919)
This came up in #7878, but if --resume-path is a directory then loading
the weights will fail. On Linux this will give a straightforward error
message, but on Windows it gives "Permission Denied", which is
confusing.
2021-04-28 09:17:15 +02:00
Paul O'Leary McCann
de6b5ed14d
Fix percent unk display in debug data (#7886)
* Fix percent unk display

This was showing (ratio %), so 10% would show as 0.10%. Fix by
multiplying ration by 100.

Might want to add a warning if this is over a threshold.

* Only show whole-integer percents
2021-04-27 09:16:35 +02:00
Janis Klaise
1690595e4d
Update load_lookups return type and docstring (#7907)
* Update load_lookups return type and docstring

* Add contributor agreement
2021-04-27 09:13:39 +02:00
Adriane Boyd
946a4284be Set spacy-legacy to >=3.0.5 (#7897)
Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.
2021-04-26 18:25:39 +02:00
Adriane Boyd
874cd02539
Set spacy-legacy to >=3.0.5 (#7897)
Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.
2021-04-26 17:06:32 +02:00
Adriane Boyd
ae855a4625
Clean up Morphology imports and definitions (#7441)
* Clean up Morphology imports and definitions

* Whitespace formatting
2021-04-26 16:54:23 +02:00
Adriane Boyd
ceee1ecf17
Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
Adriane Boyd
95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00