Commit Graph

118 Commits

Author SHA1 Message Date
Paul O'Leary McCann
d6fd5fe1c0 Minor cleanup 2021-05-24 14:56:43 +09:00
Paul O'Leary McCann
ff3fed06cf Catch a stray reference 2021-05-20 21:30:46 +09:00
Paul O'Leary McCann
8c5df622d8 Help out python gc in coref backprop 2021-05-20 16:40:55 +09:00
Paul O'Leary McCann
fa92daf052 Break pairwise operations into pseudolayers
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
2021-05-20 15:59:51 +09:00
Paul O'Leary McCann
0620820857 Deal with generators in tuplify 2021-05-18 19:55:52 +09:00
Paul O'Leary McCann
a7d9c8156d Make get_sentence_map work with init
When sentences are not available, just treat the whole doc as one
sentence. A reasonable general fallback, but important due to the init
call, where upstream components aren't run.
2021-05-18 19:54:54 +09:00
Paul O'Leary McCann
883c137b26 Add basic tuplify init 2021-05-18 19:53:59 +09:00
Paul O'Leary McCann
051715506e Fiddle with get_mentions definition
Ended up not making a difference, but oh well.
2021-05-18 19:53:33 +09:00
Paul O'Leary McCann
e303628205 Attempt to use registry correctly 2021-05-17 14:52:48 +09:00
Paul O'Leary McCann
91b111467b Minor fixes 2021-05-17 14:52:30 +09:00
Paul O'Leary McCann
7c42a8c90a Migrate coref code
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.

In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
2021-05-15 21:36:10 +09:00
Paul O'Leary McCann
3608b7b3f9 Merge branch 'master' into feature/coref 2021-05-15 20:05:17 +09:00
Sofie Van Landeghem
e9037d8fc0 make EntityLinker robust for nO=None (#7930) 2021-05-06 18:14:47 +10:00
Adriane Boyd
d2bdaa7823 Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
Sofie Van Landeghem
cd70c3cb79 Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
Sofie Van Landeghem
e0c45c669a Native coref component (#7243)
* initial coref_er pipe

* matcher more flexible

* base coref component without actual model

* initial setup of coref_er.score

* rename to include_label

* preliminary score_clusters method

* apply scoring in coref component

* IO fix

* return None loss for now

* rename to CoreferenceResolver

* some preliminary unit tests

* use registry as callable
2021-03-03 13:50:14 +01:00
svlandeg
d900c55061 consistently use registry as callable 2021-03-02 17:56:28 +01:00
René Octavio Queiroz Dias
59271e887a fix: TransformerListener with TextCatEnsemble (#6951)
* bug: Regression test
Issue #6946

* fix: Fix issue #6946

* chore: Remove regression test
2021-02-06 13:44:51 +01:00
Matthew Honnibal
ffc371350a Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800) 2021-01-24 14:37:33 +11:00
Sofie Van Landeghem
c8761b0e6e rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760) 2021-01-19 07:37:17 +08:00
Adriane Boyd
26c34ab8b0 Fix parser resizing for cupy (#6758) 2021-01-18 20:43:15 +01:00
Matthew Honnibal
c2a18e4fa3 Update textcat ensemble model 2021-01-19 02:53:02 +11:00
Ines Montani
a203e3dbb8 Support spacy-legacy via the registry 2021-01-15 21:42:40 +11:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
Sofie Van Landeghem
75d9019343 Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem
3983bc6b1e Fix Transformer width in TextCatEnsemble (#6431)
* add convenience method to determine tok2vec width in a model

* fix transformer tok2vec dimensions in TextCatEnsemble architecture

* init function should not be nested to avoid pickle issues
2021-01-06 12:44:04 +01:00
Ines Montani
991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Sofie Van Landeghem
282a3b49ea Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Sofie Van Landeghem
f98a04434a pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the model

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Sofie Van Landeghem
a0c899a0ff Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Sofie Van Landeghem
75a202ce65 TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
svlandeg
40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
svlandeg
08cb085f6c Merge remote-tracking branch 'upstream/develop' into fix/various 2020-10-09 17:01:27 +02:00
svlandeg
040c7c0541 fix get_dim calls in build_simple_cnn_text_classifier 2020-10-09 15:40:58 +02:00
svlandeg
853edace37 fix MultiHashEmbed example in documentation 2020-10-09 14:11:06 +02:00
Adriane Boyd
39aabf50ab Also rename to include_static_vectors in CharEmbed 2020-10-09 11:54:48 +02:00
Ines Montani
1a554bdcb1 Update docs and docstring [ci skip] 2020-10-05 21:55:27 +02:00
Ines Montani
9614e53b02 Tidy up and auto-format 2020-10-05 21:55:18 +02:00
Matthew Honnibal
e50047f1c5 Check lengths match 2020-10-05 20:02:45 +02:00
Matthew Honnibal
cdd2b79b6d Remove deprecated MultiHashEmbed 2020-10-05 19:58:18 +02:00
Matthew Honnibal
6dcc4a0ba6 Simplify MultiHashEmbed signature 2020-10-05 19:57:45 +02:00
Matthew Honnibal
eb9ba61517 Format 2020-10-05 15:29:49 +02:00
Matthew Honnibal
8ec79ad3fa Allow configuration of MultiHashEmbed features
Update arguments to MultiHashEmbed layer so that the attributes can be
controlled. A kind of tricky scheme is used to allow optional
specification of the rows. I think it's an okay balance between
flexibility and convenience.
2020-10-05 15:22:00 +02:00
Ines Montani
bcd52e5486 Tidy up errors and warnings 2020-10-04 11:16:31 +02:00
Ines Montani
3bc3c05fcc Tidy up and auto-format 2020-10-03 17:20:18 +02:00
svlandeg
02247cccaf Merge remote-tracking branch 'upstream/develop' into feature/small-fixes 2020-10-02 20:48:11 +02:00
Matthew Honnibal
6965cdf16d Fix comment 2020-10-02 17:26:21 +02:00
Matthew Honnibal
75a1569908 Merge 2020-10-01 23:07:53 +02:00
Matthew Honnibal
300e5a9928 Avoid relying on NORM in default v3 models (#6176)
* Allow CharacterEmbed to specify feature

* Default to LOWER in character embed

* Update tok2vec

* Use LOWER, not NORM
2020-10-01 23:05:55 +02:00
Matthew Honnibal
b854bca15c Default to LOWER in character embed 2020-10-01 22:17:58 +02:00
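
Note on a7d9c8156d (Make get_sentence_map work with init): the fallback described in that commit message, treating the whole doc as one sentence when sentence boundaries are unavailable, can be sketched roughly as below. This is a standalone illustration only, not the code from the commit. The name `sentence_map` and the list-of-flags input are assumptions made for the example, which uses plain Python lists in place of a spaCy `Doc`.

```python
from typing import List, Optional


def sentence_map(sent_starts: List[Optional[bool]]) -> List[int]:
    """Map each token index to the index of the sentence containing it.

    Hypothetical helper: sent_starts[i] is True if token i starts a
    sentence, False if it does not, and None if no sentence annotation
    is available for that token.
    """
    # No usable boundaries (e.g. upstream components were not run):
    # fall back to treating the whole doc as a single sentence.
    if not any(flag is True for flag in sent_starts):
        return [0] * len(sent_starts)

    mapping = []
    current = -1
    for i, flag in enumerate(sent_starts):
        # The first token always opens sentence 0, even if unflagged.
        if flag is True or i == 0:
            current += 1
        mapping.append(current)
    return mapping
```

With no annotation at all, every token lands in sentence 0, which is the situation the commit message describes for the init call.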