Commit Graph

3892 Commits

Author SHA1 Message Date
Matthew Honnibal
563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
absent from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set numeric
values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change; although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
Matthew Honnibal
40edb65ee7 Make test work for Python 2.7 2017-10-04 16:36:50 +02:00
Matthew Honnibal
bd8e84998a Add nO attribute to TextCategorizer model 2017-10-04 16:07:30 +02:00
Matthew Honnibal
f8a0614527 Improve textcat model slightly 2017-10-04 15:15:53 +02:00
Matthew Honnibal
39798b0172 Uncomment layernorm adjustment hack 2017-10-04 15:12:09 +02:00
Matthew Honnibal
b3a7082bf8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-04 14:56:46 +02:00
Matthew Honnibal
db05d4d582 Add test for #1380. Passes without fix? 2017-10-04 14:56:31 +02:00
Matthew Honnibal
774f5732bd Fix dimensionality of textcat when no vectors available 2017-10-04 14:55:15 +02:00
Ines Montani
28ba0b9b51 Merge pull request #1385 from explosion/feature/new-website
💫 New spaCy website
2017-10-04 14:35:52 +02:00
Matthew Honnibal
af75b74208 Unset LayerNorm backwards compat hack 2017-10-03 20:47:10 -05:00
ines
73ac0aa0b5 Update spacy evaluate and add displaCy option 2017-10-04 00:03:15 +02:00
Matthew Honnibal
f24c2e3a8a Fix evaluate for non-GPU 2017-10-03 22:47:31 +02:00
Matthew Honnibal
5cbefcba17 Set backwards compatibility flag 2017-10-03 20:29:58 +02:00
Matthew Honnibal
5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal
4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00
Matthew Honnibal
e514d6aa0a Import thinc modules more explicitly, to avoid cycles 2017-10-03 18:49:25 +02:00
Matthew Honnibal
338e1fda0e Unbreak merge artefact 2017-10-03 09:41:05 -05:00
Matthew Honnibal
1289187279 Fix circular import 2017-10-03 09:33:21 -05:00
Matthew Honnibal
a44c4c3a5b Add timer to evaluate 2017-10-03 09:15:35 -05:00
Matthew Honnibal
96da86b3e5 Add support for verbose flag to Language 2017-10-03 09:14:57 -05:00
Matthew Honnibal
02586a5243 Add timing to spacy evaluate command 2017-10-03 09:14:34 -05:00
ines
e49cd7aeaf Move import into load to avoid circular imports 2017-10-03 15:22:19 +02:00
ines
b0dfa059db Update docs link in about.py 2017-10-03 15:19:55 +02:00
Matthew Honnibal
8902df44de Fix component disabling during training 2017-10-02 21:07:23 +02:00
Matthew Honnibal
c617d288d8 Update pipeline component names in spaCy train 2017-10-02 17:20:19 +02:00
Matthew Honnibal
f942903429 Improve sentence merging in iob2json 2017-10-02 17:02:10 +02:00
Matthew Honnibal
31681d20e0 Fix concatenation in iob2json converter 2017-10-02 16:50:26 +02:00
Matthew Honnibal
4896ce3320 Remove misleading comment 2017-10-02 00:09:14 +02:00
Matthew Honnibal
d90cc917fa Merge vectors.pyx doc strings 2017-10-01 17:05:54 -05:00
Matthew Honnibal
b2a8b9be77 Fix inconsistency of Vectors class API 2017-10-01 17:00:34 -05:00
Matthew Honnibal
e38089d598 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-01 22:10:54 +02:00
Matthew Honnibal
97c409b602 Add docstrings for spacy.vectors 2017-10-01 22:10:33 +02:00
ines
b776f48e58 Fix typo 2017-10-01 21:58:45 +02:00
Matthew Honnibal
94df115a81 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-01 14:06:23 -05:00
Matthew Honnibal
2cf0f4622f Fix loading of models with pre-trained vectors 2017-10-01 14:05:32 -05:00
Matthew Honnibal
69c7c642c2 Add spacy evaluate 2017-10-01 14:05:04 -05:00
ines
8dbe49ecb8 Always compare lowercase package names
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-modu
le/46/6
2017-09-29 20:55:17 +02:00
ines
153c2589d4 Revert "Always compare lowercase package names"
This reverts commit 7d77dc490f.
2017-09-29 20:53:36 +02:00
ines
fd1a9225d8 Handle conversion of pipeline components correctly
Allow both comma and comma + whitespace as separators
2017-09-29 20:52:56 +02:00
ines
7d77dc490f Always compare lowercase package names
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-modu
le/46/6
2017-09-29 20:52:28 +02:00
Matthew Honnibal
cdb2d83e16 Pass dropout in parser 2017-09-28 18:47:13 -05:00
Matthew Honnibal
158e177cae Fix default embed size 2017-09-28 08:25:23 -05:00
Matthew Honnibal
f6330d69e6 Default embed size to 7000 2017-09-28 08:07:41 -05:00
Matthew Honnibal
ac8481a7b0 Print NER loss 2017-09-28 08:05:31 -05:00
Matthew Honnibal
542ebfa498 Improve defaults 2017-09-27 18:54:37 -05:00
Matthew Honnibal
dcb86bdc43 Default batch size to 32 2017-09-27 11:48:19 -05:00
Matthew Honnibal
1a37a2c0a0 Update training defaults 2017-09-27 11:48:07 -05:00
Matthew Honnibal
13d7a97f3a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-27 11:44:37 -05:00
Matthew Honnibal
66c388ee01 Remove unhelpful multitask objectives 2017-09-27 11:44:16 -05:00
Matthew Honnibal
983201a83a Fix hard-coded vector width 2017-09-27 11:43:58 -05:00