Commit Graph

6601 Commits

Author SHA1 Message Date
Matthew Honnibal
689349e32f Merge pull request #1400 from explosion/feature/sentence-parsing
💫 Force parser to respect preset sentence boundaries
2017-10-09 04:31:43 +02:00
Matthew Honnibal
e79fc41ff8 Merge pull request #1391 from explosion/feature/multilabel-textcat
💫 Fix multi-label support for text classification
2017-10-09 04:22:31 +02:00
Matthew Honnibal
6c79841c0d Fix tests for history features 2017-10-09 04:12:24 +02:00
Matthew Honnibal
81a64119db Fix string-to-unicode problem 2017-10-09 00:59:49 +02:00
Matthew Honnibal
02c2af7119 Fix test 2017-10-09 00:29:37 +02:00
Matthew Honnibal
4cc84b0234 Prohibit Break when sent_start < 0 2017-10-09 00:02:45 +02:00
Matthew Honnibal
5a67efeccc Add tests for sentence segmentation presetting 2017-10-09 00:02:23 +02:00
Matthew Honnibal
e938bce320 Adjust parsing transition system to allow preset sentence segments. 2017-10-08 23:53:34 +02:00
Matthew Honnibal
080afd4924 Add ternary value setting to Token.sent_start 2017-10-08 23:51:58 +02:00
Matthew Honnibal
7ae67ec6a1 Add Span.as_doc method 2017-10-08 23:50:20 +02:00
Matthew Honnibal
20309fb9db Make history features default to zero 2017-10-08 20:32:14 +02:00
Matthew Honnibal
e74c8d2fad Merge remote-tracking branch 'origin/develop' into feature/sentence-parsing 2017-10-08 20:20:41 +02:00
Matthew Honnibal
18063803de Make TokenC.sent_start an int, to allow ternary value 2017-10-08 19:58:54 +02:00
Matthew Honnibal
be4f0b6460 Update defaults 2017-10-08 02:08:12 -05:00
Matthew Honnibal
42b401d08b Change default hidden depth to 1 2017-10-07 21:05:21 -05:00
Matthew Honnibal
9d66a915da Update training defaults 2017-10-07 21:02:38 -05:00
Matthew Honnibal
d163115e91 Add non-linearity after history features 2017-10-07 21:00:43 -05:00
Matthew Honnibal
92c5d78b42 Unhack NER.add_action 2017-10-07 19:02:40 +02:00
Matthew Honnibal
f2b590f672 Increment version 2017-10-07 19:01:01 +02:00
Matthew Honnibal
eb0595bea9 Merge pull request #1392 from explosion/feature/parser-history-model
💫 Parser history features
2017-10-07 15:07:02 +02:00
ines
d70cf19158 Fix formatting 2017-10-07 15:06:38 +02:00
Ines Montani
36c68015f3 Merge pull request #1397 from explosion/feature/matcher-wildcard-token
💫 Allow empty dictionaries to match any token in Matcher
2017-10-07 15:05:24 +02:00
ines
c970b4f226 Add missing token attribute 2017-10-07 15:04:16 +02:00
ines
37f755897f Update rule-based matching docs 2017-10-07 15:04:09 +02:00
Matthew Honnibal
3d22ccf495 Update default hyper-parameters 2017-10-07 07:16:41 -05:00
Matthew Honnibal
e22067e3b5 Document new hyper-parameters 2017-10-07 07:10:10 -05:00
Matthew Honnibal
09442d25ec Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-07 07:05:04 -05:00
Matthew Honnibal
3b67eabfea Allow empty dictionaries to match any token in Matcher
Often patterns need to match "any token". A clean way to denote this
is with the empty dict {}: this sets no constraints on the token,
so should always match.

The problem was that a token spec with zero attributes (length==0) was
used as an end-of-array signal, so the matcher didn't handle this case
correctly.

This patch compiles empty token spec dicts into a constraint
NULL_ATTR==0. The NULL_ATTR attribute, 0, is always set to 0 on the
lexeme -- so this always matches.
2017-10-07 03:36:15 +02:00
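The empty-dict handling described in the commit message above can be illustrated with a small pure-Python sketch. This is not spaCy's Cython implementation: the helper names (`compile_token_spec`, `get_attr`, `match_at`) and the dict-based tokens are hypothetical, chosen only to show why compiling `{}` into a single `NULL_ATTR == 0` constraint keeps the compiled spec non-empty while still matching every token.

```python
# Illustrative model only; names and data layout are assumptions, not spaCy's API.
NULL_ATTR = 0  # attribute ID 0, whose value is always 0 on every token


def compile_token_spec(spec):
    """Compile one token spec dict into a non-empty list of (attr, value) pairs."""
    if not spec:
        # A zero-length constraint list would look like the end-of-array
        # signal, so emit one constraint that is trivially true instead.
        return [(NULL_ATTR, 0)]
    return list(spec.items())


def get_attr(token, attr):
    """Look up an attribute on a token; NULL_ATTR is always 0."""
    if attr == NULL_ATTR:
        return 0
    return token.get(attr)


def match_at(tokens, start, pattern):
    """Return True if the pattern matches the tokens beginning at `start`."""
    compiled = [compile_token_spec(spec) for spec in pattern]
    if start + len(compiled) > len(tokens):
        return False
    return all(
        get_attr(tokens[start + i], attr) == value
        for i, constraints in enumerate(compiled)
        for attr, value in constraints
    )


tokens = [{"LOWER": "hello"}, {"LOWER": ","}, {"LOWER": "world"}]
pattern = [{"LOWER": "hello"}, {}, {"LOWER": "world"}]  # {} matches any token
matched = match_at(tokens, 0, pattern)
```

In this sketch the wildcard needs no special case in the matching loop: it goes through the same comparison path as any other constraint.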
Matthew Honnibal
8be46d766e Remove print statement 2017-10-06 16:19:02 -05:00
ines
3468d535ad Update model benchmarks 2017-10-06 21:39:06 +02:00
Matthew Honnibal
8e731009fe Fix parser config serialization 2017-10-06 13:50:52 -05:00
Matthew Honnibal
f4c9a98166 Fix spacy evaluate command on non-GPU 2017-10-06 13:17:47 -05:00
Matthew Honnibal
16ba6aa8a6 Fix parser config serialization 2017-10-06 13:17:31 -05:00
ines
96a4e79d13 Fix PhraseMatcher example 2017-10-06 18:22:10 +02:00
Matthew Honnibal
c66399d8ae Fix depth definition with history features 2017-10-06 06:20:05 -05:00
Matthew Honnibal
5c750a9c2f Reserve 0 for 'missing' in history features 2017-10-06 06:10:13 -05:00
Matthew Honnibal
fbba7c517e Pass dropout through to embed tables 2017-10-06 06:09:18 -05:00
Matthew Honnibal
21d11936fe Fix significant train/test skew error in history feats 2017-10-06 06:08:50 -05:00
Matthew Honnibal
555d8c8bff Fix beam history features 2017-10-05 22:21:50 -05:00
Matthew Honnibal
3db0a32fd6 Fix dropout for history features 2017-10-05 22:21:30 -05:00
Matthew Honnibal
b0618def8d Add support for 2-token state option 2017-10-05 21:54:12 -05:00
Matthew Honnibal
363aa47b40 Clean up dead parsing code 2017-10-05 21:53:49 -05:00
Matthew Honnibal
ca12764772 Enable history features for beam parser 2017-10-05 21:53:29 -05:00
Matthew Honnibal
fc06b0a333 Fix training when hist_size==0 2017-10-05 21:52:28 -05:00
Matthew Honnibal
e25ffcb11f Move history size under feature flags 2017-10-05 19:38:13 -05:00
Matthew Honnibal
563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
absent from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set numeric
values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
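The gradient behaviour described in the commit message above can be sketched in plain Python. This is a minimal illustration, not the TextCategorizer code: the function name `textcat_gradient`, the label names, and the choice of a squared-error loss (whose gradient is `score - target`) are all assumptions. The one point taken from the commit message is that labels missing from the `cats` dict receive a zero gradient.

```python
# Illustrative sketch; the loss function and all names here are assumptions.
def textcat_gradient(scores, labels, cats):
    """Gradient of a squared-error loss with respect to each label's score.

    `cats` maps label -> float target (1.0 present, 0.0 absent, or a value
    in between for partial membership). Labels missing from `cats` are
    treated as unknown and contribute no gradient.
    """
    gradient = []
    for label, score in zip(labels, scores):
        if label in cats:
            gradient.append(score - cats[label])
        else:
            gradient.append(0.0)  # missing value: do not update this label
    return gradient


labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.9, 0.2, 0.6]                  # hypothetical model outputs
cats = {"SPORTS": 1.0, "POLITICS": 0.0}   # "TECH" is missing, not negative
gradient = textcat_gradient(scores, labels, cats)
```

Note the distinction the dict makes possible: `"POLITICS": 0.0` is an explicit negative label and does produce a gradient, whereas `"TECH"` is simply unannotated and is skipped.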
Matthew Honnibal
c36d4596bf Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-05 18:27:56 +02:00
Matthew Honnibal
056b08c0df Delete obsolete nn_text_class example 2017-10-05 18:27:10 +02:00
Matthew Honnibal
c6cd81f192 Wrap try/except around model saving 2017-10-05 08:14:24 -05:00
Matthew Honnibal
5743b06e36 Wrap model saving in try/except 2017-10-05 08:12:50 -05:00