spaCy/spacy
Matthew Honnibal 563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map labels to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
missing from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set other
numeric values if your text classification model uses an appropriate
loss function.
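
The masking idea described above can be sketched in plain Python. This is
an illustration only, not the code in pipeline.pyx: the function name
masked_gradient, the label names, and the use of a squared-error style
gradient (score minus target) are all assumptions made for the example.

```python
def masked_gradient(scores, cats, labels):
    """Return one gradient value per label (hypothetical helper).

    scores: predicted scores, one per entry in `labels`.
    cats:   dict mapping label -> float target (1.0 present, 0.0 absent,
            values in between for partial membership).
    Labels that do not appear in `cats` are treated as missing, so their
    gradient is zeroed and they do not influence the update.
    """
    grad = []
    for score, label in zip(scores, labels):
        if label in cats:
            # Annotated label: assumed squared-error gradient.
            grad.append(score - float(cats[label]))
        else:
            # Missing label: no annotation, so no gradient.
            grad.append(0.0)
    return grad


# Hypothetical label set and predictions for a single document.
labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.75, 0.5, 0.25]

# "TECH" is left out of the dict, i.e. its value is missing.
cats = {"SPORTS": 1.0, "POLITICS": 0.0}

grad = masked_gradient(scores, cats, labels)
print(grad)
```

Note the difference between a label mapped to 0. (annotated as absent,
so it still produces a gradient) and a label left out of the dict
(unannotated, so its gradient is zero).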

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
..
cli Update spacy evaluate and add displaCy option 2017-10-04 00:03:15 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Add workaround for displaCy server on Python 2/3 (resolves #1227) 2017-08-01 01:11:35 +02:00
lang Merge pull request #1365 from wannaphongcom/develop 2017-09-26 23:43:05 +02:00
syntax Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
tests Make test work for Python 2.7 2017-10-04 16:36:50 +02:00
tokens Fix parameter name in .pxd file 2017-09-26 07:28:50 -05:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Unbreak merge artefact 2017-10-03 09:41:05 -05:00
__main__.py Add spacy evaluate 2017-10-01 14:05:04 -05:00
_cfile.pxd Restore CFile loader 2017-08-18 20:46:16 +02:00
_cfile.pyx Restore CFile loader 2017-08-18 20:46:16 +02:00
_ml.py Add nO attribute to TextCategorizer model 2017-10-04 16:07:30 +02:00
about.py Update docs link in about.py 2017-10-03 15:19:55 +02:00
attrs.pxd Fix cpdef enum in attrs.pyx 2017-09-17 12:28:53 -05:00
attrs.pyx Fix cpdef enum in attrs.pyx 2017-09-17 12:28:53 -05:00
cfile.pxd Get spaCy train command working with neural network 2017-05-17 12:04:50 +02:00
cfile.pyx Get spaCy train command working with neural network 2017-05-17 12:04:50 +02:00
compat.py Don't escape forward slashes on ujson.dumps 2017-08-19 22:32:16 +02:00
deprecated.py Change python -m spacy to spacy 2017-08-14 13:04:48 +02:00
glossary.py Fix typos and commands in alpha docs 2017-08-21 13:40:11 +02:00
gold.pxd Add support for sent_start to GoldParse 2017-08-25 20:03:14 -05:00
gold.pyx Fix multi-label support for text classification 2017-10-05 18:43:02 -05:00
language.py Add support for verbose flag to Language 2017-10-03 09:14:57 -05:00
lemmatizer.py Remove print statement 2017-09-14 13:38:28 +02:00
lemmatizerlookup.py Adding unitest for tokenization in french (with title) 2017-04-27 11:53:44 +02:00
lexeme.pxd WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
lexeme.pyx Allow Lexeme.rank to be set 2017-08-24 21:43:00 +02:00
matcher.pyx Fix PhraseMatcher.__contains__ 2017-09-26 08:35:53 -05:00
morphology.pxd Fix loading of morphology exceptions 2017-06-04 16:34:32 -05:00
morphology.pyx Handle lemmatization for unknown string IDs 2017-09-24 05:01:31 -05:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
pipeline.pxd Data running through, likely errors in model 2017-05-06 14:22:20 +02:00
pipeline.pyx Fix multi-label support for text classification 2017-10-05 18:43:02 -05:00
scorer.py Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
strings.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
strings.pyx Prevent strings from being lost during from_disk and from_bytes 2017-08-19 22:42:17 +02:00
structs.pxd WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
symbols.pxd Fix code explosion from long enum in Python 3, Cython 0.24+ 2017-09-16 12:20:04 +02:00
symbols.pyx Fix code explosion from long enum in Python 3, Cython 0.24+ 2017-09-16 12:20:04 +02:00
tagger.pxd Add cfg field to Tagger 2016-10-17 01:03:41 +02:00
tagger.pyx Update docstrings and remove deprecated load classmethod 2017-05-21 13:27:52 +02:00
tokenizer.pxd Revert "Revert "Merge remote-tracking branch 'origin/master'"" 2017-01-09 13:28:13 +01:00
tokenizer.pyx Make sure serializers and deserializers are ordered 2017-06-03 17:05:09 +02:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
util.py Fix evaluate for non-GPU 2017-10-03 22:47:31 +02:00
vectors.pyx Merge vectors.pyx doc strings 2017-10-01 17:05:54 -05:00
vocab.pxd Work on vectors 2017-05-30 23:34:50 +02:00
vocab.pyx Fix typo 2017-10-01 21:58:45 +02:00