mirror of https://github.com/explosion/spaCy.git
synced 2025-10-24 12:41:23 +03:00
The TextCategorizer class is supposed to support multi-label text classification and to allow training data to contain missing values. For this to work, the gradient of the loss should be 0 when a label is missing. Instead, there was no way to denote "missing" in the GoldParse class, so the TextCategorizer class treated the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list. The GoldParse.cats dict should map labels to floats, with 1.0 denoting 'present' and 0.0 denoting 'absent'. Gradients are zeroed for categories that do not appear in the gold.cats dict. A nice bonus is that you can also set values between 0 and 1 for partial membership. You can also set other numeric values if you're using a text classification model with an appropriate loss function.

Unfortunately this is a breaking change, although the functionality was only recently introduced and hasn't been properly documented yet. I've updated the example script accordingly.
load_ner.py
train_ner_standalone.py
train_ner.py
train_new_entity_type.py
train_parser.py
train_tagger.py
train_textcat.py
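
The commit message above says that gradients are zeroed for categories missing from the gold.cats dict. The following is a minimal sketch of that idea in plain Python, not spaCy's actual implementation. The function name masked_gradient, the label names, and the choice of a squared-error loss are all assumptions made for illustration.

```python
def masked_gradient(scores, cats, labels):
    """Gradient of 0.5 * (score - truth)**2 with respect to each score.

    `cats` plays the role of a gold.cats-style dict mapping label -> float.
    Labels that are not keys of `cats` are treated as missing and get a
    gradient of exactly 0.0, so they do not influence training.
    (Hypothetical helper for illustration only.)
    """
    grad = []
    for label, score in zip(labels, scores):
        if label in cats:
            # Annotated label: 1.0 = present, 0.0 = absent,
            # values in between = partial membership.
            grad.append(score - float(cats[label]))
        else:
            # Missing annotation: no learning signal for this label.
            grad.append(0.0)
    return grad


# Hypothetical label set and model scores.
labels = ["POSITIVE", "NEGATIVE", "SPAM"]
scores = [0.75, 0.25, 0.5]

# "SPAM" is deliberately left out of the dict, i.e. it is missing, not absent.
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}

grad = masked_gradient(scores, cats, labels)
print(grad)
```

Note the difference between writing "SPAM": 0.0 in the dict, which would push the SPAM score towards zero, and omitting the key, which leaves the SPAM score untouched for this example.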