Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists
instead of single values, to handle getting the analysis for fused
tokens. For instance, let's say we have a token like "hows", while the
gold-standard has two tokens, ["how", "s"]. We need to store the gold
data for each of the two subtokens.
Example gold.words: [["how", "s"], "it", "going"]
Things get more complicated for heads, as we need to address particular
subtokens. Let's say the gold heads for ["how", "s", "it", "going"] are
[1, 1, 3, 1], i.e. the root "s" is a subtoken of the fused token. The
gold.heads list would be:
[[(0, 1), (0, 1)], 2, (0, 1)]
The tuples indicate token 0, subtoken 1. A helper method
_flatten_fused_heads is available that unpacks the above to
[1, 1, 3, 1].
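The unpacking can be sketched roughly as follows. This is an illustrative
reimplementation, not the actual _flatten_fused_heads helper, so details
may differ:

```python
def flatten_fused_heads(heads):
    # Compute the flattened start index of each top-level token.
    # A list entry stands for a fused token with multiple subtokens.
    offsets = []
    i = 0
    for head in heads:
        offsets.append(i)
        i += len(head) if isinstance(head, list) else 1

    def resolve(value):
        # (token, subtoken) tuples address a subtoken directly;
        # plain ints address a top-level token.
        if isinstance(value, tuple):
            token, subtoken = value
            return offsets[token] + subtoken
        return offsets[value]

    flat = []
    for head in heads:
        if isinstance(head, list):
            flat.extend(resolve(value) for value in head)
        else:
            flat.append(resolve(head))
    return flat


heads = flatten_fused_heads([[(0, 1), (0, 1)], 2, (0, 1)])
# heads is now [1, 1, 3, 1]
```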
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.
For this to work, the gradient of the loss should be 0 when labels
are missing. However, there was previously no way to denote "missing"
in the GoldParse class, so the TextCategorizer class treated
the label set within gold.cats as complete.
To fix this, GoldParse.cats is now a dict instead of a list.
The GoldParse.cats dict should map category names to floats, with 1.
denoting 'present' and 0. denoting 'absent'. Gradients are zeroed for
categories missing from the gold.cats dict entirely. A nice bonus is
that you can also set values between 0 and 1 for partial membership,
or arbitrary numeric values if you're using a text classification
model with an appropriate loss function.
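As a rough sketch of how the new format interacts with the loss, the
snippet below zeroes the gradient for a label missing from gold.cats.
The label names and the squared-error gradient here are illustrative,
not the actual TextCategorizer internals:

```python
import numpy

# Hypothetical gold.cats dict: values are floats, 1. = present,
# 0. = absent, intermediate values = partial membership.
gold_cats = {
    "POSITIVE": 1.0,   # present
    "NEGATIVE": 0.0,   # explicitly absent
    "SARCASM": 0.3,    # partial membership
    # "SPAM" is missing entirely, so its gradient must be zeroed
}

labels = ["POSITIVE", "NEGATIVE", "SARCASM", "SPAM"]
scores = numpy.asarray([0.9, 0.2, 0.5, 0.7])
truths = numpy.asarray([gold_cats.get(label, 0.0) for label in labels])
# Mask out categories that are missing (not merely 0.) in gold.cats.
mask = numpy.asarray([float(label in gold_cats) for label in labels])
d_scores = (scores - truths) * mask
# d_scores for "SPAM" is 0.0: no gradient flows for a missing label,
# while "NEGATIVE" (explicitly 0.) still contributes a gradient.
```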
Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.