This is necessary because one of the three old scoring methods relied on
scipy to solve a complex subproblem. LEA is generally a better metric for
evaluations.
The downside is that this means evaluations aren't comparable with many
papers, but canonical scoring can be supported using external eval
scripts or other methods.
Fake batching works by having the pipeline component call the model
repeatedly in an internal loop. It feels like this should break
something, but it worked in testing.
This also changes the signature of some of the pipeline functions,
though I don't think that will cause problems.
Tested with a batch size of 2, so more testing is needed, but this is a
start.
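The fake batching described above can be sketched roughly as follows. This is a minimal illustration, not the actual component code: `fake_batch_predict` and `toy_model` are hypothetical names, and the real pipeline functions have different signatures.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def fake_batch_predict(model_fn: Callable[[X], Y], batch: Sequence[X]) -> List[Y]:
    """Present a batch interface on top of a model that handles one item.

    The model is simply called once per item in a loop, and the
    per-item outputs are collected in input order.
    """
    return [model_fn(item) for item in batch]


def toy_model(tokens: List[str]) -> int:
    # Hypothetical stand-in for the real model: returns the token count.
    return len(tokens)


docs = [["She", "left"], ["It", "rained", "today"]]
predictions = fake_batch_predict(toy_model, docs)
```

The caller sees one prediction per input item, so code written against the batch interface keeps working, at the cost of no real speedup from batching.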
This calculates scores as an average of three metrics. As noted in the
code, these metrics all have issues, but we want to use them to match up
with prior work.
This should be replaced with a simpler default scorer, and the scorer
here should be moved to an external project, to be passed in only when
generating the traditional scores.
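As a rough sketch of the averaging step only: the combined score is the unweighted mean of the per-metric F1 values. The function name `average_f1` and the example numbers are made up for illustration; the three metrics themselves are not implemented here.

```python
from typing import Sequence


def average_f1(f1_scores: Sequence[float]) -> float:
    """Return the unweighted mean of per-metric F1 scores."""
    if not f1_scores:
        raise ValueError("need at least one score to average")
    return sum(f1_scores) / len(f1_scores)


# Made-up F1 values for three metrics, purely for illustration.
combined = average_f1([0.5, 0.7, 0.9])
```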
This rewrites the loss to not use the Thinc crossentropy code at all.
The main difference here is that the negative predictions are being
masked out (= marginalized over), but negative gradient is still being
reflected.
I'm still not sure this is exactly right, but models seem to train
reliably now.
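One possible reading of this loss, sketched in NumPy rather than Thinc: a marginal log-likelihood over antecedent candidates, where probability mass on non-gold candidates is masked out of the objective but those candidates still receive gradient through the softmax normalizer. This is an assumption about the intent, not the committed code; `marginal_antecedent_loss` and its argument layout are hypothetical.

```python
import numpy as np


def marginal_antecedent_loss(scores, gold_mask):
    """Negative marginal log-likelihood over gold antecedents.

    scores:    (n_mentions, n_candidates) raw scores.
    gold_mask: same shape, True where a candidate is a correct antecedent.
               Assumes every row has at least one True entry (for example
               a dummy "no antecedent" candidate).
    Returns (loss, gradient of the loss with respect to scores).
    """
    scores = np.asarray(scores, dtype=float)
    gold_mask = np.asarray(gold_mask, dtype=bool)

    # Numerically stable log-softmax over all candidates.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Mask out non-gold candidates, then sum (marginalize) over gold ones.
    gold_log_probs = np.where(gold_mask, log_probs, -np.inf)
    gmax = gold_log_probs.max(axis=1, keepdims=True)
    log_marginal = gmax + np.log(
        np.exp(gold_log_probs - gmax).sum(axis=1, keepdims=True)
    )
    loss = float(-log_marginal.sum())

    # Gradient: softmax over all candidates minus softmax over gold only.
    # Non-gold candidates keep a positive gradient that lowers their score.
    probs = np.exp(log_probs)
    gold_probs = np.exp(gold_log_probs - log_marginal)
    grad = probs - gold_probs
    return loss, grad


loss, grad = marginal_antecedent_loss([[0.0, 0.0]], [[True, False]])
```

When every candidate in a row is gold, both the loss and the gradient for that row are zero, which is the expected behavior for a marginalized objective.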
This is closer to the traditional evaluation method. The traditional
method uses an average of three scores; this just uses the bcubed metric
for now (nothing special about bcubed, it was just the one picked).
The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.
Besides being comparable with traditional evaluations, this scoring is
relatively fast.
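For reference, the bcubed metric itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the coval implementation: the function name `b_cubed` is hypothetical, mentions are represented by arbitrary hashable ids, and each mention is assumed to belong to at most one cluster per side.

```python
from typing import FrozenSet, Dict, Hashable, Iterable, Tuple


def b_cubed(
    gold: Iterable[Iterable[Hashable]], pred: Iterable[Iterable[Hashable]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for two clusterings of mentions."""
    # Map each mention to the cluster that contains it.
    gold_of: Dict[Hashable, FrozenSet] = {
        m: frozenset(c) for c in map(list, gold) for m in c
    }
    pred_of: Dict[Hashable, FrozenSet] = {
        m: frozenset(c) for c in map(list, pred) for m in c
    }

    def directed(src: Dict[Hashable, FrozenSet], other: Dict[Hashable, FrozenSet]) -> float:
        # Average, over mentions in src, of the fraction of the mention's
        # own cluster that is shared with its cluster on the other side.
        if not src:
            return 0.0
        total = 0.0
        for mention, cluster in src.items():
            overlap = cluster & other.get(mention, frozenset())
            total += len(overlap) / len(cluster)
        return total / len(src)

    precision = directed(pred_of, gold_of)
    recall = directed(gold_of, pred_of)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold_clusters = [{1, 2, 3}, {4, 5}]
pred_clusters = [{1, 2}, {3, 4, 5}]
precision, recall, f1 = b_cubed(gold_clusters, pred_clusters)
```

A mention that appears on only one side contributes zero overlap, so missing or spurious mentions lower recall or precision respectively.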
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.
In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
* initial coref_er pipe
* matcher more flexible
* base coref component without actual model
* initial setup of coref_er.score
* rename to include_label
* preliminary score_clusters method
* apply scoring in coref component
* IO fix
* return None loss for now
* rename to CoreferenceResolver
* some preliminary unit tests
* use registry as callable