The call here was creating a float64 array, which was turning many
downstream scores into float64s. Later on these values were assigned to
a float32 array in backprop, and numerical underflow turned them into
zeros.
That's almost certainly not the only reason values go to zero, but the
float64 array is incorrect regardless.
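A minimal sketch of the underflow described above, assuming NumPy arrays. The value `1e-50` and the names `score64` and `grad32` are made up for illustration and are not taken from the actual code:

```python
import numpy as np

# A tiny score that is representable in float64 (hypothetical value).
score64 = np.array([1e-50], dtype="float64")

# Destination buffer in float32, standing in for a gradient array.
grad32 = np.zeros((1,), dtype="float32")

# The assignment casts silently. 1e-50 is smaller than the smallest
# positive float32 value, so the stored result is exactly zero.
grad32[:] = score64

print(score64[0] > 0)
print(grad32[0] == 0)
```

Creating the upstream array as float32 in the first place keeps the whole computation in one dtype, so nothing changes silently at assignment time.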
At a few points in the code it's normal to get a "2d" array where each
row is a single entry. Calling squeeze will make that a proper 1d
array... unless it's just one entry, in which case it turns into a 0d
scalar. That's not what we want; flatten() provides the desired
behavior.
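The difference can be seen on plain NumPy arrays (the array names are only for illustration):

```python
import numpy as np

many = np.zeros((3, 1))  # three rows, one entry each
one = np.zeros((1, 1))   # a single row with a single entry

print(many.squeeze().shape)  # (3,)  a proper 1d array
print(one.squeeze().shape)   # ()    collapses to a 0d scalar
print(one.flatten().shape)   # (1,)  stays 1d
```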
`make_clean_doc` is not needed and was removed.
`logsumexp` may be needed if I misunderstood the loss calculation, so I
left it in for now with a note.
The intent was for this to be a pipeline component that used entities
as input, but that's now covered by the `get_mentions` function as a
pipeline arg.
This is closer to the traditional evaluation method. That method
averages three scores; this just uses the bcubed metric for now
(nothing special about bcubed, it was just the one picked).
The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.
Besides being comparable with traditional evaluations, this scoring is
relatively fast.
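For reference, a rough sketch of what the bcubed metric computes. This is not the coval implementation: it is a plain-Python version that assumes the gold and predicted clusterings cover exactly the same mentions, and the name `b_cubed` is made up here:

```python
def _cluster_of(clusters):
    # Map each mention to the (frozen) cluster that contains it.
    return {m: frozenset(c) for c in clusters for m in c}


def b_cubed(key, response):
    """Return (precision, recall, f1) averaged over mentions.

    Assumes both clusterings contain exactly the same mentions.
    """
    key_of = _cluster_of(key)
    resp_of = _cluster_of(response)
    mentions = list(key_of)
    precision = recall = 0.0
    for m in mentions:
        overlap = len(key_of[m] & resp_of[m])
        precision += overlap / len(resp_of[m])
        recall += overlap / len(key_of[m])
    precision /= len(mentions)
    recall /= len(mentions)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [{"a", "b", "c"}, {"d"}]
pred = [{"a", "b"}, {"c", "d"}]
p, r, f = b_cubed(gold, pred)
print(p, r, f)
```

Handling mentions that appear in only one of the two clusterings is where real implementations get more involved; that case is deliberately left out here.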
* unit test for pickling KB
* add pickling test for NEL
* KB to_bytes and from_bytes
* NEL to_bytes and from_bytes
* xfail pickle tests for now
* fix docs
* cleanup
When sentences are not available, just treat the whole doc as one
sentence. This is a reasonable general fallback, and it matters in
particular for the init call, where upstream components aren't run.
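The fallback could look roughly like this. The helper name `get_sentences` is hypothetical; the sketch assumes a spaCy v3 `Doc`, which provides `has_annotation("SENT_START")`, `sents`, and slicing:

```python
def get_sentences(doc):
    """Return the doc's sentences, or the whole doc as one sentence.

    Assumes a spaCy v3 `Doc`-like object (hypothetical helper).
    """
    if doc.has_annotation("SENT_START"):
        return list(doc.sents)
    # No sentence boundaries were set, e.g. because no parser or
    # sentencizer ran before this component: use one span for the doc.
    return [doc[:]]
```

Checking for the annotation first avoids the error that `doc.sents` raises when sentence boundaries are unset.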
* Minor updates to quickstart settings/instructions
* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required
* Switch to warning above quickstart
* Undo changes to textcat default in quickstart
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix range in Span.get_lca_matrix
Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.
* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`
* Update test for v3.x
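To illustrate the index adjustment, here is a plain-Python sketch of an LCA matrix for a span. It is not the Cython `_get_lca_matrix` code; the function name, the `heads` list representation (absolute head index per token, with the root pointing to itself), and the `-1` value for an ancestor outside the span are assumptions made for this example:

```python
def span_lca_matrix(heads, start, end):
    """LCA matrix for tokens start..end-1, indexed relative to `start`."""

    def ancestors(i):
        # The token itself, followed by its chain of heads up to the root.
        chain = [i]
        while heads[i] != i:
            i = heads[i]
            chain.append(i)
        return chain

    n = end - start
    chains = {i: ancestors(i) for i in range(start, end)}
    matrix = [[-1] * n for _ in range(n)]
    # Both j and k run over absolute token indices; the matrix itself is
    # indexed with `start` mapped to 0.
    for j in range(start, end):
        for k in range(start, end):
            seen = set(chains[k])
            lca = next((a for a in chains[j] if a in seen), -1)
            if start <= lca < end:
                matrix[j - start][k - start] = lca - start
    return matrix


# Token 1 is the root; 0 and 2 attach to 1; 3 attaches to 2.
heads = [1, 1, 1, 2]
print(span_lca_matrix(heads, 2, 4))
```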
* custom warning if the doc_bin is too large
* cleanup
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* fix numbering
* fixing numbering once more
* fixing this seems to be pretty hard
* Handle errors while multiprocessing
Handle errors while multiprocessing without hanging.
* Return the traceback for errors raised while processing a batch, which
can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
errors and skip documents
* Define custom components at a higher level
* Also move up custom error handler
* Use simpler component for test
* Switch error type
* Adjust test
* Only call top-level error handler for exceptions
* Register custom test components within tests
Use global functions (so they can be pickled) but register the
components only within the individual tests.
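The general pattern can be sketched as follows. This is not the spaCy implementation and it runs in a single process for simplicity; `process_batch`, `collect`, `shout`, and the handler signature are made-up names for illustration:

```python
import traceback


def shout(text):
    # Defined at module level so that it could be pickled and sent to a
    # worker process (a local function or lambda could not be).
    return text.upper()


def process_batch(texts, func):
    """Worker side: return (results, None) or (None, traceback string)."""
    try:
        return [func(t) for t in texts], None
    except Exception:
        # Return the formatted traceback instead of raising, so the
        # parent never waits forever on a worker that died.
        return None, traceback.format_exc()


def collect(batches, func, error_handler):
    """Parent side: yield results and route failures to the handler."""
    for texts in batches:
        results, error = process_batch(texts, func)
        if error is not None:
            error_handler(texts, error)
            continue  # the handler ignored the error; skip these docs
        yield from results


errors = []
results = list(
    collect(
        [["a", "b"], [None], ["c"]],
        shout,
        lambda texts, tb: errors.append((texts, tb)),
    )
)
print(results)
print(len(errors))
```

Because a handler may choose to ignore an error, the consumer has to accept fewer results than inputs for a batch.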
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as with Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.