When sentences are not available, treat the whole doc as one sentence.
This is a reasonable general fallback, and it matters in particular for
the init call, where upstream components aren't run.
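
A minimal sketch of the fallback described above, in plain Python rather
than the actual spaCy code. The function name `sentence_spans` and the
`sent_starts` argument are hypothetical and only illustrate the idea.

```python
from typing import List, Optional, Tuple


def sentence_spans(
    n_tokens: int, sent_starts: Optional[List[bool]]
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for each sentence.

    `sent_starts` is None when no sentence boundaries are available,
    e.g. because no upstream component has set them (assumption).
    """
    if n_tokens == 0:
        return []
    if sent_starts is None:
        # Fallback: treat the whole doc as a single sentence.
        return [(0, n_tokens)]
    # The first token always opens a sentence.
    starts = [i for i, is_start in enumerate(sent_starts) if is_start or i == 0]
    ends = starts[1:] + [n_tokens]
    return list(zip(starts, ends))
```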
* Minor updates to quickstart settings/instructions
* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required
* Switch to warning above quickstart
* Undo changes to textcat default in quickstart
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix range in Span.get_lca_matrix
Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.
* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`
* Update test for v3.x
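
A rough illustration of the index adjustment for spans, not the actual
Cython implementation. It assumes `heads` holds doc-level head indices
with the root pointing at itself, and it uses `-1` when the lowest common
ancestor falls outside the span. The name `span_lca_matrix` is
hypothetical.

```python
from typing import List


def span_lca_matrix(heads: List[int], start: int, end: int) -> List[List[int]]:
    """LCA matrix for tokens start..end-1, using span-relative indices."""
    n = end - start
    matrix = [[-1] * n for _ in range(n)]

    def ancestors(i: int) -> List[int]:
        # Path from token i up to the root, inclusive, in doc indices.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    # Both j and k run over the adjusted range 0..n-1, so they index
    # `matrix` directly; `start` is added back only for doc-level lookups.
    for j in range(n):
        path_j = ancestors(start + j)
        for k in range(j, n):
            on_path_k = set(ancestors(start + k))
            lca = next((a for a in path_j if a in on_path_k), -1)
            # Store the span-relative index, or -1 if the LCA is outside the span.
            rel = lca - start if start <= lca < end else -1
            matrix[j][k] = matrix[k][j] = rel
    return matrix
```

For example, with `heads = [1, 1, 3, 1]` ("I like New York") and the span
covering tokens 2 and 3, the result is `[[0, 1], [1, 1]]`.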
* custom warning if the doc_bin is too large
* cleanup
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* fix numbering
* fixing numbering once more
* fixing this seems to be pretty hard
* Handle errors while multiprocessing
Handle errors while multiprocessing without hanging.
* Return the traceback for errors raised while processing a batch, which
can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
errors and skip documents
* Define custom components at a higher level
* Also move up custom error handler
* Use simpler component for test
* Switch error type
* Adjust test
* Only call top-level error handler for exceptions
* Register custom test components within tests
Use global functions (so they can be pickled) but register the
components only within the individual tests.
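
The error-handling pattern above could be sketched roughly as follows.
This is a single-process illustration under stated assumptions, not the
spaCy implementation: `process_batch` stands in for the code that would
run in a worker process, and `collect`, `func` and `error_handler` are
hypothetical names.

```python
import traceback
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def process_batch(
    texts: List[str], func: Callable[[str], str]
) -> Tuple[Optional[List[str]], Optional[str]]:
    """Worker side: return (results, None) or (None, formatted traceback)."""
    try:
        return [func(text) for text in texts], None
    except Exception:
        # A formatted traceback is a plain string, so it can be sent back
        # to the parent process instead of leaving the parent waiting.
        return None, traceback.format_exc()


def collect(
    batches: Iterable[List[str]],
    func: Callable[[str], str],
    error_handler: Callable[[List[str], str], None],
) -> Iterator[str]:
    """Parent side: yield results and pass worker errors to the handler."""
    for texts in batches:
        results, error = process_batch(texts, func)
        if error is not None:
            # Only call the top-level handler for actual exceptions.
            # The handler may raise, or log and skip the batch.
            error_handler(texts, error)
            continue
        # Never assume len(results) == len(texts): the output may be shorter.
        yield from results
```

Because the consumer never relies on a fixed batch length, a handler that
ignores errors simply produces fewer outputs than inputs.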
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
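
A hedged sketch of the label checks above, outside of spaCy. It assumes
that "unsupported" means any cats value other than 0 or 1, and the
function name `check_cats` is hypothetical.

```python
import warnings
from typing import Dict, Iterable, Set


def check_cats(
    train_cats: Iterable[Dict[str, float]],
    dev_cats: Iterable[Dict[str, float]],
) -> None:
    """Validate cats values and warn if train/dev label sets differ."""
    train_labels: Set[str] = set()
    dev_labels: Set[str] = set()
    for all_cats, labels in ((train_cats, train_labels), (dev_cats, dev_labels)):
        for doc_cats in all_cats:
            for label, value in doc_cats.items():
                # Assumption: only 0 and 1 are supported values.
                if value not in (0, 1):
                    raise ValueError(f"Unsupported cats value for {label!r}: {value}")
                labels.add(label)
    # Only report labels when the two sets are mismatched.
    if train_labels != dev_labels:
        warnings.warn(
            "Labels differ between train and dev: "
            f"only in train: {sorted(train_labels - dev_labels)}, "
            f"only in dev: {sorted(dev_labels - train_labels)}"
        )
```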
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.
In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
* Handle partial entities in Span.as_doc
In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.
Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.
* Check for empty span in ents conversion
Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
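
To illustrate the partial-entity handling, here is a simplified sketch
that works on a list of IOB tag strings rather than on spaCy's internal
token attributes. It assumes valid IOB input and uses an empty string to
stand for missing annotation. The names `span_iob` and `MISSING` are
hypothetical.

```python
from typing import List

MISSING = ""  # stand-in for "missing entity annotation" in this sketch


def span_iob(tags: List[str], start: int, end: int) -> List[str]:
    """Return IOB tags for tokens start..end-1 with partial entities removed."""
    out = list(tags[start:end])
    # Leading partial entity: the span opens with I- tags, so the entity's
    # B tag lies before the span start.
    i = 0
    while i < len(out) and out[i].startswith("I-"):
        out[i] = MISSING
        i += 1
    # Trailing partial entity: the token right after the span continues it.
    if end < len(tags) and tags[end].startswith("I-"):
        j = len(out) - 1
        while j >= 0 and out[j].startswith("I-"):
            out[j] = MISSING
            j -= 1
        if j >= 0 and out[j].startswith("B-"):
            out[j] = MISSING
    return out
```

With `tags = ["O", "B-GPE", "I-GPE", "I-GPE", "O"]`, the span from 2 to 5
yields `["", "", "O"]` instead of an entity without an initial `B`, and an
empty span yields an empty list.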
* Preserve existing ENT_KB_ID annotation in NER
Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.
* Simplify kb_id assignment
* Simplify further
* Fix pretraining objectives fragment
The fragment here is reused from a heading higher up, so you couldn't
link to this section.
* Fix section link to new fragment
This came up in #7878: if `--resume-path` is a directory, loading the
weights will fail. On Linux this gives a straightforward error message,
but on Windows it gives "Permission Denied", which is confusing.
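
One way to guard against this is to check the path before trying to read
it. The sketch below is an assumption about what such a check could look
like, not the actual change. The function name `check_resume_path` and
the error messages are hypothetical.

```python
from pathlib import Path


def check_resume_path(resume_path: Path) -> Path:
    """Fail early with a clear message if the path cannot hold weights."""
    if not resume_path.exists():
        raise FileNotFoundError(f"Resume path does not exist: {resume_path}")
    if resume_path.is_dir():
        # Checking explicitly avoids a platform-specific error such as
        # "Permission Denied" when a directory is opened as a file.
        raise IsADirectoryError(
            f"Resume path is a directory, expected a weights file: {resume_path}"
        )
    return resume_path
```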
* Fix percent unk display
This was showing (ratio %), so 10% would show as 0.10%. Fix by
multiplying the ratio by 100.
Might want to add a warning if this is over a threshold.
* Only show whole-integer percents
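
The display fix amounts to scaling the ratio before formatting. A minimal
sketch, with `format_percent` as a hypothetical helper name:

```python
def format_percent(ratio: float) -> str:
    """Format a ratio in [0, 1] as a whole-integer percentage."""
    # Multiply by 100 first; formatting the raw ratio would print
    # 0.10 for what should be displayed as 10%.
    return f"{ratio * 100:.0f}%"


print(format_percent(0.10))  # prints "10%"
```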