* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.
* Fix test
* Fix training test
* unit test for pickling KB
* add pickling test for NEL
* KB to_bytes and from_bytes
* NEL to_bytes and from_bytes
* xfail pickle tests for now
* fix docs
* cleanup
* Fix range in Span.get_lca_matrix
Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.
* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`
* Update test for v3.x
* custom warning if the doc_bin is too large
* cleanup
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* fix numbering
* fixing numbering once more
* fixing this seems to be pretty hard
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Handle errors while multiprocessing
Handle errors while multiprocessing without hanging.
* Return the traceback for errors raised while processing a batch, which
can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
errors and skip documents
* Define custom components at a higher level
* Also move up custom error handler
* Use simpler component for test
* Switch error type
* Adjust test
* Only call top-level error handler for exceptions
* Register custom test components within tests
Use global functions (so they can be pickled) but register the
components only within the individual tests.
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
* Handle partial entities in Span.as_doc
In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.
Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.
* Check for empty span in ents conversion
Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
* Preserve existing ENT_KB_ID annotation in NER
Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.
* Simplify kb_id assignment
* Simplify further
This came up in #7878, but if --resume-path is a directory then loading
the weights will fail. On Linux this will give a straightforward error
message, but on Windows it gives "Permission Denied", which is
confusing.
* Fix percent unk display
This was showing (ratio %), so 10% would show as 0.10%. Fix by
multiplying ration by 100.
Might want to add a warning if this is over a threshold.
* Only show whole-integer percents
* Set up CI for tests with GPU agent
* Update tests for enabled GPU
* Fix steps filename
* Add parallel build jobs as a setting
* Fix test requirements
* Fix install test requirements condition
* Fix pipeline models test
* Reset current ops in prefer/require testing
* Fix more tests
* Remove separate test_models test
* Fix regression 5551
* fix StaticVectors for GPU use
* fix vocab tests
* Fix regression test 5082
* Move azure steps to .github and reenable default pool jobs
* Consolidate/rename azure steps
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Add callback to copy vocab/tokenizer from model
Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.
* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks
* Add documentation
* Modify to specify model as tokenizer and vocab params
* Update sent_starts in Example.from_dict
Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.
Use `Optional[bool]` as the type for sent start values in the docs.
* Use helper function for conversion to ternary ints
* Fix tokenizer cache flushing
Fix/simplify tokenizer init detection in order to fix cache flushing
when properties are modified.
* Remove init reloading logic
* Remove logic disabling `_reload_special_cases` on init
* Setting `rules` last in `__init__` (as before) means that setting
other properties doesn't reload any special cases
* Reset `rules` first in `from_bytes` so that setting other properties
during deserialization doesn't reload any special cases
unnecessarily
* Reset all properties in `Tokenizer.from_bytes` to allow any settings
to be `None`
* Also reset special matcher when special cache is flushed
* Remove duplicate special case validation
* Add test for special cases flushing
* Extend test for tokenizer deserialization of None values
* Replace negative rows with 0 in StaticVectors
Replace negative row indices with 0-vectors in `StaticVectors`.
* Increase versions related to StaticVectors
* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations
Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5
* Update config defaults to new versions
* Update docs
* Update Tokenizer.explain with special matches
Update `Tokenizer.explain` and the pseudo-code in the docs to include
the processing of special cases that contain affixes or whitespace.
* Handle optional settings in explain
* Add test for special matches in explain
Add test for `Tokenizer.explain` for special cases containing affixes.
* Set catalogue lower pin to v2.0.2
* Update importlib-metadata pins to match
* Require catalogue v2.0.3
Switch to vendored `importlib-metadata` v3.2.0 provided by `catalogue`.
* ensure vectors data is stored on right device
* ensure the added vector is on the right device
* move vector to numpy before iterating
* move best_rows to numpy before iterating
* Terminology: deprecated vs obsolete
Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.
In light of this, perhaps all other error codes should be checked as well.
* clarify that the link command is removed and not just deprecated
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Update debug data further for v3
* Remove new/existing label distinction (new labels are not immediately
distinguishable because the pipeline is already initialized)
* Warn on missing labels in training data for all components except parser
* Separate textcat and textcat_multilabel sections
* Add section for morphologizer
* Reword missing label warnings
* Make vocab update in get_docs deterministic
The attribute `DocBin.strings` is a set. In `DocBin.get_docs`
a given vocab is updated by iterating over this set.
Iteration over a python set produces an arbitrary ordering,
therefore vocab is updated non-deterministically.
When training (fine-tuning) a spacy model, the base model's
vocabulary will be updated with the new vocabulary in the
training data in exactly the way described above. After
serialization, the file `model/vocab/strings.json` will
be sorted in an arbitrary way. This prevents reproducible
model training.
* Revert "Make vocab update in get_docs deterministic"
This reverts commit d6b87a2f55.
* Sort strings in StringStore serialization
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>