spaCy/spacy/tokens
single-fingal 6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
..
__init__.pxd * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
__init__.py Fix SpanGroup import (#7182) 2021-02-24 21:06:16 +11:00
_dict_proxies.py Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707) 2022-06-02 15:56:27 +02:00
_retokenize.pyi 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
_retokenize.pyx Fix tensor retokenization for non-numpy ops (#7527) 2021-03-29 22:34:48 +11:00
_serialize.py Maintain support for empty DocBin span groups (#10538) 2022-03-24 11:51:07 +01:00
doc.pxd Set as_tuples on Doc during processing (#9592) 2021-11-02 15:08:22 +01:00
doc.pyi Add Doc.from_json() (#10688) 2022-06-02 14:03:47 +02:00
doc.pyx Fix schemas import in Doc (#10898) 2022-06-02 15:53:03 +02:00
graph.pxd Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
graph.pyx Refactor error messages to remove hardcoded strings (#10729) 2022-05-02 13:38:46 +02:00
morphanalysis.pxd Modify morphology to support arbitrary features (#4932) 2020-01-23 22:01:54 +01:00
morphanalysis.pyi 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
morphanalysis.pyx Minor refactor for Morphology and MorphAnalysis (#5804) 2020-07-24 09:28:06 +02:00
span_group.pxd Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
span_group.pyi Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707) 2022-06-02 15:56:27 +02:00
span_group.pyx Fix a few minor bugs in the SpanGroup API web docs (#10650) 2022-04-14 09:59:48 +02:00
span.pxd Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
span.pyi Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
span.pyx Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
token.pxd cleanup 2021-01-13 14:20:05 +01:00
token.pyi 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
token.pyx Token sent attributes more consistent (#10164) 2022-02-08 08:35:37 +01:00
underscore.py Update typing hints (#10109) 2022-01-28 16:59:54 +01:00