spaCy/spacy
single-fingal 6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
..
cli Simplify GPU check 2022-05-25 14:06:45 +02:00
displacy #10672: fixes displacy output for manual unsorted entities (#10673) 2022-04-27 09:51:58 +02:00
lang Auto-format code with black (#10687) 2022-04-22 11:24:53 +02:00
matcher Better errors for has_annotation and Matcher (#10830) 2022-05-25 11:12:29 +02:00
ml Minor NEL type fixes (#10860) 2022-06-01 00:41:28 +02:00
pipeline Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
tests Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707) 2022-06-02 15:56:27 +02:00
tokens Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707) 2022-06-02 15:56:27 +02:00
training fix typo + CI slow testing (#10835) 2022-06-02 00:10:16 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Tidy up and auto-format 2021-07-18 15:44:56 +10:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Set version to v3.3.0 (#10614) 2022-04-28 13:07:49 +02:00
attrs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
attrs.pyx Intify IOB (#9738) 2022-01-20 13:19:38 +01:00
compat.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
default_config_pretraining.cfg Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
default_config.cfg Add a few docs to the default_config.cfg (#9981) 2022-01-05 09:16:40 +01:00
errors.py Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707) 2022-06-02 15:56:27 +02:00
glossary.py Add glossary entry for root (#10821) 2022-05-20 09:56:32 +02:00
kb.pxd Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
kb.pyx Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
language.py Ignore overrides for pipe names in config argument (#10779) 2022-05-12 11:46:08 +02:00
lexeme.pxd Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
lexeme.pyi fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
lexeme.pyx Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
lookups.py Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786) 2022-05-25 09:33:54 +02:00
morphology.pxd Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
morphology.pyx Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
pipe_analysis.py 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
py.typed Add py.typed 2021-03-16 09:48:31 +01:00
schemas.py Add Doc.from_json() (#10688) 2022-06-02 14:03:47 +02:00
scorer.py Alignment: use a simplified ragged type for performance (#10319) 2022-04-01 09:02:06 +02:00
strings.pxd Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
strings.pyi Fix StringStore.__getitem__ return type depending on parameter types (#10741) 2022-05-03 17:57:07 +02:00
strings.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
structs.pxd Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
symbols.pxd introduce token.has_head and refer to MISSING_DEP_ (WIP) 2021-01-12 17:17:06 +01:00
symbols.pyx introduce token.has_head and refer to MISSING_DEP_ (WIP) 2021-01-12 17:17:06 +01:00
tokenizer.pxd Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
tokenizer.pyx Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
ty.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
typedefs.pxd Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master 2020-11-25 11:49:34 +01:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
vectors.pyx Save vectors as little endian, load with Ops.asarray (#10201) 2022-03-21 14:24:46 +01:00
vocab.pxd Add support for floret vectors (#8909) 2021-10-27 14:08:31 +02:00
vocab.pyi Add vector deduplication (#10551) 2022-03-30 08:54:23 +02:00
vocab.pyx Add vector deduplication (#10551) 2022-03-30 08:54:23 +02:00