spaCy/spacy/tests/serialize
Stanislav Schmidt 2516896849
Make vocab update in get_docs deterministic (#7603)
* Make vocab update in get_docs deterministic

The attribute `DocBin.strings` is a set. In `DocBin.get_docs`,
a given vocab is updated by iterating over this set.
Iterating over a Python set yields its elements in an arbitrary
order, so the vocab is updated non-deterministically.

When training (fine-tuning) a spaCy model, the base model's
vocabulary is updated with the new vocabulary from the
training data in exactly the way described above. After
serialization, the entries in `model/vocab/strings.json`
therefore appear in an arbitrary order. This prevents
reproducible model training.

* Revert "Make vocab update in get_docs deterministic"

This reverts commit d6b87a2f55.

* Sort strings in StringStore serialization
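
A minimal sketch of why sorting at serialization time gives reproducible output. It uses a plain list as a toy stand-in for a string store and does not reproduce spaCy's actual `StringStore` code; the helper names `add_all`, `serialize_unsorted` and `serialize_sorted` are hypothetical and exist only for this illustration.

```python
import json


def add_all(store, strings):
    """Append unseen strings to an insertion-ordered list (toy string store)."""
    for s in strings:
        if s not in store:
            store.append(s)
    return store


def serialize_unsorted(store):
    # Output order follows insertion order, so it depends on iteration order.
    return json.dumps(store)


def serialize_sorted(store):
    # Sorting first makes the output independent of insertion order.
    return json.dumps(sorted(store))


new_strings = ["zebra", "apple", "mango"]

# Two insertion orders, standing in for two runs that iterate the same set
# in different orders.
store_a = add_all(["base"], new_strings)
store_b = add_all(["base"], reversed(new_strings))

print(serialize_unsorted(store_a) == serialize_unsorted(store_b))  # False
print(serialize_sorted(store_a) == serialize_sorted(store_b))      # True
```

Sorting at the serialization step, rather than at every place that adds strings, means that any code path that updates the store still produces the same file.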

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-09 11:53:13 +02:00
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
test_resource_warning.py Tidy up tests 2020-10-15 10:20:21 +02:00
test_serialize_config.py Fixing pretrain (#7342) 2021-03-09 14:01:13 +11:00
test_serialize_doc.py Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
test_serialize_extension_attrs.py Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
test_serialize_kb.py consistently use registry as callable 2021-03-02 17:56:28 +01:00
test_serialize_language.py Remove dead and/or deprecated code (#5710) 2020-07-06 13:06:25 +02:00
test_serialize_pipeline.py multi-label textcat component (#6474) 2021-01-06 13:07:14 +11:00
test_serialize_tokenizer.py Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
test_serialize_vocab_strings.py Make vocab update in get_docs deterministic (#7603) 2021-04-09 11:53:13 +02:00