Make vocab update in get_docs deterministic

The attribute `DocBin.strings` is a set. In `DocBin.get_docs`,
the given vocab is updated by iterating over this set.
Iterating over a Python set yields its elements in an arbitrary
order, so the vocab is updated non-deterministically.

When training (fine-tuning) a spaCy model, the base model's
vocabulary is updated with the new vocabulary from the
training data in exactly the way described above. After
serialization, the entries in `model/vocab/strings.json`
appear in an arbitrary order, which prevents reproducible
model training. Iterating over `sorted(self.strings)` instead
makes the update order deterministic.
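
The effect can be illustrated with a minimal sketch. `ToyStringStore` and `update_store` are hypothetical stand-ins invented for this example, not spaCy APIs; the sketch only assumes that the store records strings in the order they are first seen, which is what makes the iteration order matter.

```python
class ToyStringStore:
    """Hypothetical stand-in for a string store that remembers insertion order."""

    def __init__(self):
        self._index = {}

    def add(self, string):
        # Assign a position only the first time a string is seen.
        self._index.setdefault(string, len(self._index))

    def to_list(self):
        # dicts preserve insertion order, so this reflects the order of add() calls.
        return list(self._index)


def update_store(store, strings):
    # Sorting first makes the insertion order independent of how the
    # set happens to iterate in this particular process.
    for string in sorted(strings):
        store.add(string)


strings = {"banana", "apple", "cherry"}

store = ToyStringStore()
update_store(store, strings)
print(store.to_list())  # ['apple', 'banana', 'cherry']
```

Without the `sorted` call, the resulting order would follow the set's iteration order, which for strings can differ between interpreter runs because of hash randomization.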
Stanislav Schmidt 2021-03-29 15:24:39 +02:00
parent 3ae8661085
commit d6b87a2f55

@@ -124,7 +124,7 @@ class DocBin:
         DOCS: https://spacy.io/api/docbin#get_docs
         """
-        for string in self.strings:
+        for string in sorted(self.strings):
             vocab[string]
         orth_col = self.attrs.index(ORTH)
         for i in range(len(self.tokens)):