Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 09:26:27 +03:00
Make vocab update in get_docs deterministic
The attribute `DocBin.strings` is a set, and `DocBin.get_docs` updates the given vocab by iterating over it. Iterating over a Python set yields an arbitrary order, so the vocab is updated non-deterministically. When training (fine-tuning) a spaCy model, the base model's vocabulary is updated with the new vocabulary from the training data in exactly this way. After serialization, the entries in `model/vocab/strings.json` therefore appear in an arbitrary order, which prevents reproducible model training.
parent 3ae8661085
commit d6b87a2f55
@@ -124,7 +124,7 @@ class DocBin:
 
         DOCS: https://spacy.io/api/docbin#get_docs
         """
-        for string in self.strings:
+        for string in sorted(self.strings):
             vocab[string]
         orth_col = self.attrs.index(ORTH)
         for i in range(len(self.tokens)):
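The effect of the change can be sketched without spaCy. The snippet below is a minimal illustration, not spaCy's implementation: `ToyStringStore` and `update_store` are hypothetical names standing in for a vocab whose string table remembers insertion order. It assumes only that insertion order is what ends up serialized, as the commit message describes.

```python
# Hypothetical stand-in for a vocab's string table (not spaCy API).
# It remembers insertion order, so the order in which strings are
# added decides the order in which they would be written out.
class ToyStringStore:
    def __init__(self):
        self._keys = {}

    def add(self, string):
        # dicts preserve insertion order; re-adding a string is a no-op
        self._keys.setdefault(string, len(self._keys))

    def to_list(self):
        return list(self._keys)


def update_store(store, strings, deterministic=True):
    # Iterating a set of str directly depends on string hashing, which
    # Python randomizes per process unless PYTHONHASHSEED is fixed.
    # sorted() yields the same order on every run.
    iterable = sorted(strings) if deterministic else strings
    for string in iterable:
        store.add(string)
    return store


strings = {"token", "apple", "zebra", "model"}
result = update_store(ToyStringStore(), strings).to_list()
print(result)
```

With `deterministic=False` the printed order may differ between interpreter runs. With the sorted iteration it depends only on the contents of the set.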