p By default, spaCy loads a #[code data/vocab/vec.bin] file, where the #[em data] directory is within the #[code spacy.en] module directory. You can replace this file to customize the word vectors that spaCy loads. You can also replace the word vectors at run-time.
p The function #[code spacy.vocab.write_binary_vectors] creates a word vectors file in spaCy's binary data format. It expects a #[code bz2] file in the following format:
+code.
word_key1 0.92 0.45 -0.9 0.0
word_key2 0.3 0.1 0.6 0.3
...
p That is, each line is a single entry. Each entry consists of a key string, followed by a sequence of floats. Each entry should have the same number of floats.
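p To make the format concrete, here is a minimal sketch of a reader that parses such a file and checks that every entry has the same number of floats. The function name #[code read_vectors] is hypothetical and not part of spaCy, and the sketch assumes fields are separated by single spaces and the file is UTF-8 encoded.

```python
import bz2


def read_vectors(bz2_loc):
    """Parse a bz2-compressed vectors file into a dict of key -> list of floats.

    Hypothetical helper, not part of spaCy. Assumes one entry per line,
    fields separated by single spaces, and UTF-8 encoding.
    """
    vectors = {}
    width = None
    with bz2.open(bz2_loc, "rt", encoding="utf8") as file_:
        for line_no, line in enumerate(file_, 1):
            line = line.rstrip()
            if not line:
                continue  # ignore blank lines
            pieces = line.split(" ")
            key = pieces[0]
            values = [float(piece) for piece in pieces[1:]]
            if width is None:
                width = len(values)  # the first entry fixes the vector width
            elif len(values) != width:
                raise ValueError(
                    "line %d: expected %d floats, found %d"
                    % (line_no, width, len(values))
                )
            vectors[key] = values
    return vectors
```

p Running a check like this before converting a file may catch truncated or malformed entries early.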
p The following example script will replace the #[code vec.bin] file with vectors read from a #[code bz2] archive:
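p A minimal sketch of such a script follows. It assumes that #[code write_binary_vectors] takes the input #[code bz2] path followed by the output path, which the text above does not spell out. The helper names #[code default_vec_path] and #[code replace_vectors] are hypothetical.

```python
import os.path


def default_vec_path(module_dir):
    """Return the vec.bin location below a language module directory."""
    return os.path.join(module_dir, "data", "vocab", "vec.bin")


def replace_vectors(bz2_loc, bin_loc=None):
    """Overwrite vec.bin with vectors read from a bz2 archive."""
    # Imported lazily so that the path helper works without spaCy installed.
    import spacy.en
    from spacy.vocab import write_binary_vectors

    if bin_loc is None:
        # Default to the file that spaCy loads for English.
        bin_loc = default_vec_path(os.path.dirname(spacy.en.__file__))
    # Assumption: arguments are (input bz2 path, output binary path).
    write_binary_vectors(bz2_loc, bin_loc)
    return bin_loc
```

p Calling #[code replace_vectors("vectors.bz2")] would then overwrite the default file, so it may be worth backing up the original #[code vec.bin] first.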
p Since v0.93, you can assign to the #[code .vector] attribute of #[code Lexeme] instances. Tokens of that lexical type will then inherit the updated vector. For instance:
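p The sketch below shows one way this could look. The helper #[code set_word_vector] is hypothetical. The #[code demo] function is not called here and assumes that spaCy's English data and numpy are installed, and that the new vector should match the shape and dtype of the old one.

```python
def set_word_vector(vocab, word, vector):
    """Assign a new vector to the lexeme for `word` and return the lexeme."""
    lexeme = vocab[word]    # look up the Lexeme for this lexical type
    lexeme.vector = vector  # assignment to .vector is the feature described above
    return lexeme


def demo():
    # Hypothetical usage, not executed here: assumes spaCy v0.93 or later
    # with the English data installed, and numpy available.
    import numpy
    from spacy.en import English

    nlp = English()
    old_vector = nlp.vocab[u"apples"].vector
    # Assumption: the replacement matches the old vector's shape and dtype.
    new_vector = numpy.ones(old_vector.shape, dtype=old_vector.dtype)
    set_word_vector(nlp.vocab, u"apples", new_vector)
    doc = nlp(u"apples and oranges")
    return doc[0].vector  # the token for "apples"
```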
p All tokens which have the #[code orth] attribute #[em apples] will inherit the updated vector.
p Note that updated vectors are not saved when the process exits. To keep them, write them out yourself and then replace the #[code vec.bin] file as described above.
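p One possible way to persist vectors yourself is to write them back out in the text format described earlier, ready to be converted into a new #[code vec.bin]. This is only a sketch: #[code write_vectors] is a hypothetical helper, and it assumes you have already collected the vectors into a mapping from key strings to sequences of floats.

```python
import bz2


def write_vectors(vectors, bz2_loc):
    """Write a mapping of key -> floats as one 'key f1 f2 ...' line per entry.

    Hypothetical helper, not part of spaCy. Keys must not contain spaces.
    """
    with bz2.open(bz2_loc, "wt", encoding="utf8") as file_:
        for key, values in vectors.items():
            # float() also accepts numpy scalars; repr() keeps full precision.
            floats = " ".join(repr(float(value)) for value in values)
            file_.write("%s %s\n" % (key, floats))
```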
p A popular source of word vectors is the #[a(href="http://nlp.stanford.edu/projects/glove/" target="_blank") GloVe project], particularly the vectors trained on the #[a(href="https://commoncrawl.org/" target="_blank") Common Crawl]. Note that the provided vector file has a few entries which are not valid UTF-8 strings. These should be filtered out.
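p The filtering step could be sketched as follows. It assumes the downloaded vectors have been unpacked to a plain text file with one entry per line. The name #[code filter_valid_utf8] is hypothetical. Entries that fail to decode are dropped, and the rest are written to a #[code bz2] file.

```python
import bz2


def filter_valid_utf8(in_loc, out_loc):
    """Copy entries to a bz2 file, skipping lines that are not valid UTF-8.

    Hypothetical helper. Returns the number of entries that were skipped.
    """
    skipped = 0
    with open(in_loc, "rb") as in_file, bz2.open(out_loc, "wb") as out_file:
        for raw_line in in_file:
            try:
                raw_line.decode("utf8")  # only checks validity
            except UnicodeDecodeError:
                skipped += 1
                continue
            out_file.write(raw_line)  # keep the original bytes unchanged
    return skipped
```

p Returning the number of skipped entries makes it easy to confirm that only a handful of lines were removed.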
p Future versions of spaCy will allow you to provide a file-like object, instead of the location of a #[code bz2] file.