//- 💫 DOCS > USAGE > VECTORS & SIMILARITY > CUSTOM VECTORS

p
    | Word vectors let you import knowledge from raw text into your model.
    | The knowledge is represented as a table of numbers, with one row per
    | term in your vocabulary. If two terms are used in similar contexts,
    | the algorithm that learns the vectors should assign them
    | #[strong rows that are quite similar], while words that are used in
    | different contexts will have quite different values. This lets you
    | use the row of values assigned to each word as a kind of dictionary
    | entry, telling you something about what the words in your text mean.

p
    | Word vectors are particularly useful for terms which
    | #[strong aren't well represented in your labelled training data].
    | For instance, if you're doing named entity recognition, there will
    | always be lots of names that you don't have examples of. Imagine
    | your training data happens to contain some examples of the term
    | "Microsoft", but it doesn't contain any examples of the term
    | "Symantec". In your raw text sample, there are plenty of examples of
    | both terms, and they're used in similar contexts. The word vectors
    | make that fact available to the entity recognition model. It still
    | won't see examples of "Symantec" labelled as a company. However,
    | it'll see that "Symantec" has a word vector that usually corresponds
    | to company terms, so it can #[strong make the inference].

p
    | In order to make best use of the word vectors, you want the word
    | vectors table to cover a #[strong very large vocabulary]. However,
    | most words are rare, so most of the rows in a large word vectors
    | table will be accessed very rarely, or never at all. You can usually
    | cover more than #[strong 95% of the tokens] in your corpus with just
    | #[strong a few thousand rows] in the vector table. However, it's
    | those #[strong 5% of rare terms] where the word vectors are
    | #[strong most useful]. The problem is that increasing the size of
    | the vector table produces rapidly diminishing returns in coverage
    | over these rare terms.

+h(3, "converting") Converting word vectors for use in spaCy
    +tag-new("2.0.10")

p
    | Custom word vectors can be trained using a number of open-source
    | libraries, such as #[+a("https://radimrehurek.com/gensim") Gensim],
    | #[+a("https://fasttext.cc") fastText], or Tomas Mikolov's original
    | #[+a("https://code.google.com/archive/p/word2vec/") word2vec implementation].
    | Most word vector libraries output an easy-to-read text-based format,
    | where each line consists of the word followed by its vector. For
    | everyday use, we want to convert the vectors model into a binary
    | format that loads faster and takes up less space on disk. The
    | easiest way to do this is the
    | #[+api("cli#init-model") #[code init-model]] command-line utility:

+code(false, "bash").
    wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
    python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz

p
    | This will output a spaCy model in the directory
    | #[code /tmp/la_vectors_wiki_lg], giving you access to some nice
    | Latin vectors 😉 You can then pass the directory path to
    | #[+api("spacy#load") #[code spacy.load()]].

+code.
    nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
    doc1 = nlp_latin(u"Caecilius est in horto")
    doc2 = nlp_latin(u"servus est in atrio")
    doc1.similarity(doc2)

p
    | The model directory will have a #[code /vocab] directory with the
    | strings, lexical entries and word vectors from the input vectors
    | model.
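p
    | As a quick sanity check — a minimal sketch, assuming the
    | #[code nlp_latin] model converted above — you can inspect the
    | imported table directly: its shape gives you one row per term and
    | one column per dimension, and each token reports whether a vector
    | row exists for it.

+code.
    import spacy

    nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
    # one row per term in the vocabulary, one column per dimension
    print(nlp_latin.vocab.vectors.shape)

    doc = nlp_latin(u"Caecilius est in horto")
    for token in doc:
        # has_vector tells you whether a row exists for this token
        print(token.text, token.has_vector, token.vector_norm)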
p
    | The #[+api("cli#init-model") #[code init-model]] command supports a
    | number of archive formats for the word vectors: the vectors can be
    | in plain text (#[code .txt]), zipped (#[code .zip]), or tarred and
    | zipped (#[code .tgz]).

+h(3, "custom-vectors-coverage") Optimising vector coverage
    +tag-new(2)

p
    | To help you strike a good balance between coverage and memory
    | usage, spaCy's #[+api("vectors") #[code Vectors]] class lets you map
    | #[strong multiple keys] to the #[strong same row] of the table. If
    | you're using the #[+api("cli#vocab") #[code spacy vocab]] command to
    | create a vocabulary, pruning the vectors will be taken care of
    | automatically. You can also do it manually in the following steps:

+list("numbers")
    +item
        | Start with a #[strong word vectors model] that covers a huge
        | vocabulary. For instance, the
        | #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]]
        | model provides 300-dimensional GloVe vectors for over 1 million
        | terms of English.

    +item
        | If your vocabulary has values set for the #[code Lexeme.prob]
        | attribute, the lexemes will be sorted by descending probability
        | to determine which vectors to prune. Otherwise, lexemes will be
        | sorted by their order in the #[code Vocab].

    +item
        | Call #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]]
        | with the number of vectors you want to keep.

+code.
    nlp = spacy.load('en_vectors_web_lg')
    n_vectors = 105000  # number of vectors to keep
    removed_words = nlp.vocab.prune_vectors(n_vectors)

    assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
    assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries

p
    | #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]] reduces
    | the current vector table to a given number of unique entries, and
    | returns a dictionary containing the removed words, mapped to
    | #[code (string, score)] tuples, where #[code string] is the entry
    | the removed word was mapped to, and #[code score] the similarity
    | score between the two words.

+code("Removed words").
    {
        'Shore': ('coast', 0.732257),
        'Precautionary': ('caution', 0.490973),
        'hopelessness': ('sadness', 0.742366),
        'Continous': ('continuous', 0.732549),
        'Disemboweled': ('corpse', 0.499432),
        'biostatistician': ('scientist', 0.339724),
        'somewheres': ('somewheres', 0.402736),
        'observing': ('observe', 0.823096),
        'Leaving': ('leaving', 1.0)
    }

p
    | In the example above, the vector for "Shore" was removed and
    | remapped to the vector of "coast", which is deemed about 73%
    | similar. "Leaving" was remapped to the vector of "leaving", which
    | is identical.

p
    | If you're using the #[+api("cli#init-model") #[code init-model]]
    | command, you can set the #[code --prune-vectors] option to easily
    | reduce the size of the vectors as you add them to a spaCy model:

+code(false, "bash", "$").
    python -m spacy init-model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000

p
    | This will create a spaCy model with vectors for the first 10,000
    | words in the vectors model. All other words in the vectors model
    | are mapped to the closest vector among those retained.

+h(3, "custom-vectors-add") Adding vectors
    +tag-new(2)

p
    | spaCy's new #[+api("vectors") #[code Vectors]] class greatly
    | improves the way word vectors are stored, accessed and used. The
    | data is stored in two structures:

+list
    +item
        | An array, which can be either on CPU or #[+a("#gpu") GPU].

    +item
        | A dictionary mapping string-hashes to rows in the table.
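p
    | To make those two structures concrete, here's a minimal sketch that
    | builds a small standalone #[code Vectors] table. The three-row,
    | 300-dimensional table and the example keys are assumptions for
    | illustration:

+code.
    import numpy
    from spacy.vectors import Vectors

    # the table: one row per vector, one column per dimension
    data = numpy.zeros((3, 300), dtype='f')
    keys = [u'cat', u'dog', u'rat']
    vectors = Vectors(data=data, keys=keys)

    # the mapping: string-hashes to rows in the table
    print(vectors.key2row)
    print(vectors.shape)  # (3, 300)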
p
    | Keep in mind that the #[code Vectors] class itself has no
    | #[+api("stringstore") #[code StringStore]], so you have to store
    | the hash-to-string mapping separately. If you need to manage the
    | strings, you should use the #[code Vectors] via the
    | #[+api("vocab") #[code Vocab]] class, e.g. #[code vocab.vectors].
    | To add vectors to the vocabulary, you can use the
    | #[+api("vocab#set_vector") #[code Vocab.set_vector]] method.

+code("Adding vectors").
    import numpy
    from spacy.vocab import Vocab

    vector_data = {u'dog': numpy.random.uniform(-1, 1, (300,)),
                   u'cat': numpy.random.uniform(-1, 1, (300,)),
                   u'orange': numpy.random.uniform(-1, 1, (300,))}

    vocab = Vocab()
    for word, vector in vector_data.items():
        vocab.set_vector(word, vector)

+h(3, "custom-loading-glove") Loading GloVe vectors
    +tag-new(2)

p
    | spaCy comes with built-in support for loading
    | #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors
    | from a directory. The
    | #[+api("vectors#from_glove") #[code Vectors.from_glove]] method
    | assumes a binary format, the vocabulary provided in a
    | #[code vocab.txt] file, and the naming scheme
    | #[code vectors.{size}.[fd].bin]. For example:

+aside-code("Directory structure", "yaml").
    └── vectors
        ├── vectors.128.f.bin  # vectors file
        └── vocab.txt          # vocabulary

+table(["File name", "Dimensions", "Data type"])
    +row
        +cell #[code vectors.128.f.bin]
        +cell 128
        +cell float32

    +row
        +cell #[code vectors.300.d.bin]
        +cell 300
        +cell float64 (double)

+code.
    nlp = spacy.load('en')
    nlp.vocab.vectors.from_glove('/path/to/vectors')

p
    | If your instance of #[code Language] already contains vectors, they
    | will be overwritten. To create your own GloVe vectors model package
    | like spaCy's
    | #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]],
    | you can call #[+api("language#to_disk") #[code nlp.to_disk]], and
    | then package the model using the
    | #[+api("cli#package") #[code package]] command.

+h(3, "custom-similarity") Using custom similarity methods

p
    | By default, #[+api("token#vector") #[code Token.vector]] returns
    | the vector for its underlying #[+api("lexeme") #[code Lexeme]],
    | while #[+api("doc#vector") #[code Doc.vector]] and
    | #[+api("span#vector") #[code Span.vector]] return an average of the
    | vectors of their tokens. You can customise these behaviours by
    | modifying the #[code doc.user_hooks], #[code doc.user_span_hooks]
    | and #[code doc.user_token_hooks] dictionaries.

+infobox
    | For more details on #[strong adding hooks] and
    | #[strong overwriting] the built-in #[code Doc], #[code Span] and
    | #[code Token] methods, see the usage guide on
    | #[+a("/usage/processing-pipelines#user-hooks") user hooks].
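p
    | As an illustration, here's a minimal sketch of a pipeline component
    | that installs a custom similarity function via these hook
    | dictionaries. The #[code cosine] helper, the component function and
    | the choice to compare only the first 100 dimensions are all
    | hypothetical, not part of spaCy's API:

+code.
    import numpy
    import spacy

    def cosine(vec1, vec2):
        # plain cosine similarity between two vectors
        norm = numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2)
        return numpy.dot(vec1, vec2) / norm if norm else 0.0

    def custom_similarity(obj1, obj2):
        # hypothetical custom behaviour: only compare the first
        # 100 dimensions of the vectors
        return cosine(obj1.vector[:100], obj2.vector[:100])

    def add_similarity_hooks(doc):
        # install the hook for Doc, Span and Token similarity
        doc.user_hooks['similarity'] = custom_similarity
        doc.user_span_hooks['similarity'] = custom_similarity
        doc.user_token_hooks['similarity'] = custom_similarity
        return doc

    nlp = spacy.load('en_vectors_web_lg')
    nlp.add_pipe(add_similarity_hooks, first=True)
    doc1 = nlp(u"apple")
    doc2 = nlp(u"orange")
    print(doc1.similarity(doc2))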