//- 💫 DOCS > USAGE > VECTORS & SIMILARITY > CUSTOM VECTORS

p
    | Word vectors let you import knowledge from raw text into your model. The
    | knowledge is represented as a table of numbers, with one row per term in
    | your vocabulary. If two terms are used in similar contexts, the algorithm
    | that learns the vectors should assign them
    | #[strong rows that are quite similar], while words that are used in
    | different contexts will have quite different values. This lets you use
    | the row-values assigned to the words as a kind of dictionary, to tell you
    | some things about what the words in your text mean.
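p
    | As a rough illustration of what "similar rows" means, the sketch below
    | compares rows of a tiny hand-made table using cosine similarity. The
    | table values and the #[code cosine] helper are made up for this example
    | and are not part of spaCy.

+code("Comparing rows (illustrative sketch)").
    import math

    # Toy vector table: one row per term (values invented for illustration)
    table = {
        "dog":    [0.9, 0.8, 0.1],
        "cat":    [0.8, 0.9, 0.2],
        "banana": [0.1, 0.2, 0.9],
    }

    def cosine(a, b):
        """Cosine similarity between two equal-length rows."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Terms used in similar contexts should end up with similar rows
    sim_dog_cat = cosine(table["dog"], table["cat"])
    sim_dog_banana = cosine(table["dog"], table["banana"])
    assert sim_dog_cat > sim_dog_banana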
p
    | Word vectors are particularly useful for terms which
    | #[strong aren't well represented in your labelled training data].
    | For instance, if you're doing named entity recognition, there will always
    | be lots of names that you don't have examples of. Imagine
    | your training data happens to contain some examples of the term
    | "Microsoft", but it doesn't contain any examples of the term "Symantec".
    | In your raw text sample, there are plenty of examples of both terms, and
    | they're used in similar contexts. The word vectors make that fact
    | available to the entity recognition model. It still won't see examples of
    | "Symantec" labelled as a company. However, it'll see that "Symantec" has
    | a word vector that usually corresponds to company terms, so it can
    | #[strong make the inference].

p
    | In order to make best use of the word vectors, you want the word vectors
    | table to cover a #[strong very large vocabulary]. However, most words are
    | rare, so most of the rows in a large word vectors table will be accessed
    | very rarely, or never at all. You can usually cover more than
    | #[strong 95% of the tokens] in your corpus with just
    | #[strong a few thousand rows] in the vector table. However, it's those
    | #[strong 5% of rare terms] where the word vectors are
    | #[strong most useful]. The problem is that increasing the size of the
    | vector table produces rapidly diminishing returns in coverage over these
    | rare terms.
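p
    | One way to check this trade-off on your own data is to measure how many
    | tokens the most frequent terms account for. The following is a minimal
    | sketch in plain Python; the #[code coverage] helper and the toy token
    | list are hypothetical and only meant to show the calculation.

+code("Token coverage (illustrative sketch)").
    from collections import Counter

    def coverage(tokens, n_rows):
        """Fraction of tokens covered by the n_rows most frequent terms."""
        counts = Counter(tokens)
        covered = sum(freq for _, freq in counts.most_common(n_rows))
        return float(covered) / len(tokens)

    # Toy corpus: 13 tokens, 8 distinct terms
    tokens = "the cat sat on the mat and the dog sat on the rug".split()
    top3 = coverage(tokens, 3)                     # the, sat, on
    everything = coverage(tokens, len(set(tokens)))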
+h(3, "converting") Converting word vectors for use in spaCy
    +tag-new("2.0.10")

p
    | Custom word vectors can be trained using a number of open-source libraries,
    | such as #[+a("https://radimrehurek.com/gensim") Gensim],
    | #[+a("https://fasttext.cc") FastText], or Tomas Mikolov's original
    | #[+a("https://code.google.com/archive/p/word2vec/") word2vec implementation].
    | Most word vector libraries output an easy-to-read text-based format, where
    | each line consists of the word followed by its vector. For everyday use,
    | we want to convert the vectors model into a binary format that loads faster
    | and takes up less space on disk. The easiest way to do this is the
    | #[+api("cli#init-model") #[code init-model]] command-line utility:

+code(false, "bash").
    wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
    python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
p
    | This will output a spaCy model in the directory
    | #[code /tmp/la_vectors_wiki_lg], giving you access to some nice Latin
    | vectors 😉 You can then pass the directory path to
    | #[+api("spacy#load") #[code spacy.load()]].

+code.
    import spacy

    nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
    doc1 = nlp_latin(u"Caecilius est in horto")
    doc2 = nlp_latin(u"servus est in atrio")
    doc1.similarity(doc2)

p
    | The model directory will have a #[code /vocab] directory with the strings,
    | lexical entries and word vectors from the input vectors model. The
    | #[+api("cli#init-model") #[code init-model]] command supports a number of
    | archive formats for the word vectors: the vectors can be in plain text
    | (#[code .txt]), zipped (#[code .zip]), or tarred and zipped
    | (#[code .tgz]).
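p
    | To make the text-based input format more concrete, here is a minimal
    | sketch of a reader for lines consisting of a word followed by its vector.
    | The #[code read_text_vectors] helper is hypothetical and is not how
    | #[code init-model] is implemented. It also assumes that a line with
    | exactly two fields is a "rows dimensions" header, which some tools write
    | as the first line.

+code("Reading the text format (illustrative sketch)").
    def read_text_vectors(lines):
        """Parse lines of the form 'word v1 v2 ... vn' into a dict."""
        vectors = {}
        for line in lines:
            if not line.strip():
                continue  # ignore blank lines
            pieces = line.rstrip().split(" ")
            if len(pieces) == 2:
                # Assumption: a two-field line is a header, so skip it
                continue
            vectors[pieces[0]] = [float(x) for x in pieces[1:]]
        return vectors

    sample = ["2 3", "horto 0.1 0.2 0.3", "atrio 0.4 0.5 0.6"]
    vectors = read_text_vectors(sample)
    assert sorted(vectors) == ["atrio", "horto"]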
+h(3, "custom-vectors-coverage") Optimising vector coverage
    +tag-new(2)

p
    | To help you strike a good balance between coverage and memory usage,
    | spaCy's #[+api("vectors") #[code Vectors]] class lets you map
    | #[strong multiple keys] to the #[strong same row] of the table. If
    | you're using the #[+api("cli#vocab") #[code spacy vocab]] command to
    | create a vocabulary, pruning the vectors will be taken care of
    | automatically. You can also do it manually in the following steps:

+list("numbers")
    +item
        | Start with a #[strong word vectors model] that covers a huge
        | vocabulary. For instance, the
        | #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]] model
        | provides 300-dimensional GloVe vectors for over 1 million terms of
        | English.

    +item
        | If your vocabulary has values set for the #[code Lexeme.prob]
        | attribute, the lexemes will be sorted by descending probability to
        | determine which vectors to prune. Otherwise, lexemes will be sorted
        | by their order in the #[code Vocab].

    +item
        | Call #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]] with
        | the number of vectors you want to keep.

+code.
    import spacy

    nlp = spacy.load('en_vectors_web_lg')
    n_vectors = 105000  # number of vectors to keep
    removed_words = nlp.vocab.prune_vectors(n_vectors)

    assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
    assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
p
    | #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]] reduces the
    | current vector table to a given number of unique entries, and returns a
    | dictionary containing the removed words, mapped to #[code (string, score)]
    | tuples, where #[code string] is the entry the removed word was mapped
    | to, and #[code score] the similarity score between the two words.

+code("Removed words").
    {
        'Shore': ('coast', 0.732257),
        'Precautionary': ('caution', 0.490973),
        'hopelessness': ('sadness', 0.742366),
        'Continous': ('continuous', 0.732549),
        'Disemboweled': ('corpse', 0.499432),
        'biostatistician': ('scientist', 0.339724),
        'somewheres': ('somewheres', 0.402736),
        'observing': ('observe', 0.823096),
        'Leaving': ('leaving', 1.0)
    }

p
    | In the example above, the vector for "Shore" was removed and remapped
    | to the vector of "coast", which is deemed about 73% similar. "Leaving"
    | was remapped to the vector of "leaving", which is identical.
p
    | If you're using the #[+api("cli#init-model") #[code init-model]] command,
    | you can set the #[code --prune-vectors] option to easily reduce the size
    | of the vectors as you add them to a spaCy model:

+code(false, "bash", "$").
    python -m spacy init-model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000

p
    | This will create a spaCy model with vectors for the first 10,000 words in
    | the vectors model. All other words in the vectors model are mapped to the
    | closest vector among those retained.
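p
    | The idea of mapping each removed word to the closest retained vector can
    | be sketched in a few lines of plain Python. This is not spaCy's
    | implementation: the #[code prune] and #[code cosine] helpers are
    | hypothetical, the rows are invented, and the sketch assumes the words are
    | already sorted so that the ones to keep come first. The return value
    | mirrors the #[code (string, score)] shape described above.

+code("Remapping pruned words (illustrative sketch)").
    import math

    def cosine(a, b):
        """Cosine similarity between two equal-length rows."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def prune(words, rows, n_keep):
        """Keep the first n_keep rows and remap all other words to the
        most similar kept row."""
        kept = list(zip(words[:n_keep], rows[:n_keep]))
        remapped = {}
        for word, row in zip(words[n_keep:], rows[n_keep:]):
            best_word, best_row = max(kept, key=lambda item: cosine(row, item[1]))
            remapped[word] = (best_word, cosine(row, best_row))
        return remapped

    words = ["coast", "sadness", "shore"]
    rows = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]]
    remapped = prune(words, rows, 2)
    assert remapped["shore"][0] == "coast"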
+h(3, "custom-vectors-add") Adding vectors
    +tag-new(2)

p
    | spaCy's new #[+api("vectors") #[code Vectors]] class greatly improves the
    | way word vectors are stored, accessed and used. The data is stored in
    | two structures:

+list
    +item
        | An array, which can be either on CPU or #[+a("#gpu") GPU].

    +item
        | A dictionary mapping string-hashes to rows in the table.
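p
    | To picture how the two structures fit together, here is a minimal sketch
    | in plain Python. The #[code TinyVectors] class and the #[code toy_hash]
    | function are hypothetical stand-ins invented for this example. They only
    | illustrate the layout: rows in one container, integer keys pointing at
    | rows in another, and the strings kept somewhere else entirely.

+code("Array plus key-to-row dictionary (illustrative sketch)").
    class TinyVectors(object):
        """Hypothetical stand-in for a vector table."""

        def __init__(self):
            self.data = []       # the "array": one list of floats per row
            self.key2row = {}    # integer key (string hash) -> row index

        def add(self, key, vector=None, row=None):
            """Add a new row for key, or point key at an existing row."""
            if row is None:
                self.data.append(list(vector))
                row = len(self.data) - 1
            self.key2row[key] = row
            return row

        def get(self, key):
            return self.data[self.key2row[key]]

    def toy_hash(string):
        # Placeholder hash for illustration only, not spaCy's hashing
        return sum(ord(c) * 31 ** i for i, c in enumerate(string)) % (2 ** 64)

    strings = {}  # hash-to-string mapping, stored separately from the table
    table = TinyVectors()
    for word in ["leaving", "Leaving"]:
        strings[toy_hash(word)] = word
    row = table.add(toy_hash("leaving"), vector=[0.1, 0.2])
    table.add(toy_hash("Leaving"), row=row)  # second key, same row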
p
    | Keep in mind that the #[code Vectors] class itself has no
    | #[+api("stringstore") #[code StringStore]], so you have to store the
    | hash-to-string mapping separately. If you need to manage the strings,
    | you should use the #[code Vectors] via the
    | #[+api("vocab") #[code Vocab]] class, e.g. #[code vocab.vectors]. To
    | add vectors to the vocabulary, you can use the
    | #[+api("vocab#set_vector") #[code Vocab.set_vector]] method.

+code("Adding vectors").
    import numpy
    from spacy.vocab import Vocab

    vector_data = {u'dog': numpy.random.uniform(-1, 1, (300,)),
                   u'cat': numpy.random.uniform(-1, 1, (300,)),
                   u'orange': numpy.random.uniform(-1, 1, (300,))}

    vocab = Vocab()
    for word, vector in vector_data.items():
        vocab.set_vector(word, vector)
+h(3, "custom-loading-glove") Loading GloVe vectors
    +tag-new(2)

p
    | spaCy comes with built-in support for loading
    | #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
    | a directory. The #[+api("vectors#from_glove") #[code Vectors.from_glove]]
    | method assumes a binary format, the vocab provided in a
    | #[code vocab.txt], and the naming scheme of
    | #[code vectors.{size}.[fd].bin]. For example:

+aside-code("Directory structure", "yaml").
    └── vectors
        ├── vectors.128.f.bin  # vectors file
        └── vocab.txt          # vocabulary

+table(["File name", "Dimensions", "Data type"])
    +row
        +cell #[code vectors.128.f.bin]
        +cell 128
        +cell float32

    +row
        +cell #[code vectors.300.d.bin]
        +cell 300
        +cell float64 (double)

+code.
    nlp = spacy.load('en')
    nlp.vocab.vectors.from_glove('/path/to/vectors')

p
    | If your instance of #[code Language] already contains vectors, they will
    | be overwritten. To create your own GloVe vectors model package like
    | spaCy's #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]],
    | you can call #[+api("language#to_disk") #[code nlp.to_disk]], and then
    | package the model using the #[+api("cli#package") #[code package]]
    | command.
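p
    | The file naming scheme described in this section encodes both the vector
    | width and the data type. As a rough sketch of how such a name can be
    | interpreted, the hypothetical #[code parse_vectors_name] helper below
    | extracts both parts with a regular expression. It is written for this
    | example only and is not taken from spaCy's source.

+code("Interpreting the file name (illustrative sketch)").
    import re

    # One or more digits for the size, then "f" (float32) or "d" (float64)
    NAME_RE = re.compile(r"^vectors\.(\d+)\.([fd])\.bin$")
    DTYPES = {"f": "float32", "d": "float64"}

    def parse_vectors_name(filename):
        """Return (dimensions, dtype name) for a vectors file name."""
        match = NAME_RE.match(filename)
        if match is None:
            raise ValueError("Unexpected file name: %s" % filename)
        return int(match.group(1)), DTYPES[match.group(2)]

    small = parse_vectors_name("vectors.128.f.bin")
    large = parse_vectors_name("vectors.300.d.bin")
    assert small == (128, "float32")
    assert large == (300, "float64")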
+h(3, "custom-similarity") Using custom similarity methods

p
    | By default, #[+api("token#vector") #[code Token.vector]] returns the
    | vector for its underlying #[+api("lexeme") #[code Lexeme]], while
    | #[+api("doc#vector") #[code Doc.vector]] and
    | #[+api("span#vector") #[code Span.vector]] return an average of the
    | vectors of their tokens. You can customise these
    | behaviours by modifying the #[code doc.user_hooks],
    | #[code doc.user_span_hooks] and #[code doc.user_token_hooks]
    | dictionaries.
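p
    | As a small sketch of what this can look like, the example below sets a
    | #[code "similarity"] entry in #[code doc.user_hooks] so that
    | #[code doc.similarity()] uses token overlap instead of vectors. The
    | #[code overlap_similarity] function is a made-up stand-in metric, and
    | the example assumes a blank English pipeline is enough for your use case.

+code("Custom similarity hook (illustrative sketch)").
    from spacy.lang.en import English

    def overlap_similarity(doc, other):
        """Jaccard overlap of lowercased token texts (stand-in metric)."""
        a = set(token.lower_ for token in doc)
        b = set(token.lower_ for token in other)
        union = a | b
        return float(len(a & b)) / len(union) if union else 0.0

    nlp = English()  # tokenizer only, no vectors needed
    doc1 = nlp.make_doc(u"I like cats")
    doc2 = nlp.make_doc(u"I like dogs")

    # The hook receives the Doc and the object it is compared to
    doc1.user_hooks["similarity"] = overlap_similarity
    score = doc1.similarity(doc2)  # 2 shared tokens out of 4 distinct ones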
+infobox
    | For more details on #[strong adding hooks] and #[strong overwriting] the
    | built-in #[code Doc], #[code Span] and #[code Token] methods, see the
    | usage guide on #[+a("/usage/processing-pipelines#user-hooks") user hooks].