mirror of https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00

Update word vectors & similarity workflow

parent b6c62baab3
commit af348025ec
@@ -6,46 +6,40 @@ p
 | Dense, real valued vectors representing distributional similarity
 | information are now a cornerstone of practical NLP. The most common way
 | to train these vectors is the #[+a("https://en.wikipedia.org/wiki/Word2vec") word2vec]
-| family of algorithms.
-
-+aside("Tip")
-| If you need to train a word2vec model, we recommend the implementation in
-| the Python library #[+a("https://radimrehurek.com/gensim/") Gensim].
-
-p
-| spaCy makes using word vectors very easy. The
-| #[+api("lexeme") #[code Lexeme]], #[+api("token") #[code Token]],
-| #[+api("span") #[code Span]] and #[+api("doc") #[code Doc]] classes all
-| have a #[code .vector] property, which is a 1-dimensional numpy array of
-| 32-bit floats:
-
-+code.
-import numpy
-
-apples, and_, oranges = nlp(u'apples and oranges')
-print(apples.vector.shape)
-# (1,)
-apples.similarity(oranges)
-
-p
-| By default, #[code Token.vector] returns the vector for its underlying
-| lexeme, while #[code Doc.vector] and #[code Span.vector] return an
-| average of the vectors of their tokens. You can customize these
-| behaviours by modifying the #[code doc.user_hooks],
-| #[code doc.user_span_hooks] and #[code doc.user_token_hooks]
-| dictionaries.
-
-+aside-code("Example").
-# TODO
-
-p
-| The default English model installs vectors for one million vocabulary
-| entries, using the 300-dimensional vectors trained on the Common Crawl
+| family of algorithms. The default
+| #[+a("/docs/usage/models#available") English model] installs
+| 300-dimensional vectors trained on the Common Crawl
 | corpus using the #[+a("http://nlp.stanford.edu/projects/glove/") GloVe]
 | algorithm. The GloVe common crawl vectors have become a de facto
 | standard for practical NLP.
 
-+aside-code("Example").
++aside("Tip: Training a word2vec model")
+| If you need to train a word2vec model, we recommend the implementation in
+| the Python library #[+a("https://radimrehurek.com/gensim/") Gensim].
+
++h(2, "101") Similarity and word vectors 101
++tag-model("vectors")
+
+include _spacy-101/_similarity
+include _spacy-101/_word-vectors
+
+
++h(2, "custom") Customising word vectors
+
+p
+| By default, #[+api("token#vector") #[code Token.vector]] returns the
+| vector for its underlying #[+api("lexeme") #[code Lexeme]], while
+| #[+api("doc#vector") #[code Doc.vector]] and
+| #[+api("span#vector") #[code Span.vector]] return an average of the
+| vectors of their tokens.
+
+p
+| You can customize these
+| behaviours by modifying the #[code doc.user_hooks],
+| #[code doc.user_span_hooks] and #[code doc.user_token_hooks]
+| dictionaries.
+
++code("Example").
 # TODO
 
 p
@@ -56,11 +50,14 @@ p
 | can use the #[code vocab.vectors_from_bin_loc()] method, which accepts a
 | path to a binary file written by #[code vocab.dump_vectors()].
 
-+aside-code("Example").
++code("Example").
 # TODO
 
 p
-| You can also load vectors from memory, by writing to the #[code lexeme.vector]
-| property. If the vectors you are writing are of different dimensionality
+| You can also load vectors from memory by writing to the
+| #[+api("lexeme#vector") #[code Lexeme.vector]] property. If the vectors
+| you are writing are of different dimensionality
 | from the ones currently loaded, you should first call
 | #[code vocab.resize_vectors(new_size)].
+
++h(2, "similarity") Similarity
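
The "Customising word vectors" paragraph in the diff above states that Doc.vector and Span.vector return an average of the vectors of their tokens, and the "Example" blocks are still marked TODO. The following is a minimal sketch of that averaging, plus a cosine similarity, in plain Python. It is not spaCy's implementation: the helper names `average_vectors` and `cosine_similarity` are hypothetical, the 3-dimensional toy vectors are made up for illustration (the vectors described in the diff are 300-dimensional), and returning 0.0 for a zero-length vector is an assumption of this sketch.

```python
import math


def average_vectors(vectors):
    """Return the element-wise mean of a list of equal-length vectors.

    Hypothetical helper: mirrors the idea of averaging token vectors,
    not spaCy's actual code.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors.

    Returns 0.0 if either vector has zero length (an assumption of
    this sketch, chosen to avoid dividing by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for per-token word vectors.
token_vectors = {
    "apples": [1.0, 0.0, 1.0],
    "and": [0.0, 1.0, 0.0],
    "oranges": [1.0, 0.0, 0.0],
}

# A "document" vector as the mean of its token vectors.
doc_vector = average_vectors(list(token_vectors.values()))

# Similarity between two tokens.
score = cosine_similarity(token_vectors["apples"], token_vectors["oranges"])
print(doc_vector, score)
```

A custom function of this shape is the kind of behaviour the diff says can be swapped in through the `doc.user_hooks`, `doc.user_span_hooks` and `doc.user_token_hooks` dictionaries. How a hook is registered is not shown in the commit, so it is left out here.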