mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 13:11:03 +03:00 
			
		
		
		
	* Integrate Python kernel via Binder * Add live model test for languages with examples * Update docs and code examples * Adjust margin (if not bootstrapped) * Add binder version to global config * Update terminal and executable code mixins * Pass attributes through infobox and section * Hide v-cloak * Fix example * Take out model comparison for now * Add meta text for compat * Remove chart.js dependency * Tidy up and simplify JS and port big components over to Vue * Remove chartjs example * Add Twitter icon * Add purple stylesheet option * Add utility for hand cursor (special cases only) * Add transition classes * Add small option for section * Add thumb object for small round thumbnail images * Allow unset code block language via "none" value (workaround to still allow unset language to default to DEFAULT_SYNTAX) * Pass through attributes * Add syntax highlighting definitions for Julia, R and Docker * Add website icon * Remove user survey from navigation * Don't hide GitHub icon on small screens * Make top navigation scrollable on small screens * Remove old resources page and references to it * Add Universe * Add helper functions for better page URL and title * Update site description * Increment versions * Update preview images * Update mentions of resources * Fix image * Fix social images * Fix problem with cover sizing and floats * Add divider and move badges into heading * Add docstrings * Reference converting section * Add section on converting word vectors * Move converting section to custom section and fix formatting * Remove old fastText example * Move extensions content to own section Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary) * Use better component example and add factories section * Add note on larger model * Use better example for non-vector * Remove similarity in context section Only works via small models with tensors so has always been kind of confusing * Add note on init-model command * Fix lightning tour examples and make excutable if possible * Add spacy train CLI section to train * Fix formatting and add video * Fix formatting * Fix textcat example description (resolves #2246) * Add dummy file to try resolve conflict * Delete dummy file * Tidy up [ci skip] * Ensure sufficient height of loading container * Add loading animation to universe * Update Thebelab build and use better startup message * Fix asset versioning * Fix typo [ci skip] * Add note on project idea label
		
			
				
	
	
		
			242 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			242 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| //- 💫 DOCS > USAGE > VECTORS & SIMILARITY > CUSTOM VECTORS
 | |
| 
 | |
| p
 | |
|     |  Word vectors let you import knowledge from raw text into your model. The
 | |
|     |  knowledge is represented as a table of numbers, with one row per term in
 | |
|     |  your vocabulary. If two terms are used in similar contexts, the algorithm
 | |
|     |  that learns the vectors should assign them
 | |
|     |  #[strong rows that are quite similar], while words that are used in
 | |
|     |  different contexts will have quite different values. This lets you use
 | |
|     |  the row-values assigned to the words as a kind of dictionary, to tell you
 | |
|     |  some things about what the words in your text mean.
 | |
| 
 | |
| p
 | |
|     |  Word vectors are particularly useful for terms which
 | |
|     |  #[strong aren't well represented in your labelled training data].
 | |
|     |  For instance, if you're doing named entity recognition, there will always
 | |
|     |  be lots of names that you don't have examples of. For instance, imagine
 | |
|     |  your training data happens to contain some examples of the term
 | |
|     |  "Microsoft", but it doesn't contain any examples of the term "Symantec".
 | |
|     |  In your raw text sample, there are plenty of examples of both terms, and
 | |
|     |  they're used in similar contexts. The word vectors make that fact
 | |
|     |  available to the entity recognition model. It still won't see examples of
 | |
|     |  "Symantec" labelled as a company. However, it'll see that "Symantec" has
 | |
|     |  a word vector that usually corresponds to company terms, so it can
 | |
|     |  #[strong make the inference].
 | |
| 
 | |
| p
 | |
|     |  In order to make best use of the word vectors, you want the word vectors
 | |
|     |  table to cover a #[strong very large vocabulary]. However, most words are
 | |
|     |  rare, so most of the rows in a large word vectors table will be accessed
 | |
|     |  very rarely, or never at all. You can usually cover more than
 | |
|     |  #[strong 95% of the tokens] in your corpus with just
 | |
|     |  #[strong a few thousand rows] in the vector table. However, it's those
 | |
|     |  #[strong 5% of rare terms] where the word vectors are
 | |
|     |  #[strong most useful]. The problem is that increasing the size of the
 | |
|     |  vector table produces rapidly diminishing returns in coverage over these
 | |
|     |  rare terms.
 | |
| 
 | |
| +h(3, "converting") Converting word vectors for use in spaCy
 | |
|     +tag-new("2.0.10")
 | |
| 
 | |
| p
 | |
|     |  Custom word vectors can be trained using a number of open-source libraries,
 | |
|     |  such as #[+a("https://radimrehurek.com/gensim") Gensim],
 | |
|     |  #[+a("https://fasttext.cc") Fast Text], or Tomas Mikolov's original
 | |
|     |  #[+a("https://code.google.com/archive/p/word2vec/") word2vec implementation].
 | |
|     |  Most word vector libraries output an easy-to-read text-based format, where
 | |
|     |  each line consists of the word followed by its vector. For everyday use,
 | |
|     |  we want to convert the vectors model into a binary format that loads faster
 | |
|     |  and takes up less space on disk. The easiest way to do this is the
 | |
|     |  #[+api("cli#init-model") #[code init-model]] command-line utility:
 | |
| 
 | |
| +code(false, "bash").
 | |
|     wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
 | |
|     python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
 | |
| 
 | |
| p
 | |
|     |  This will output a spaCy model in the directory
 | |
|     |  #[code /tmp/la_vectors_wiki_lg], giving you access to some nice Latin
 | |
|     |  vectors 😉 You can then pass the directory path to
 | |
|     |  #[+api("spacy#load") #[code spacy.load()]].
 | |
| 
 | |
| +code.
 | |
|     nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
 | |
|     doc1 = nlp(u"Caecilius est in horto")
 | |
|     doc2 = nlp(u"servus est in atrio")
 | |
|     doc1.similarity(doc2)
 | |
| 
 | |
| p
 | |
|     |  The model directory will have a #[code /vocab] directory with the strings,
 | |
|     |  lexical entries and word vectors from the input vectors model. The
 | |
|     |  #[+api("cli#init-model") #[code init-model]] command supports a number of
 | |
|     |  archive formats for the word vectors: the vectors can be in plain text
 | |
|     |  (#[code .txt]), zipped (#[code .zip]), or tarred and zipped
 | |
|     |  (#[code .tgz]).
 | |
| 
 | |
| +h(3, "custom-vectors-coverage") Optimising vector coverage
 | |
|     +tag-new(2)
 | |
| 
 | |
| p
 | |
|     |  To help you strike a good balance between coverage and memory usage,
 | |
|     |  spaCy's #[+api("vectors") #[code Vectors]] class lets you map
 | |
|     |  #[strong multiple keys] to the #[strong same row] of the table. If
 | |
|     |  you're using the #[+api("cli#vocab") #[code spacy vocab]] command to
 | |
|     |  create a vocabulary, pruning the vectors will be taken care of
 | |
|     |  automatically. You can also do it manually in the following steps:
 | |
| 
 | |
| +list("numbers")
 | |
|     +item
 | |
|         |  Start with a #[strong word vectors model] that covers a huge
 | |
|         |  vocabulary. For instance, the
 | |
|         |  #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]] model
 | |
|         |  provides 300-dimensional GloVe vectors for over 1 million terms of
 | |
|         |  English.
 | |
| 
 | |
|     +item
 | |
|         |  If your vocabulary has values set for the #[code Lexeme.prob]
 | |
|         |  attribute, the lexemes will be sorted by descending probability to
 | |
|         |  determine which vectors to prune. Otherwise, lexemes will be sorted
 | |
|         |  by their order in the #[code Vocab].
 | |
| 
 | |
|     +item
 | |
|         |  Call #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]] with
 | |
|         |  the number of vectors you want to keep.
 | |
| 
 | |
| +code.
 | |
|     nlp = spacy.load('en_vectors_web_lg')
 | |
|     n_vectors = 105000  # number of vectors to keep
 | |
|     removed_words = nlp.vocab.prune_vectors(n_vectors)
 | |
| 
 | |
|     assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
 | |
|     assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
 | |
| 
 | |
| p
 | |
|     |  #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]] reduces the
 | |
|     |  current vector table to a given number of unique entries, and returns a
 | |
|     |  dictionary containing the removed words, mapped to #[code (string, score)]
 | |
|     |  tuples, where #[code string] is the entry the removed word was mapped
 | |
|     |  to, and #[code score] the similarity score between the two words.
 | |
| 
 | |
| +code("Removed words").
 | |
|     {
 | |
|         'Shore': ('coast', 0.732257),
 | |
|         'Precautionary': ('caution', 0.490973),
 | |
|         'hopelessness': ('sadness', 0.742366),
 | |
|         'Continous': ('continuous', 0.732549),
 | |
|         'Disemboweled': ('corpse', 0.499432),
 | |
|         'biostatistician': ('scientist', 0.339724),
 | |
|         'somewheres': ('somewheres', 0.402736),
 | |
|         'observing': ('observe', 0.823096),
 | |
|         'Leaving': ('leaving', 1.0)
 | |
|     }
 | |
| 
 | |
| p
 | |
|     |  In the example above, the vector for "Shore" was removed and remapped
 | |
|     |  to the vector of "coast", which is deemed about 73% similar. "Leaving"
 | |
|     |  was remapped to the vector of "leaving", which is identical.
 | |
| 
 | |
| p
 | |
|     |  If you're using the #[+api("cli#init-model") #[code init-model]] command,
 | |
|     |  you can set the #[code --prune-vectors] option to easily reduce the size
 | |
|     |  of the vectors as you add them to a spaCy model:
 | |
| 
 | |
| +code(false, "bash", "$").
 | |
|     python -m spacy init-model /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
 | |
| 
 | |
| p
 | |
|     |  This will create a spaCy model with vectors for the first 10,000 words in
 | |
|     |  the vectors model. All other words in the vectors model are mapped to the
 | |
|     |  closest vector among those retained.
 | |
| 
 | |
| +h(3, "custom-vectors-add") Adding vectors
 | |
|     +tag-new(2)
 | |
| 
 | |
| p
 | |
|     |  spaCy's new #[+api("vectors") #[code Vectors]] class greatly improves the
 | |
|     |  way word vectors are stored, accessed and used. The data is stored in
 | |
|     |  two structures:
 | |
| 
 | |
| +list
 | |
|     +item
 | |
|         |  An array, which can be either on CPU or #[+a("#gpu") GPU].
 | |
| 
 | |
|     +item
 | |
|         |  A dictionary mapping string-hashes to rows in the table.
 | |
| 
 | |
| p
 | |
|     |  Keep in mind that the #[code Vectors] class itself has no
 | |
|     |  #[+api("stringstore") #[code StringStore]], so you have to store the
 | |
|     |  hash-to-string mapping separately. If you need to manage the strings,
 | |
|     |  you should use the #[code Vectors] via the
 | |
|     |  #[+api("vocab") #[code Vocab]] class, e.g. #[code vocab.vectors]. To
 | |
|     |  add vectors to the vocabulary, you can use the
 | |
|     |  #[+api("vocab#set_vector") #[code Vocab.set_vector]] method.
 | |
| 
 | |
| +code("Adding vectors").
 | |
|     from spacy.vocab import Vocab
 | |
| 
 | |
|     vector_data = {u'dog': numpy.random.uniform(-1, 1, (300,)),
 | |
|                    u'cat': numpy.random.uniform(-1, 1, (300,)),
 | |
|                    u'orange': numpy.random.uniform(-1, 1, (300,))}
 | |
| 
 | |
|     vocab = Vocab()
 | |
|     for word, vector in vector_data.items():
 | |
|         vocab.set_vector(word, vector)
 | |
| 
 | |
| +h(3, "custom-loading-glove") Loading GloVe vectors
 | |
|     +tag-new(2)
 | |
| 
 | |
| p
 | |
|     |  spaCy comes with built-in support for loading
 | |
|     |  #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
 | |
|     |  a directory. The #[+api("vectors#from_glove") #[code Vectors.from_glove]]
 | |
|     |  method assumes a binary format, the vocab provided in a
 | |
|     |  #[code vocab.txt], and the naming scheme of
 | |
|     |  #[code vectors.{size}.[fd].bin]. For example:
 | |
| 
 | |
| +aside-code("Directory structure", "yaml").
 | |
|     └── vectors
 | |
|         ├── vectors.128.f.bin  # vectors file
 | |
|         └── vocab.txt          # vocabulary
 | |
| 
 | |
| +table(["File name", "Dimensions", "Data type"])
 | |
|     +row
 | |
|         +cell #[code vectors.128.f.bin]
 | |
|         +cell 128
 | |
|         +cell float32
 | |
| 
 | |
|     +row
 | |
|         +cell #[code vectors.300.d.bin]
 | |
|         +cell 300
 | |
|         +cell float64 (double)
 | |
| 
 | |
| +code.
 | |
|     nlp = spacy.load('en')
 | |
|     nlp.vocab.vectors.from_glove('/path/to/vectors')
 | |
| 
 | |
| p
 | |
|     |  If your instance of #[code Language] already contains vectors, they will
 | |
|     |  be overwritten. To create your own GloVe vectors model package like
 | |
|     |  spaCy's #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]],
 | |
|     |  you can call #[+api("language#to_disk") #[code nlp.to_disk]], and then
 | |
|     |  package the model using the #[+api("cli#package") #[code package]]
 | |
|     |  command.
 | |
| 
 | |
| +h(3, "custom-similarity") Using custom similarity methods
 | |
| 
 | |
| p
 | |
|     |  By default, #[+api("token#vector") #[code Token.vector]] returns the
 | |
|     |  vector for its underlying #[+api("lexeme") #[code Lexeme]], while
 | |
|     |  #[+api("doc#vector") #[code Doc.vector]] and
 | |
|     |  #[+api("span#vector") #[code Span.vector]] return an average of the
 | |
|     |  vectors of their tokens. You can customise these
 | |
|     |  behaviours by modifying the #[code doc.user_hooks],
 | |
|     |  #[code doc.user_span_hooks] and #[code doc.user_token_hooks]
 | |
|     |  dictionaries.
 | |
| 
 | |
| +infobox
 | |
|     |  For more details on #[strong adding hooks] and #[strong overwriting] the
 | |
|     |  built-in #[code Doc], #[code Span] and #[code Token] methods, see the
 | |
|     |  usage guide on #[+a("/usage/processing-pipelines#user-hooks") user hooks].
 |