* Write updated load-new-word-vectors documentation

2025-11-27 05:15:43 +03:00 · 2015-09-24 19:24:23 +10:00 · 2015-09-24 19:24:23 +10:00 · 916de3c215
commit 916de3c215
parent 3b3547251c
1 changed files with 46 additions and 26 deletions
--- a/website/src/jade/tutorials/load-new-word-vectors/index.jade
+++ b/website/src/jade/tutorials/load-new-word-vectors/index.jade
@@ -3,11 +3,22 @@ include ../header.jade

 +WritePost(Meta)

-    p By default spaCy loads a #[code data/vocab/vec.bin] file, where the #[em data] directory is within the #[code spacy.en] module directory.
+    p By default spaCy loads a #[code data/vocab/vec.bin] file, where the #[em data] directory is within the #[code spacy.en] module directory. Replace this file to customize the word vectors that spaCy loads by default. You can also replace the word vectors at run-time.

-    p You can customize the word vectors loaded by spaCy in three different ways. For the first two, you'll need to convert your vectors into spaCy's binary file format. The binary format is used because it's smaller and loads faster.

-    p You can either place the binary file in the location spaCy expects
+    h4 Replacing vec.bin
+
+    p The function #[code spacy.vocab.write_binary_vectors] creates a word vectors file in spaCy's binary data format. It expects a #[code bz2] file in the following format:
+
+    pre
+        code
+            word_key1 0.92 0.45 -0.9 0.0
+            word_key2 0.3 0.1 0.6 0.3
+            ...
+
+    p That is, each line is a single entry. Each entry consists of a key string, followed by a sequence of floats. Each entry should have the same number of floats.
+
+    p The following example script will replace the #[code vec.bin] file with vectors read from a #[code bz2] archive:

    pre
        code.language-python
@@ -24,28 +35,37 @@ include ../header.jade
            |     plac.call(main)

    
+    h4 Replace vectors at run-time, from an archive

-    ol
-        li Replace the vec.bin, so your vectors will be loaded by default. The function #[code spacy.vocab.write_binary_vectors] is provided to convert files to spaCy's binary format. The advantage of the binary format is that it's smaller and loads faster.
-
-        li Load vectors at run-time
-        
-
-Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors
-Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
-The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
-
-Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
-    Load new vectors at run-time, optionally converting them
+    p Since v0.93, instances of #[code Vocab] can load new vectors directly from #[code bz2] archive files. The call returns the number of dimensions of the loaded vectors:

    pre
        code.language-python
-            | import spacy.vocab
+            | >>> from spacy.en import English
+            | >>> nlp = English()
+            | >>> n_dimensions = nlp.vocab.load_vectors('glove.840B.300d.txt.bz2')
+            | >>> n_dimensions
+            | 300

-            | def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
-            |     if bz2_loc is not None:
-            |         spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)
-            |     write_binary_vectors(bz2_input_loc, binary_loc)
-            |  
-            |     nlp.vocab.load_rep_vectors(binary_loc)
+    h4 Replace vectors at run-time, per word

+    p Since v0.93, you can assign to the #[code .vector] attribute of #[code Lexeme] instances. Tokens of that lexical type will then inherit the updated vector. For instance:
+
+    pre
+        code.language-python
+            | >>> from spacy.en import English
+            | >>> nlp = English()
+            | >>> doc = nlp(u'apples oranges')
+            | >>> apples, oranges = doc
+            | >>> apples_lexeme = nlp.vocab[u'apples']
+            | >>> type(apples), type(apples_lexeme)
+            | (<type 'spacy.tokens.token.Token'>, <type 'spacy.lexeme.Lexeme'>)
+            | >>> sum(apples.vector)
+            | 0.56299778164247982
+            | >>> apples_lexeme.vector *= 2
+            | >>> sum(apples.vector)
+            | 1.1259955632849596
+
+    p All tokens which have the #[code orth] attribute #[em apples] will inherit the updated vector.
+
+    p Note that updated vectors are not saved when the process exits. To keep them, write them out yourself and then replace the #[code vec.bin] file as described above.
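
The text input format that the new documentation describes for `spacy.vocab.write_binary_vectors` (one entry per line, a key string followed by the same number of floats, the whole file bz2-compressed) can be produced with the standard library alone. The following is a minimal sketch, not spaCy code: the helper names `write_vectors_bz2` and `read_vectors_bz2` are hypothetical, it assumes Python 3 (`bz2.open` in text mode), a single space as separator, and keys that contain no whitespace. How spaCy's own reader treats other separators or encodings is not stated in the diff.

```python
import bz2
import os
import tempfile


def write_vectors_bz2(path, vectors):
    """Write {key: [floats]} as 'key f1 f2 ...' lines into a bz2 archive.

    Hypothetical helper for illustration; assumes keys contain no whitespace.
    """
    # The documented format requires every entry to have the same width.
    widths = {len(values) for values in vectors.values()}
    if len(widths) > 1:
        raise ValueError("every entry needs the same number of floats")
    with bz2.open(path, "wt", encoding="utf8") as file_:
        for key, values in vectors.items():
            floats = " ".join(repr(float(value)) for value in values)
            file_.write(key + " " + floats + "\n")


def read_vectors_bz2(path):
    """Read the same format back into a {key: [floats]} dict."""
    vectors = {}
    with bz2.open(path, "rt", encoding="utf8") as file_:
        for line in file_:
            pieces = line.split()
            if not pieces:
                continue  # skip blank lines
            vectors[pieces[0]] = [float(piece) for piece in pieces[1:]]
    return vectors


if __name__ == "__main__":
    # Round-trip the two sample entries shown in the documentation.
    sample = {
        "word_key1": [0.92, 0.45, -0.9, 0.0],
        "word_key2": [0.3, 0.1, 0.6, 0.3],
    }
    archive_loc = os.path.join(tempfile.mkdtemp(), "vectors.txt.bz2")
    write_vectors_bz2(archive_loc, sample)
    print(read_vectors_bz2(archive_loc))
```

An archive written this way has the layout the documentation says `write_binary_vectors` and `Vocab.load_vectors` expect. Checking the entry widths before writing surfaces a malformed vector table early, rather than at load time.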