From f7adc9d3b713ea473469e2371f5ce816bdc7e406 Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Wed, 29 Jul 2020 17:10:06 +0200
Subject: [PATCH] Start rewriting vectors docs

---
 website/docs/usage/vectors-embeddings.md | 156 ++++++++++-------------
 1 file changed, 68 insertions(+), 88 deletions(-)

diff --git a/website/docs/usage/vectors-embeddings.md b/website/docs/usage/vectors-embeddings.md
index 7725068ec..8f6315901 100644
--- a/website/docs/usage/vectors-embeddings.md
+++ b/website/docs/usage/vectors-embeddings.md
@@ -5,54 +5,82 @@ menu:
   - ['Other Embeddings', 'embeddings']
 ---
 
-
-
 ## Word vectors and similarity
 
-> #### Training word vectors
->
-> Dense, real valued vectors representing distributional similarity information
-> are now a cornerstone of practical NLP. The most common way to train these
-> vectors is the [Word2vec](https://en.wikipedia.org/wiki/Word2vec) family of
-> algorithms. If you need to train a word2vec model, we recommend the
-> implementation in the Python library
-> [Gensim](https://radimrehurek.com/gensim/).
+An old idea in linguistics is that you can "know a word by the company it
+keeps": that is, word meanings can be understood relationally, based on their
+patterns of usage. This idea inspired a branch of NLP research known as
+"distributional semantics" that aims to compute databases of lexical knowledge
+automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec) family of
+algorithms is a key milestone in this line of research. For simplicity, we
+will refer to a distributional word representation as a "word vector", and to
+algorithms that compute word vectors (such as GloVe and FastText) as
+"word2vec algorithms".
 
-import Vectors101 from 'usage/101/\_vectors-similarity.md'
+Word vector tables are included in some of the spaCy model packages we
+distribute, and you can easily create your own model packages with word vectors
+you train or download yourself. In some cases you can also add word vectors to
+an existing pipeline, although each pipeline can only have a single word
+vectors table, and a model package that already has word vectors is unlikely to
+work correctly if you replace the vectors with new ones.
 
-
+## What's a word vector?
 
-### Customizing word vectors {#custom}
+For spaCy's purposes, a "word vector" is a 1-dimensional slice from
+a 2-dimensional _vectors table_, with a deterministic mapping from word types
+to rows in the table.
 
-Word vectors let you import knowledge from raw text into your model. The
-knowledge is represented as a table of numbers, with one row per term in your
-vocabulary. If two terms are used in similar contexts, the algorithm that learns
-the vectors should assign them **rows that are quite similar**, while words that
-are used in different contexts will have quite different values. This lets you
-use the row-values assigned to the words as a kind of dictionary, to tell you
-some things about what the words in your text mean.
+```python
+from typing import Dict
+
+from thinc.types import Floats1d, Floats2d
+
+def what_is_a_word_vector(
+    word_id: int,
+    key2row: Dict[int, int],
+    vectors_table: Floats2d,
+    *,
+    default_row: int = 0
+) -> Floats1d:
+    # Look up the word's row; unknown words fall back to the default row,
+    # which is usually reserved for an all-zeros "OOV" vector.
+    return vectors_table[key2row.get(word_id, default_row)]
+```
 
-Word vectors are particularly useful for terms which **aren't well represented
-in your labelled training data**. For instance, if you're doing named entity
-recognition, there will always be lots of names that you don't have examples of.
-For instance, imagine your training data happens to contain some examples of the
-term "Microsoft", but it doesn't contain any examples of the term "Symantec". In
-your raw text sample, there are plenty of examples of both terms, and they're
-used in similar contexts. The word vectors make that fact available to the
-entity recognition model. It still won't see examples of "Symantec" labelled as
-a company. However, it'll see that "Symantec" has a word vector that usually
-corresponds to company terms, so it can **make the inference**.
+Word2vec algorithms try to produce vectors tables that let you estimate useful
+relationships between words using simple linear algebra operations. For
+instance, you can often find close synonyms of a word by finding the vectors
+closest to it by cosine distance, and then finding the words that are mapped to
+those neighboring vectors. Word vectors can also be useful as features in
+statistical models.
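+
+For example, you can query a table for a word's nearest neighbors with
+[`Vectors.most_similar`](/api/vectors#most_similar). A minimal sketch, assuming
+a package with word vectors such as `en_core_web_md` is installed:
+
+```python
+### Finding synonym candidates
+import numpy
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+# most_similar expects a batch of query vectors, so stack the single
+# query into a 2-dimensional array.
+queries = numpy.asarray([nlp.vocab["cat"].vector])
+keys, rows, scores = nlp.vocab.vectors.most_similar(queries, n=5)
+print([nlp.vocab.strings[int(key)] for key in keys[0]])
+```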
 
-In order to make best use of the word vectors, you want the word vectors table
-to cover a **very large vocabulary**. However, most words are rare, so most of
-the rows in a large word vectors table will be accessed very rarely, or never at
-all. You can usually cover more than **95% of the tokens** in your corpus with
-just **a few thousand rows** in the vector table. However, it's those **5% of
-rare terms** where the word vectors are **most useful**. The problem is that
-increasing the size of the vector table produces rapidly diminishing returns in
-coverage over these rare terms.
+The key difference between word vectors and contextual language models such as
+ELMo, BERT and GPT-2 is that word vectors model _lexical types_, rather than
+_tokens_. If you have a list of terms with no context around them, a model like
+BERT can't really help you: it's designed to understand language in context,
+which isn't what you have. A word vectors table will be a much better fit for
+your task. However, if you do have words in context --- whole sentences or
+paragraphs of running text --- word vectors will only provide a very rough
+approximation of what the text is about.
 
-### Converting word vectors for use in spaCy {#converting new="2.0.10"}
+Word vectors are also very computationally efficient, as they map a word to a
+vector with a single indexing operation. Word vectors are therefore useful as a
+way to improve the accuracy of neural network models, especially models that
+are small or have received little or no pretraining. In spaCy, word vector
+tables are only used as static features. spaCy does not backpropagate gradients
+to the pretrained word vectors table. The static vectors table is usually used
+in combination with a smaller table of learned task-specific embeddings.
+
+## Using word vectors directly
+
+spaCy stores word vector information in the `vocab.vectors` attribute, so you
+can access the whole vectors table from most spaCy objects. You can also access
+the vector for a `Doc`, `Span`, `Token` or `Lexeme` instance via the `vector`
+attribute. If your `Doc` or `Span` has multiple tokens, the average of the
+word vectors will be returned, excluding any "out of vocabulary" entries that
+have no vector available. If none of the words have a vector, a zeroed vector
+will be returned.
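+
+A minimal sketch of these attributes in use, again assuming a package with word
+vectors such as `en_core_web_md` is installed:
+
+```python
+### Accessing vectors
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+doc = nlp("The quick brown fox")
+print(doc.vector.shape)           # average of the token vectors
+print(doc[3].vector[:5])          # first dimensions of the vector for "fox"
+print(doc[1].similarity(doc[2]))  # cosine similarity of "quick" and "brown"
+```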
+
+The `vector` attribute is a read-only numpy or cupy array (depending on whether
+you've configured spaCy to use GPU memory), with dtype `float32`. The array is
+read-only so that spaCy can avoid unnecessary copy operations where possible.
+You can modify the vectors via the `Vocab` or `Vectors` table.
+
+### Converting word vectors for use in spaCy
 
 Custom word vectors can be trained using a number of open-source libraries,
 such as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc),
@@ -151,20 +179,7 @@ This will create a spaCy model with vectors for the first 10,000 words in the
 vectors model. All other words in the vectors model are mapped to the closest
 vector among those retained.
 
-### Adding vectors {#custom-vectors-add new="2"}
-
-spaCy's new [`Vectors`](/api/vectors) class greatly improves the way word
-vectors are stored, accessed and used. The data is stored in two structures:
-
-- An array, which can be either on CPU or [GPU](#gpu).
-- A dictionary mapping string-hashes to rows in the table.
-
-Keep in mind that the `Vectors` class itself has no
-[`StringStore`](/api/stringstore), so you have to store the hash-to-string
-mapping separately. If you need to manage the strings, you should use the
-`Vectors` via the [`Vocab`](/api/vocab) class, e.g. `vocab.vectors`. To add
-vectors to the vocabulary, you can use the
-[`Vocab.set_vector`](/api/vocab#set_vector) method.
+### Adding vectors
 
 ```python
 ### Adding vectors
@@ -196,38 +211,3 @@ For more details on **adding hooks** and **overwriting** the built-in `Doc`,
 
 ### Storing vectors on a GPU {#gpu}
 
-If you're using a GPU, it's much more efficient to keep the word vectors on the
-device. You can do that by setting the [`Vectors.data`](/api/vectors#attributes)
-attribute to a `cupy.ndarray` object if you're using spaCy or
-[Chainer](https://chainer.org), or a `torch.Tensor` object if you're using
-[PyTorch](http://pytorch.org). The `data` object just needs to support
-`__iter__` and `__getitem__`, so if you're using another library such as
-[TensorFlow](https://www.tensorflow.org), you could also create a wrapper for
-your vectors data.
-
-```python
-### spaCy, Thinc or Chainer
-import cupy.cuda
-from spacy.vectors import Vectors
-
-vector_table = numpy.zeros((3, 300), dtype="f")
-vectors = Vectors(["dog", "cat", "orange"], vector_table)
-with cupy.cuda.Device(0):
-    vectors.data = cupy.asarray(vectors.data)
-```
-
-```python
-### PyTorch
-import torch
-from spacy.vectors import Vectors
-
-vector_table = numpy.zeros((3, 300), dtype="f")
-vectors = Vectors(["dog", "cat", "orange"], vector_table)
-vectors.data = torch.Tensor(vectors.data).cuda(0)
-```
-
-## Other embeddings {#embeddings}
-
-
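+If you're using a GPU, it's much more efficient to keep the word vectors table
+in device memory. A minimal sketch of the idea, assuming CuPy is installed and
+a GPU is available, is to swap the table's `data` array for one allocated on
+the device:
+
+```python
+### Vectors on GPU
+import cupy
+import numpy
+from spacy.vectors import Vectors
+
+vector_table = numpy.zeros((3, 300), dtype="f")
+vectors = Vectors(data=vector_table, keys=["dog", "cat", "orange"])
+# Replace the underlying array with a copy on the current GPU device.
+vectors.data = cupy.asarray(vectors.data)
+```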