menu:
  - ['Other Embeddings', 'embeddings']
---

<!-- TODO: rewrite and include both details on word vectors, other word embeddings, spaCy transformers, doc.tensor, tok2vec -->

## Word vectors and similarity

> #### Training word vectors
>
> Dense, real-valued vectors representing distributional similarity information
> are now a cornerstone of practical NLP. The most common way to train these
> vectors is the [Word2vec](https://en.wikipedia.org/wiki/Word2vec) family of
> algorithms. If you need to train a word2vec model, we recommend the
> implementation in the Python library
> [Gensim](https://radimrehurek.com/gensim/).

An old idea in linguistics is that you can "know a word by the company it
keeps": that is, word meanings can be understood relationally, based on their
patterns of usage. This idea inspired a branch of NLP research known as
"distributional semantics" that has aimed to compute databases of lexical
knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
family of algorithms is a key milestone in this line of research. For
simplicity, we will refer to a distributional word representation as a "word
vector", and to algorithms that compute word vectors (such as GloVe, FastText,
etc.) as "word2vec algorithms".

Word vector tables are included in some of the spaCy model packages we
distribute, and you can easily create your own model packages with word vectors
you train or download yourself. In some cases you can also add word vectors to
an existing pipeline, although each pipeline can only have a single word
vectors table, and a model package that already has word vectors is unlikely to
work correctly if you replace the vectors with new ones.

import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />
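
For example, you can check whether a loaded pipeline ships with a vectors table
by looking at the shared vocab. A minimal sketch, assuming the `en_core_web_md`
package (which includes vectors) is installed:

```python
import spacy

# Assumes a package with vectors is installed, e.g.:
# python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# The vectors table lives on the shared vocab
print(nlp.vocab.vectors.shape)         # (number of rows, vector width)
print(nlp.vocab["cheese"].has_vector)  # whether "cheese" has a row in the table
```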

## What's a word vector?

For spaCy's purposes, a "word vector" is a 1-dimensional slice from a
2-dimensional _vectors table_, with a deterministic mapping from word types to
rows in the table.

```python
from typing import Dict

from thinc.types import Floats1d, Floats2d

def what_is_a_word_vector(
    word_id: int,
    key2row: Dict[int, int],
    vectors_table: Floats2d,
    *,
    default_row: int = 0,
) -> Floats1d:
    # Map the word's ID to a row index, falling back to a default row for
    # out-of-vocabulary words, and return that slice of the table.
    return vectors_table[key2row.get(word_id, default_row)]
```

Word vectors let you import knowledge from raw text into your model. The
knowledge is represented as a table of numbers, with one row per term in your
vocabulary. If two terms are used in similar contexts, the algorithm that
learns the vectors should assign them **rows that are quite similar**, while
words that are used in different contexts will have quite different values.
This lets you use the row-values assigned to the words as a kind of dictionary,
to tell you some things about what the words in your text mean.

Word vectors are particularly useful for terms which **aren't well represented
in your labelled training data**. For instance, if you're doing named entity
recognition, there will always be lots of names that you don't have examples
of. Imagine your training data happens to contain some examples of the term
"Microsoft", but no examples of the term "Symantec". In your raw text sample,
there are plenty of examples of both terms, and they're used in similar
contexts. The word vectors make that fact available to the entity recognition
model. It still won't see examples of "Symantec" labelled as a company.
However, it'll see that "Symantec" has a word vector that usually corresponds
to company terms, so it can **make the inference**.

word2vec algorithms try to produce vectors tables that let you estimate useful
relationships between words using simple linear algebra operations. For
instance, you can often find close synonyms of a word by finding the vectors
closest to it by cosine distance, and then finding the words that are mapped to
those neighboring vectors. Word vectors can also be useful as features in
statistical models.
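
As an illustration of the kind of linear algebra involved, here's a small
sketch (toy data, not spaCy's implementation) that finds a word's nearest
neighbors in a vectors table by cosine similarity:

```python
import numpy

# Toy vocabulary and a random vectors table, one row per word
words = ["dog", "cat", "orange", "apple"]
vectors_table = numpy.random.rand(len(words), 300).astype("float32")

def most_similar(word: str, n: int = 3):
    # Normalize the rows so that a dot product equals the cosine similarity
    normed = vectors_table / numpy.linalg.norm(vectors_table, axis=1, keepdims=True)
    scores = normed @ normed[words.index(word)]
    best = numpy.argsort(-scores)  # highest scores first
    return [(words[i], float(scores[i])) for i in best if words[i] != word][:n]

print(most_similar("dog"))
```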

The key difference between word vectors and contextual language models such as
ELMo, BERT and GPT-2 is that word vectors model _lexical types_, rather than
_tokens_. If you have a list of terms with no context around them, a model like
BERT can't really help you. BERT is designed to understand language in context,
which isn't what you have. A word vectors table will be a much better fit for
your task. However, if you do have words in context --- whole sentences or
paragraphs of running text --- word vectors will only provide a very rough
approximation of what the text is about.

In order to make best use of the word vectors, you want the word vectors table
to cover a **very large vocabulary**. However, most words are rare, so most of
the rows in a large word vectors table will be accessed very rarely, or never
at all. You can usually cover more than **95% of the tokens** in your corpus
with just **a few thousand rows** in the vector table. However, it's those **5%
of rare terms** where the word vectors are **most useful**. The problem is that
increasing the size of the vector table produces rapidly diminishing returns in
coverage over these rare terms.
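
To get a feel for how well a vectors table covers your own text, you can simply
count the tokens that have an entry. A quick sketch, again assuming a package
with vectors such as `en_core_web_md` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes a package with vectors
doc = nlp("Reports say Symantec and Microsoft discussed the Quorblex filing.")

# "Quorblex" is a made-up word, so it won't have a row in the table
covered = sum(token.has_vector for token in doc)
print(f"{covered}/{len(doc)} tokens have a vector")
```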

Word vectors are also very computationally efficient, as they map a word to a
vector with a single indexing operation. Word vectors are therefore useful as a
way to improve the accuracy of neural network models, especially models that
are small or have received little or no pretraining. In spaCy, word vector
tables are only used as static features. spaCy does not backpropagate gradients
to the pretrained word vectors table. The static vectors table is usually used
in combination with a smaller table of learned task-specific embeddings.

## Using word vectors directly

spaCy stores word vector information in the `vocab.vectors` attribute, so you
can access the whole vectors table from most spaCy objects. You can also access
the vector for a `Doc`, `Span`, `Token` or `Lexeme` instance via the `vector`
attribute. If your `Doc` or `Span` has multiple tokens, the average of the word
vectors will be returned, excluding any "out of vocabulary" entries that have
no vector available. If none of the words have a vector, a zeroed vector will
be returned.
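
For instance, here's a short sketch of reading vectors off spaCy objects
directly, assuming a package with vectors such as `en_core_web_md` is
installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The cat sat on the mat")

print(doc[1].vector.shape)    # vector for the token "cat"
print(doc[1:4].vector.shape)  # average of the vectors for "cat sat on"
print(doc.vector.shape)       # average over the whole document
# Similarity computed from the averaged vectors
print(doc.similarity(nlp("The dog lay on the rug")))
```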

The `vector` attribute is a read-only numpy or cupy array (depending on whether
you've configured spaCy to use GPU memory), with dtype `float32`. The array is
read-only so that spaCy can avoid unnecessary copy operations where possible.
You can modify the vectors via the `Vocab` or `Vectors` table.

### Converting word vectors for use in spaCy {#converting}

Custom word vectors can be trained using a number of open-source libraries,
such as [Gensim](https://radimrehurek.com/gensim) or
[Fast Text](https://fasttext.cc), and then converted into a format spaCy can
load. When converting, you can also prune the vectors table: pruning to 10,000
rows, for example, will create a spaCy model with vectors for the first 10,000
words in the vectors model. All other words in the vectors model are mapped to
the closest vector among those retained.
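
If you're working from Python, [`Vocab.prune_vectors`](/api/vocab#prune_vectors)
performs a similar kind of pruning on a loaded pipeline. A rough sketch,
assuming a package with a reasonably large vectors table is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes a package with vectors

# Keep 10,000 rows; every other key is remapped to its closest remaining vector
remapped = nlp.vocab.prune_vectors(10000)
# remapped maps each removed word to the kept word it now points to, plus a score
print(list(remapped.items())[:3])
```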

### Adding vectors {#custom-vectors-add}

spaCy's [`Vectors`](/api/vectors) class greatly improves the way word vectors
are stored, accessed and used. The data is stored in two structures:

- An array, which can be either on CPU or [GPU](#gpu).
- A dictionary mapping string-hashes to rows in the table.

Keep in mind that the `Vectors` class itself has no
[`StringStore`](/api/stringstore), so you have to store the hash-to-string
mapping separately. If you need to manage the strings, you should use the
`Vectors` via the [`Vocab`](/api/vocab) class, e.g. `vocab.vectors`. To add
vectors to the vocabulary, you can use the
[`Vocab.set_vector`](/api/vocab#set_vector) method.
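
As a sketch of the API described above (using made-up data), you can build up a
vocab's vectors with [`Vocab.set_vector`](/api/vocab#set_vector):

```python
### Adding vectors
import numpy
from spacy.vocab import Vocab

# Made-up 300-dimensional vectors for illustration
vector_data = {
    "dog": numpy.random.uniform(-1, 1, (300,)),
    "cat": numpy.random.uniform(-1, 1, (300,)),
    "orange": numpy.random.uniform(-1, 1, (300,)),
}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```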

### Storing vectors on a GPU {#gpu}

If you're using a GPU, it's much more efficient to keep the word vectors on the
device. You can do that by setting the [`Vectors.data`](/api/vectors#attributes)
attribute to a `cupy.ndarray` object if you're using spaCy or
[Chainer](https://chainer.org), or a `torch.Tensor` object if you're using
[PyTorch](http://pytorch.org). The `data` object just needs to support
`__iter__` and `__getitem__`, so if you're using another library such as
[TensorFlow](https://www.tensorflow.org), you could also create a wrapper for
your vectors data.

```python
### spaCy, Thinc or Chainer
import cupy.cuda
import numpy
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors(data=vector_table, keys=["dog", "cat", "orange"])
with cupy.cuda.Device(0):
    vectors.data = cupy.asarray(vectors.data)
```

```python
### PyTorch
import numpy
import torch
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors(data=vector_table, keys=["dog", "cat", "orange"])
vectors.data = torch.Tensor(vectors.data).cuda(0)
```

## Other embeddings {#embeddings}

<!-- TODO: explain spacy-transformers, doc.tensor, tok2vec? -->

<!-- TODO: mention sense2vec somewhere? -->