Update docs [ci skip]

Ines Montani 2020-07-29 19:09:44 +02:00
parent 7a21775cd0
commit 9f69afdd1e
3 changed files with 69 additions and 58 deletions


@ -394,14 +394,13 @@ Split a `TransformerData` object that represents a batch into a list with one
## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a list of [`Span`](/api/span) objects for each doc, to be processed by
the transformer. The returned spans can overlap. Span getters can be referenced
in the config's `[components.transformer.model.get_spans]` block to customize
the sequences processed by the transformer. You can also register custom span
getters using the `@registry.span_getters` decorator, as sketched below.

> #### Example
@ -415,10 +414,10 @@ getters using the `@registry.span_getters` decorator.
>     return get_sent_spans
> ```

| Name        | Type               | Description                              |
| ----------- | ------------------ | ---------------------------------------- |
| `docs`      | `Iterable[Doc]`    | A batch of `Doc` objects.                |
| **RETURNS** | `List[List[Span]]` | The spans to process by the transformer. |
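
As an illustration, a custom span getter that passes one span per sentence to
the transformer could be registered roughly as follows. This is a minimal
sketch, not code from the library: the name `"custom_sent_spans"` is made up,
and it assumes the `span_getters` registry is exposed as
`spacy_transformers.registry` and that sentence boundaries are set on the docs.

```python
from typing import Callable, Iterable, List
from spacy.tokens import Doc, Span
import spacy_transformers  # assumption: exposes the span_getters registry

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans() -> Callable:
    # The registered function returns the actual getter, so configuration
    # options could be added as arguments here and resolved from the config.
    def get_custom_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]:
        # One list of spans per Doc; requires sentence boundaries to be set
        return [list(doc.sents) for doc in docs]

    return get_custom_sent_spans
```

The registered name can then be referenced from the config's
`[components.transformer.model.get_spans]` block.
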
The following built-in functions are available:


@ -5,6 +5,7 @@ menu:
- ['Installation', 'install']
- ['Runtime Usage', 'runtime']
- ['Training Usage', 'training']
next: /usage/training
---

## Installation {#install hidden="true"}


@ -1,34 +1,35 @@
---
title: Vectors and Embeddings
menu:
- ["What's a Word Vector?", 'whats-a-vector']
- ['Word Vectors', 'vectors']
- ['Other Embeddings', 'embeddings']
next: /usage/transformers
---

An old idea in linguistics is that you can "know a word by the company it
keeps": that is, word meanings can be understood relationally, based on their
patterns of usage. This idea inspired a branch of NLP research known as
"distributional semantics" that has aimed to compute databases of lexical
knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
family of algorithms is a key milestone in this line of research. For
simplicity, we will refer to a distributional word representation as a "word
vector", and to algorithms that compute word vectors (such as
[GloVe](https://nlp.stanford.edu/projects/glove/),
[FastText](https://fasttext.cc), etc.) as "Word2vec algorithms".

Word vector tables are included in some of the spaCy [model packages](/models)
we distribute, and you can easily create your own model packages with word
vectors you train or download yourself. In some cases you can also add word
vectors to an existing pipeline, although each pipeline can only have a single
word vectors table, and a model package that already has word vectors is
unlikely to work correctly if you replace the vectors with new ones.

## What's a word vector? {#whats-a-vector}

For spaCy's purposes, a "word vector" is a 1-dimensional slice from a
2-dimensional **vectors table**, with a deterministic mapping from word types to
rows in the table.

```python
def what_is_a_word_vector(
@ -41,51 +42,55 @@ def what_is_a_word_vector(
    return vectors_table[key2row.get(word_id, default_row)]
```

Word2vec algorithms try to produce vectors tables that let you estimate useful
relationships between words using simple linear algebra operations. For
instance, you can often find close synonyms of a word by finding the vectors
closest to it by cosine distance, and then finding the words that are mapped to
those neighboring vectors. Word vectors can also be useful as features in
statistical models.

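To make the "nearest neighbors by cosine distance" idea concrete, here is a
rough sketch using plain numpy on the vectors table. It assumes a model package
with word vectors (e.g. `en_core_web_md`) is installed; the query word
`"coast"` is just an example.

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")  # assumption: a package with word vectors
query = nlp.vocab["coast"].vector

vectors = nlp.vocab.vectors.data  # the 2D vectors table
# Cosine similarity between the query and every row of the table
norms = numpy.linalg.norm(vectors, axis=1) * numpy.linalg.norm(query)
norms[norms == 0] = 1e-8  # guard against all-zero rows
scores = (vectors @ query) / norms
best_rows = scores.argsort()[::-1][:10]

# Map rows back to words; several keys can share the same row
row2key = {row: key for key, row in nlp.vocab.vectors.key2row.items()}
print([nlp.vocab.strings[row2key[row]] for row in best_rows if row in row2key])
```
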
### Word vectors vs. contextual language models {#vectors-vs-language-models}
The key difference between word vectors and contextual language models such as
ELMo, BERT and GPT-2 is that word vectors model **lexical types**, rather than
_tokens_. If you have a list of terms with no context around them, a model like
BERT can't really help you. BERT is designed to understand language **in
context**, which isn't what you have. A word vectors table will be a much better
fit for your task. However, if you do have words in context (whole sentences or
paragraphs of running text), word vectors will only provide a very rough
approximation of what the text is about.

Word vectors are also very computationally efficient, as they map a word to a
vector with a single indexing operation. Word vectors are therefore useful as a
way to **improve the accuracy** of neural network models, especially models that
are small or have received little or no pretraining. In spaCy, word vector
tables are only used as **static features**. spaCy does not backpropagate
gradients to the pretrained word vectors table. The static vectors table is
usually used in combination with a smaller table of learned task-specific
embeddings.

## Using word vectors directly {#vectors}

spaCy stores word vector information in the
[`Vocab.vectors`](/api/vocab#attributes) attribute, so you can access the whole
vectors table from most spaCy objects. You can also access the vector for a
[`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) or
[`Lexeme`](/api/lexeme) instance via the `vector` attribute. If your `Doc` or
`Span` has multiple tokens, the average of the word vectors will be returned,
excluding any "out of vocabulary" entries that have no vector available. If none
of the words have a vector, a zeroed vector will be returned.

The `vector` attribute is a **read-only** numpy or cupy array (depending on
whether you've configured spaCy to use GPU memory), with dtype `float32`. The
array is read-only so that spaCy can avoid unnecessary copy operations where
possible. You can modify the vectors via the `Vocab` or `Vectors` table.

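A minimal sketch of modifying the table through the vocab, assuming a pipeline
with 300-dimensional vectors; the entry name `"my_new_word"` is made up:

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")  # assumption: 300-dimensional vectors
new_vector = numpy.random.uniform(-1, 1, (300,)).astype("float32")

# Writing goes through the vocab, not through the read-only `vector` attribute
nlp.vocab.set_vector("my_new_word", new_vector)
assert nlp.vocab["my_new_word"].has_vector
```
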
### Converting word vectors for use in spaCy

Custom word vectors can be trained using a number of open-source libraries, such
as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
or Tomas Mikolov's original
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
word vector libraries output an easy-to-read text-based format, where each line
consists of the word followed by its vector. For everyday use, we want to
convert the vectors model into a binary format that loads faster and takes up
@ -165,11 +170,10 @@ the two words.
In the example above, the vector for "Shore" was removed and remapped to the
vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to
the vector of "leaving", which is identical. If you're using the
[`init-model`](/api/cli#init-model) command, you can set the `--prune-vectors`
option to easily reduce the size of the vectors as you add them to a spaCy
model:

```bash
$ python -m spacy init-model /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
```
@ -179,7 +183,7 @@ This will create a spaCy model with vectors for the first 10,000 words in the
vectors model. All other words in the vectors model are mapped to the closest
vector among those retained.

### Adding vectors {#adding-vectors}

```python
### Adding vectors
```
@ -209,5 +213,12 @@ For more details on **adding hooks** and **overwriting** the built-in `Doc`,
</Infobox>

<!-- TODO:
### Storing vectors on a GPU {#gpu}
-->
## Other embeddings {#embeddings}
<!-- TODO: something about other embeddings -->