💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 21:31:19 +03:00
---
title: Vectors and Embeddings
menu:
  - ["What's a Word Vector?", 'whats-a-vector']
  - ['Word Vectors', 'vectors']
  - ['Other Embeddings', 'embeddings']
next: /usage/transformers
---

An old idea in linguistics is that you can "know a word by the company it
keeps": that is, word meanings can be understood relationally, based on their
patterns of usage. This idea inspired a branch of NLP research known as
"distributional semantics" that has aimed to compute databases of lexical
knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
family of algorithms is a key milestone in this line of research. For
simplicity, we will refer to a distributional word representation as a "word
vector", and to algorithms that compute word vectors (such as
[GloVe](https://nlp.stanford.edu/projects/glove/),
[FastText](https://fasttext.cc), etc.) as "Word2vec algorithms".

Word vector tables are included in some of the spaCy [model packages](/models)
we distribute, and you can easily create your own model packages with word
vectors you train or download yourself. In some cases you can also add word
vectors to an existing pipeline, although each pipeline can only have a single
word vectors table, and a model package that already has word vectors is
unlikely to work correctly if you replace the vectors with new ones.

## What's a word vector? {#whats-a-vector}

For spaCy's purposes, a "word vector" is a 1-dimensional slice from a
2-dimensional **vectors table**, with a deterministic mapping from word types to
rows in the table.

```python
from typing import Dict

from thinc.types import Floats1d, Floats2d


def what_is_a_word_vector(
    word_id: int,
    key2row: Dict[int, int],
    vectors_table: Floats2d,
    *,
    default_row: int = 0
) -> Floats1d:
    # Look up the table row for this word ID, falling back to default_row
    return vectors_table[key2row.get(word_id, default_row)]
```

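As a rough usage sketch, the same lookup can be run against a toy table. The word IDs and values below are made up for illustration, the helper name `lookup` is hypothetical, and plain numpy arrays stand in for the typed arrays in the signature above.

```python
import numpy as np

# Hypothetical word IDs, chosen only to keep the example readable.
key2row = {101: 1, 102: 2, 103: 2}  # two keys may share one row
vectors_table = np.asarray(
    [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]], dtype="float32"
)


def lookup(word_id, default_row=0):
    # One dictionary lookup plus one row index; unknown IDs use default_row.
    return vectors_table[key2row.get(word_id, default_row)]


known = lookup(101)    # row 1
shared = lookup(103)   # row 2, the same row as ID 102
unknown = lookup(999)  # not in key2row, so the default row is returned
```

Note that an unknown ID silently receives whatever is stored in the default row, so that row is best reserved for a neutral value such as zeros.
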
Word2vec algorithms try to produce vectors tables that let you estimate useful
relationships between words using simple linear algebra operations. For
instance, you can often find close synonyms of a word by finding the vectors
closest to it by cosine distance, and then finding the words that are mapped to
those neighboring vectors. Word vectors can also be useful as features in
statistical models.

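The nearest-neighbor lookup described above can be sketched in a few lines of numpy. This is an illustration only: the three-word table uses hand-picked values rather than real embeddings, and `most_similar` is a hypothetical helper, not a spaCy function.

```python
import numpy as np


def most_similar(word, words, table, n=1):
    """Return the n words whose vectors are closest to `word` by cosine similarity."""
    # Normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(table, axis=1, keepdims=True)
    unit = table / np.clip(norms, 1e-8, None)
    scores = unit @ unit[words.index(word)]
    # Sort by descending similarity and skip the query word itself.
    order = [int(i) for i in np.argsort(-scores) if words[int(i)] != word]
    return [(words[i], float(scores[i])) for i in order[:n]]


# Toy table with made-up values
words = ["coast", "shore", "banana"]
table = np.asarray(
    [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]], dtype="float32"
)
neighbors = most_similar("coast", words, table, n=1)
```

With a real vectors table the same idea applies, although comparing against every row becomes expensive for large vocabularies.
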
### Word vectors vs. contextual language models {#vectors-vs-language-models}

The key difference between word vectors and contextual language models such as
ELMo, BERT and GPT-2 is that word vectors model **lexical types**, rather than
_tokens_. If you have a list of terms with no context around them, a model like
BERT can't really help you. BERT is designed to understand language **in
context**, which isn't what you have. A word vectors table will be a much better
fit for your task. However, if you do have words in context (whole sentences or
paragraphs of running text), word vectors will only provide a very rough
approximation of what the text is about.

Word vectors are also very computationally efficient, as they map a word to a
vector with a single indexing operation. Word vectors are therefore useful as a
way to **improve the accuracy** of neural network models, especially models that
are small or have received little or no pretraining. In spaCy, word vector
tables are only used as **static features**. spaCy does not backpropagate
gradients to the pretrained word vectors table. The static vectors table is
usually used in combination with a smaller table of learned task-specific
embeddings.

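To make the idea of static features concrete, here is a minimal numpy sketch of a frozen table combined with a small learned table. It is an assumption-laden illustration, not spaCy's actual architecture: the concatenation, the table sizes, and the names `embed` and `apply_gradient` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
static_table = rng.normal(size=(4, 6)).astype("float32")  # pretrained, frozen
learned_table = np.zeros((4, 2), dtype="float32")  # small, trained per task


def embed(row):
    # Features for one word: the static vector followed by the learned one.
    return np.concatenate([static_table[row], learned_table[row]])


def apply_gradient(row, d_features, learn_rate=0.1):
    # Only the learned slice is updated; the static slice's gradient is dropped.
    d_learned = d_features[static_table.shape[1]:]
    learned_table[row] -= learn_rate * d_learned


before = static_table.copy()
apply_gradient(2, np.ones(8, dtype="float32"))
features = embed(2)
```

After the update the static table is unchanged, while the learned row has moved, which is the behavior the paragraph above describes.
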
## Using word vectors directly {#vectors}

spaCy stores word vector information in the
[`Vocab.vectors`](/api/vocab#attributes) attribute, so you can access the whole
vectors table from most spaCy objects. You can also access the vector for a
[`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) or
[`Lexeme`](/api/lexeme) instance via the `vector` attribute. If your `Doc` or
`Span` has multiple tokens, the average of the word vectors will be returned,
excluding any "out of vocabulary" entries that have no vector available. If none
of the words have a vector, a zeroed vector will be returned.

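The averaging rule in the paragraph above can be written out as a small sketch. It mirrors the description only and is not spaCy's implementation; the `average_vector` helper and the two-dimensional toy vectors are hypothetical.

```python
import numpy as np


def average_vector(words, vectors, width):
    # Collect vectors for words that have one; skip out-of-vocabulary words.
    found = [vectors[word] for word in words if word in vectors]
    if not found:
        # No word has a vector, so fall back to a zeroed vector.
        return np.zeros((width,), dtype="float32")
    return np.mean(found, axis=0).astype("float32")


vectors = {
    "dog": np.asarray([1.0, 0.0], dtype="float32"),
    "cat": np.asarray([0.0, 1.0], dtype="float32"),
}
mixed = average_vector(["dog", "cat", "zzyzx"], vectors, width=2)
empty = average_vector(["zzyzx"], vectors, width=2)
```

Here `mixed` averages only the two known words, and `empty` is all zeros because none of its words has a vector.
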
The `vector` attribute is a **read-only** numpy or cupy array (depending on
whether you've configured spaCy to use GPU memory), with dtype `float32`. The
array is read-only so that spaCy can avoid unnecessary copy operations where
possible. You can modify the vectors via the `Vocab` or `Vectors` table.

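The trade-off between read-only views and copies can be illustrated with plain numpy. This sketch assumes nothing about how spaCy sets the flag internally; it only shows why a read-only view lets a table hand out rows without copying them.

```python
import numpy as np

table = np.zeros((3, 4), dtype="float32")
row_view = table[1]  # a view into the table, no copy is made
row_view.setflags(write=False)  # mark this view as read-only

try:
    row_view[0] = 1.0
    write_failed = False
except ValueError:
    # numpy refuses to assign into a read-only array
    write_failed = True

# Changes go through the owning table instead, and the view reflects them.
table[1, 0] = 1.0
```

If you need a writable array, calling `.copy()` on the view gives you an independent one that no longer shares memory with the table.
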
### Converting word vectors for use in spaCy

Custom word vectors can be trained using a number of open-source libraries, such
as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
or Tomas Mikolov's original
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
word vector libraries output an easy-to-read text-based format, where each line
consists of the word followed by its vector. For everyday use, we want to
convert the vectors model into a binary format that loads faster and takes up
less space on disk. The easiest way to do this is the
[`init-model`](/api/cli#init-model) command-line utility:

```bash
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
```

This will output a spaCy model in the directory `/tmp/la_vectors_wiki_lg`,
giving you access to some nice Latin vectors 😉 You can then pass the directory
path to [`spacy.load()`](/api/top-level#spacy.load).

```python
import spacy

nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
doc1 = nlp_latin("Caecilius est in horto")
doc2 = nlp_latin("servus est in atrio")
# Similarity score based on the averaged word vectors of each Doc
print(doc1.similarity(doc2))
```

The model directory will have a `/vocab` directory with the strings, lexical
entries and word vectors from the input vectors model. The
[`init-model`](/api/cli#init-model) command supports a number of archive formats
for the word vectors: the vectors can be in plain text (`.txt`), zipped
(`.zip`), or tarred and zipped (`.tgz`).

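The text-based format mentioned above is simple enough to read by hand, which can be handy for inspecting a file before converting it. The sketch below assumes space-separated fields and an optional `<rows> <width>` header line, as written by some tools; `read_text_vectors` is a hypothetical helper and the two sample entries are made up.

```python
import io


def read_text_vectors(file_):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for i, line in enumerate(file_):
        pieces = line.rstrip().split(" ")
        if i == 0 and len(pieces) == 2:
            continue  # assumed header line: "<rows> <width>"
        if len(pieces) < 2:
            continue  # skip blank or malformed lines
        vectors[pieces[0]] = [float(value) for value in pieces[1:]]
    return vectors


# Two made-up entries standing in for a real vectors file
sample = io.StringIO("2 3\net 0.1 0.2 0.3\nin -0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
```

For real files you would pass an open file handle instead of the `io.StringIO` object, for example one returned by `gzip.open(path, "rt")` for a `.gz` archive.
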
### Optimizing vector coverage {#custom-vectors-coverage new="2"}

To help you strike a good balance between coverage and memory usage, spaCy's
[`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same
row** of the table. If you're using the
[`spacy init-model`](/api/cli#init-model) command to create a vocabulary,
pruning the vectors will be taken care of automatically if you set the
`--prune-vectors` flag. You can also do it manually in the following steps:

1. Start with a **word vectors model** that covers a huge vocabulary. For
   instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg)
   model provides 300-dimensional GloVe vectors for over 1 million terms of
   English.
2. If your vocabulary has values set for the `Lexeme.prob` attribute, the
   lexemes will be sorted by descending probability to determine which vectors
   to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`.
3. Call [`Vocab.prune_vectors`](/api/vocab#prune_vectors) with the number of
   vectors you want to keep.

```python
import spacy

nlp = spacy.load("en_vectors_web_lg")
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)

assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
```

[`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
table to a given number of unique entries, and returns a dictionary containing
the removed words, mapped to `(string, score)` tuples, where `string` is the
entry the removed word was mapped to, and `score` the similarity score between
the two words.

```python
### Removed words
{
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}
```

In the example above, the vector for "Shore" was removed and remapped to the
vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to
the vector of "leaving", which is identical. If you're using the
[`init-model`](/api/cli#init-model) command, you can set the `--prune-vectors`
option to easily reduce the size of the vectors as you add them to a spaCy
model:

```bash
python -m spacy init-model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
```

This will create a spaCy model with vectors for the first 10,000 words in the
vectors model. All other words in the vectors model are mapped to the closest
vector among those retained.

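The pruning idea itself (keep the first rows, remap every other key to its closest retained row) can be sketched with numpy. This is a simplified illustration under stated assumptions, not the code behind `Vocab.prune_vectors`: the `prune` helper is hypothetical, the toy vectors are hand-picked, and no row is assumed to be all zeros.

```python
import numpy as np


def prune(words, table, n_keep):
    """Keep the first n_keep rows; remap each other word to its nearest kept row."""
    # Assumes no all-zero rows, so the normalization never divides by zero.
    unit = table / np.linalg.norm(table, axis=1, keepdims=True)
    kept = unit[:n_keep]
    key2row = {word: i for i, word in enumerate(words[:n_keep])}
    remapped = {}
    for i in range(n_keep, len(words)):
        scores = kept @ unit[i]  # cosine similarity to each kept row
        best = int(np.argmax(scores))
        key2row[words[i]] = best  # several keys now share one row
        remapped[words[i]] = (words[best], float(scores[best]))
    return key2row, table[:n_keep], remapped


words = ["coast", "leaving", "Shore", "Leaving"]
table = np.asarray(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]], dtype="float32"
)
key2row, pruned_table, remapped = prune(words, table, n_keep=2)
```

The `remapped` dictionary uses the same `(string, score)` shape as the removed-words output shown earlier, so entries with a low score are easy to spot and review.
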
### Adding vectors {#adding-vectors}

```python
### Adding vectors
import numpy
from spacy.vocab import Vocab

# Random placeholder vectors; replace these with your own trained vectors
vector_data = {"dog": numpy.random.uniform(-1, 1, (300,)),
               "cat": numpy.random.uniform(-1, 1, (300,)),
               "orange": numpy.random.uniform(-1, 1, (300,))}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```

### Using custom similarity methods {#custom-similarity}

By default, [`Token.vector`](/api/token#vector) returns the vector for its
underlying [`Lexeme`](/api/lexeme), while [`Doc.vector`](/api/doc#vector) and
[`Span.vector`](/api/span#vector) return an average of the vectors of their
tokens. You can customize these behaviors by modifying the `doc.user_hooks`,
`doc.user_span_hooks` and `doc.user_token_hooks` dictionaries.

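As a minimal sketch of such a hook, the example below replaces `Doc.similarity` with a word-overlap score. The `jaccard_similarity` function is a hypothetical stand-in for whatever measure you prefer, and the `Doc` objects are built by hand from a blank `Vocab` so that no model package is needed.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def jaccard_similarity(doc, other):
    # Word-overlap score: shared lowercase words divided by all distinct words.
    a = {token.text.lower() for token in doc}
    b = {token.text.lower() for token in other}
    return len(a & b) / len(a | b) if (a | b) else 0.0


vocab = Vocab()
doc1 = Doc(vocab, words=["I", "like", "cats"])
doc2 = Doc(vocab, words=["I", "like", "dogs"])

# Register the hook on this Doc; doc1.similarity() now calls our function.
doc1.user_hooks["similarity"] = jaccard_similarity
score = doc1.similarity(doc2)
```

Because hooks are stored on the individual `Doc`, a pipeline would typically set them inside a custom component so that every processed document receives them.
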
<Infobox title="Custom user hooks" emoji="📖">

For more details on **adding hooks** and **overwriting** the built-in `Doc`,
`Span` and `Token` methods, see the usage guide on
[user hooks](/usage/processing-pipelines#custom-components-user-hooks).

</Infobox>

<!-- TODO:

### Storing vectors on a GPU {#gpu}

-->

## Other embeddings {#embeddings}

<!-- TODO: something about other embeddings -->