mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-30 20:06:30 +03:00
e597110d31
<!--- Provide a general summary of your changes in the title. --> ## Description The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on. This PR also includes various new docs pages and content. Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837. ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
103 lines
4.3 KiB
Markdown
103 lines
4.3 KiB
Markdown
import Infobox from 'components/infobox'
|
||
|
||
Similarity is determined by comparing **word vectors** or "word embeddings",
|
||
multi-dimensional meaning representations of a word. Word vectors can be
|
||
generated using an algorithm like
|
||
[word2vec](https://en.wikipedia.org/wiki/Word2vec) and usually look like this:
|
||
|
||
```python
|
||
### banana.vector
|
||
array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01,
|
||
3.28450017e-02, -4.19569999e-01, 7.20689967e-02,
|
||
-3.74760002e-01, 5.74599989e-02, -1.24009997e-02,
|
||
5.29489994e-01, -5.23800015e-01, -1.97710007e-01,
|
||
-3.41470003e-01, 5.33169985e-01, -2.53309999e-02,
|
||
1.73800007e-01, 1.67720005e-01, 8.39839995e-01,
|
||
5.51070012e-02, 1.05470002e-01, 3.78719985e-01,
|
||
2.42750004e-01, 1.47449998e-02, 5.59509993e-01,
|
||
1.25210002e-01, -6.75960004e-01, 3.58420014e-01,
|
||
# ... and so on ...
|
||
3.66849989e-01, 2.52470002e-03, -6.40089989e-01,
|
||
-2.97650009e-01, 7.89430022e-01, 3.31680000e-01,
|
||
-1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)
|
||
```
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
To make them compact and fast, spaCy's small [models](/models) (all packages
|
||
that end in `sm`) **don't ship with word vectors**, and only include
|
||
context-sensitive **tensors**. This means you can still use the `similarity()`
|
||
methods to compare documents, spans and tokens – but the result won't be as
|
||
good, and individual tokens won't have any vectors assigned. So in order to use
|
||
_real_ word vectors, you need to download a larger model:
|
||
|
||
```diff
|
||
- python -m spacy download en_core_web_sm
|
||
+ python -m spacy download en_core_web_lg
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
Models that come with built-in word vectors make them available as the
|
||
[`Token.vector`](/api/token#vector) attribute. [`Doc.vector`](/api/doc#vector)
|
||
and [`Span.vector`](/api/span#vector) will default to an average of their token
|
||
vectors. You can also check if a token has a vector assigned, and get the L2
|
||
norm, which can be used to normalize vectors.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load('en_core_web_md')
|
||
tokens = nlp(u'dog cat banana afskfsd')
|
||
|
||
for token in tokens:
|
||
print(token.text, token.has_vector, token.vector_norm, token.is_oov)
|
||
```
|
||
|
||
> - **Text**: The original token text.
|
||
> - **has vector**: Does the token have a vector representation?
|
||
> - **Vector norm**: The L2 norm of the token's vector (the square root of the
|
||
> sum of the values squared)
|
||
> - **OOV**: Out-of-vocabulary
|
||
|
||
The words "dog", "cat" and "banana" are all pretty common in English, so they're
|
||
part of the model's vocabulary, and come with a vector. The word "afskfsd" on
|
||
the other hand is a lot less common and out-of-vocabulary – so its vector
|
||
representation consists of 300 dimensions of `0`, which means it's practically
|
||
nonexistent. If your application will benefit from a **large vocabulary** with
|
||
more vectors, you should consider using one of the larger models or loading in a
|
||
full vector package, for example,
|
||
[`en_vectors_web_lg`](/models/en#en_vectors_web_lg), which includes over **1
|
||
million unique vectors**.
|
||
|
||
spaCy is able to compare two objects, and make a prediction of **how similar
|
||
they are**. Predicting similarity is useful for building recommendation systems
|
||
or flagging duplicates. For example, you can suggest a user content that's
|
||
similar to what they're currently looking at, or label a support ticket as a
|
||
duplicate if it's very similar to an already existing one.
|
||
|
||
Each `Doc`, `Span` and `Token` comes with a
|
||
[`.similarity()`](/api/token#similarity) method that lets you compare it with
|
||
another object, and determine the similarity. Of course similarity is always
|
||
subjective – whether "dog" and "cat" are similar really depends on how you're
|
||
looking at it. spaCy's similarity model usually assumes a pretty general-purpose
|
||
definition of similarity.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
|
||
tokens = nlp(u'dog cat banana')
|
||
|
||
for token1 in tokens:
|
||
for token2 in tokens:
|
||
print(token1.text, token2.text, token1.similarity(token2))
|
||
```
|
||
|
||
In this case, the model's predictions are pretty on point. A dog is very similar
|
||
to a cat, whereas a banana is not very similar to either of them. Identical
|
||
tokens are obviously 100% similar to each other (just not always exactly `1.0`,
|
||
because of vector math and floating point imprecisions).
|