mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-04 21:50:35 +03:00
Update vectors and similarity docs [ci skip]
This commit is contained in:
parent
15e6feed01
commit
2253d26b82
BIN
website/docs/images/sense2vec.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 224 KiB
@@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one.

Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and
[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity)
method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective – whether two words, spans
or documents are similar really depends on how you're looking at it. spaCy's
similarity model usually assumes a pretty general-purpose definition of
similarity.

<!-- TODO: use better example here -->

> #### 📝 Things to try
>
> 1. Compare two different tokens and try to find the two most _dissimilar_
>    tokens in the texts with the lowest similarity score (according to the
>    vectors).
> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in
>    the vocabulary. You can get a lexeme via the `.lex` attribute of a token.
>    You should see that the similarity results are identical to the token
>    similarity.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```
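The first exercise above boils down to scoring every token pair and keeping the pair with the minimum score. As a rough sketch of that loop – using made-up three-dimensional vectors standing in for real `token.vector` values, so it runs without a trained pipeline:

```python
import math

# Made-up vectors standing in for spaCy token vectors (hypothetical values)
tokens = {
    "dog": [0.8, 0.6, 0.1],
    "cat": [0.7, 0.7, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity, the same measure spaCy's .similarity uses by default
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Score every unordered pair and keep the most dissimilar one
pairs = [(w1, w2) for w1 in tokens for w2 in tokens if w1 < w2]
most_dissimilar = min(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
print(most_dissimilar)  # ('banana', 'dog') with these made-up vectors
```

With a real pipeline you'd iterate over `doc` and call `token1.similarity(token2)` instead of the hand-rolled `cosine`.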
### What to expect from similarity results {#similarity-expectations}

Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose. Here are some important considerations to keep in mind:

- There's no objective definition of similarity. Whether "I like burgers" and
  "I like pasta" are similar **depends on your application**. Both talk about
  food preferences, which makes them very similar – but if you're analyzing
  mentions of food, those sentences are pretty dissimilar, because they talk
  about very different foods.
- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults
  to the **average** of the token vectors. This means that the vector for "fast
  food" is the average of the vectors for "fast" and "food", which isn't
  necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is **insensitive to
  the order** of the words. Two documents expressing the same meaning with
  dissimilar wording will return a lower similarity score than two documents
  that happen to contain the same words while expressing different meanings.
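To make the averaging point concrete, here's a tiny sketch in plain Python – with made-up vectors rather than real spaCy vectors – showing that the mean of token vectors doesn't change when the word order does:

```python
def average_vector(vecs):
    # A spaCy-style Doc/Span vector: the elementwise mean of the token vectors
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Made-up vectors standing in for the tokens "fast" and "food"
fast = [0.1, 0.9, 0.3]
food = [0.8, 0.2, 0.5]

# "fast food" and "food fast" average to exactly the same vector
print(average_vector([fast, food]) == average_vector([food, fast]))  # True
```

This is why two documents with the same words in a different order can score as highly similar even when they mean different things.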

<Infobox title="Tip: Check out sense2vec" emoji="💡">

[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec)

[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by
us that builds on top of spaCy and lets you train and query more interesting and
detailed word vectors. It combines noun phrases like "fast food" or "fair game"
and includes the part-of-speech tags and entity labels. The library also
includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy)
that let you evaluate vector models and create terminology lists. For more
details, check out
[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the
semantic similarities across all Reddit comments of 2015 and 2019, see the
[interactive demo](https://explosion.ai/demos/sense2vec).

</Infobox>
@@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

### Adding word vectors {#adding-vectors}

Custom word vectors can be trained using a number of open-source libraries, such
@@ -6,7 +6,7 @@ import classNames from 'classnames'

import Icon from './icon'
import classes from '../styles/link.module.sass'
import { isString, isImage } from './util'

const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
@@ -39,7 +39,7 @@ export default function Link({

    const dest = to || href
    const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
    const icon = getIcon(dest)
    const withIcon = !hidden && !hideIcon && !!icon && !isImage(children)
    const sourceWithText = withIcon && isString(children)
    const linkClassNames = classNames(classes.root, className, {
        [classes.hidden]: hidden,
@@ -46,6 +46,17 @@ export function isString(obj) {
    return typeof obj === 'string' || obj instanceof String
}

/**
 * @param obj - The object to check.
 * @returns {boolean} - Whether the object is an image.
 */
export function isImage(obj) {
    if (!obj || !React.isValidElement(obj)) {
        return false
    }
    return obj.props.name == 'img' || obj.props.className == 'gatsby-resp-image-wrapper'
}

/**
 * @param obj - The object to check.
 * @returns {boolean} - Whether the object is empty.