Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-26 17:24:41 +03:00
Update vectors and similarity docs [ci skip]
commit 2253d26b82 (parent 15e6feed01)
BIN  website/docs/images/sense2vec.jpg  (new binary file, 224 KiB, not shown)
@@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one.
Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and
[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity)
method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective – whether "dog" and "cat"
are similar really depends on how you're looking at it. spaCy's similarity model
usually assumes a pretty general-purpose definition of similarity.
similarity. Of course similarity is always subjective – whether two words, spans
or documents are similar really depends on how you're looking at it. spaCy's
similarity model usually assumes a pretty general-purpose definition of
similarity.

<!-- TODO: use better example here -->

> #### 📝 Things to try
>
> 1. Compare two different tokens and try to find the two most _dissimilar_
>    tokens in the texts with the lowest similarity score (according to the
>    vectors).
> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in
>    the vocabulary. You can get a lexeme via the `.lex` attribute of a token.
>    You should see that the similarity results are identical to the token
>    similarity.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
tokens = nlp("dog cat banana")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```

In this case, the model's predictions are pretty on point. A dog is very similar
to a cat, whereas a banana is not very similar to either of them. Identical
tokens are obviously 100% similar to each other (just not always exactly `1.0`,
because of vector math and floating point imprecisions).
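
For example, the "Things to try" suggestions above could look roughly like the
minimal sketch below. It assumes the same `en_core_web_md` package as the
example and uses the `.lex` attribute to get at the underlying vocabulary
entries:

```python
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana")

# 1. Find the pair of distinct tokens with the lowest similarity score
pairs = [(t1, t2) for t1 in tokens for t2 in tokens if t1.i < t2.i]
t1, t2 = min(pairs, key=lambda pair: pair[0].similarity(pair[1]))
print("Most dissimilar pair:", t1.text, t2.text, t1.similarity(t2))

# 2. Lexemes are context-independent vocabulary entries, but they're compared
#    via the same word vectors, so the scores match the token similarity
dog, cat = tokens[0], tokens[1]
print(dog.lex.similarity(cat.lex))  # Lexeme.similarity
print(dog.similarity(cat))          # Token.similarity
```
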
### What to expect from similarity results {#similarity-expectations}

Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose. Here are some important considerations to keep in mind:

- There's no objective definition of similarity. Whether "I like burgers" and "I
  like pasta" are similar **depends on your application**. Both talk about food
  preferences, which makes them very similar – but if you're analyzing mentions
  of food, those sentences are pretty dissimilar, because they talk about very
  different foods.
- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults
  to the **average** of the token vectors. This means that the vector for "fast
  food" is the average of the vectors for "fast" and "food", which isn't
  necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is **insensitive to
  the order** of the words. Two documents expressing the same meaning with
  dissimilar wording will return a lower similarity score than two documents
  that happen to contain the same words while expressing different meanings
  (see the sketch below this list).
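
The last two points are easy to check yourself. A minimal sketch, assuming the
same `en_core_web_md` package as above and `numpy` (which spaCy already
requires):

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")

# Same words, different order and meaning: the averaged vectors are identical,
# so the similarity score comes out at (or very close to) 1.0
doc1 = nlp("dog bites man")
doc2 = nlp("man bites dog")
print(doc1.similarity(doc2))

# The default Doc vector is just the average of the token vectors
mean_vector = numpy.mean([token.vector for token in doc1], axis=0)
print(numpy.allclose(doc1.vector, mean_vector))
```
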
<Infobox title="Tip: Check out sense2vec" emoji="💡">
|
||||
|
||||
[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec)
|
||||
|
||||
[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by
|
||||
us that builds on top of spaCy and lets you train and query more interesting and
|
||||
detailed word vectors. It combines noun phrases like "fast food" or "fair game"
|
||||
and includes the part-of-speech tags and entity labels. The library also
|
||||
includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy)
|
||||
that let you evaluate vector models and create terminology lists. For more
|
||||
details, check out
|
||||
[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the
|
||||
semantic similarities across all Reddit comments of 2015 and 2019, see the
|
||||
[interactive demo](https://explosion.ai/demos/sense2vec).
|
||||
|
||||
</Infobox>
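
For illustration, querying a downloaded sense2vec vector package from Python
looks roughly like this (a sketch based on the standalone API described in the
sense2vec readme; the path and the exact keys available depend on the vector
package you're using):

```python
from sense2vec import Sense2Vec

# Placeholder path to a downloaded vector package, e.g. s2v_reddit_2015_md
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Keys combine the (merged) phrase and its part-of-speech tag or entity label
query = "fast_food|NOUN"
if query in s2v:
    print(s2v.get_freq(query))
    print(s2v.most_similar(query, n=3))
```
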
@@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

<Infobox title="What to expect from similarity results" variant="warning">

Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose.

Also note that the similarity of `Doc` or `Span` objects defaults to the
**average** of the token vectors. This means it's insensitive to the order of
the words. Two documents expressing the same meaning with dissimilar wording
will return a lower similarity score than two documents that happen to contain
the same words while expressing different meanings.

</Infobox>

### Adding word vectors {#adding-vectors}

Custom word vectors can be trained using a number of open-source libraries, such

@@ -6,7 +6,7 @@ import classNames from 'classnames'

import Icon from './icon'
import classes from '../styles/link.module.sass'
import { isString } from './util'
import { isString, isImage } from './util'

const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi

@@ -39,7 +39,7 @@ export default function Link({
    const dest = to || href
    const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
    const icon = getIcon(dest)
    const withIcon = !hidden && !hideIcon && !!icon
    const withIcon = !hidden && !hideIcon && !!icon && !isImage(children)
    const sourceWithText = withIcon && isString(children)
    const linkClassNames = classNames(classes.root, className, {
        [classes.hidden]: hidden,

@@ -46,6 +46,17 @@ export function isString(obj) {
    return typeof obj === 'string' || obj instanceof String
}

/**
 * @param obj - The object to check.
 * @returns {boolean} – Whether the object is an image
 */
export function isImage(obj) {
    if (!obj || !React.isValidElement(obj)) {
        return false
    }
    return obj.props.name == 'img' || obj.props.className == 'gatsby-resp-image-wrapper'
}

/**
 * @param obj - The object to check.
 * @returns {boolean} - Whether the object is empty.