mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-04 21:50:35 +03:00
Update vectors and similarity docs [ci skip]
This commit is contained in:
parent
15e6feed01
commit
2253d26b82
BIN
website/docs/images/sense2vec.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 224 KiB
@@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one.

Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and
[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity)
method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective – whether two words, spans
or documents are similar really depends on how you're looking at it. spaCy's
similarity model usually assumes a pretty general-purpose definition of
similarity.

<!-- TODO: use better example here -->

> #### 📝 Things to try
>
> 1. Compare two different tokens and try to find the two most _dissimilar_
>    tokens in the texts with the lowest similarity score (according to the
>    vectors).
> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in
>    the vocabulary. You can get a lexeme via the `.lex` attribute of a token.
>    You should see that the similarity results are identical to the token
>    similarity.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```
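The first exercise above boils down to scoring every token pair and keeping the pair with the minimum score. As a rough sketch of that loop – using made-up three-dimensional vectors standing in for real `token.vector` values, so it runs without a trained pipeline:

```python
import math

# Made-up vectors standing in for spaCy token vectors (hypothetical values)
tokens = {
    "dog": [0.8, 0.6, 0.1],
    "cat": [0.7, 0.7, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity, the same measure spaCy's .similarity uses by default
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Score every unordered pair and keep the most dissimilar one
pairs = [(w1, w2) for w1 in tokens for w2 in tokens if w1 < w2]
most_dissimilar = min(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
print(most_dissimilar)  # ('banana', 'dog') with these made-up vectors
```

With a real pipeline you'd iterate over `doc` and call `token1.similarity(token2)` instead of the hand-rolled `cosine`.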
### What to expect from similarity results {#similarity-expectations}

Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose. Here are some important considerations to keep in mind:

- There's no objective definition of similarity. Whether "I like burgers" and
  "I like pasta" are similar **depends on your application**. Both talk about
  food preferences, which makes them very similar – but if you're analyzing
  mentions of food, those sentences are pretty dissimilar, because they talk
  about very different foods.
- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults
  to the **average** of the token vectors. This means that the vector for "fast
  food" is the average of the vectors for "fast" and "food", which isn't
  necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is **insensitive to
  the order** of the words. Two documents expressing the same meaning with
  dissimilar wording will return a lower similarity score than two documents
  that happen to contain the same words while expressing different meanings.
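To make the averaging point concrete, here's a tiny sketch in plain Python – with made-up vectors rather than real spaCy vectors – showing that the mean of token vectors doesn't change when the word order does:

```python
def average_vector(vecs):
    # A spaCy-style Doc/Span vector: the elementwise mean of the token vectors
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Made-up vectors standing in for the tokens "fast" and "food"
fast = [0.1, 0.9, 0.3]
food = [0.8, 0.2, 0.5]

# "fast food" and "food fast" average to exactly the same vector
print(average_vector([fast, food]) == average_vector([food, fast]))  # True
```

This is why two documents with the same words in a different order can score as highly similar even when they mean different things.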

<Infobox title="Tip: Check out sense2vec" emoji="💡">

[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec)

[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by
us that builds on top of spaCy and lets you train and query more interesting and
detailed word vectors. It combines noun phrases like "fast food" or "fair game"
and includes the part-of-speech tags and entity labels. The library also
includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy)
that let you evaluate vector models and create terminology lists. For more
details, check out
[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the
semantic similarities across all Reddit comments of 2015 and 2019, see the
[interactive demo](https://explosion.ai/demos/sense2vec).

</Infobox>
@@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

### Adding word vectors {#adding-vectors}

Custom word vectors can be trained using a number of open-source libraries, such
@@ -6,7 +6,7 @@ import classNames from 'classnames'

import Icon from './icon'
import classes from '../styles/link.module.sass'
import { isString, isImage } from './util'

const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
@@ -39,7 +39,7 @@ export default function Link({

    const dest = to || href
    const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
    const icon = getIcon(dest)
    const withIcon = !hidden && !hideIcon && !!icon && !isImage(children)
    const sourceWithText = withIcon && isString(children)
    const linkClassNames = classNames(classes.root, className, {
        [classes.hidden]: hidden,
@@ -46,6 +46,17 @@ export function isString(obj) {
    return typeof obj === 'string' || obj instanceof String
}

/**
 * @param obj - The object to check.
 * @returns {boolean} - Whether the object is an image.
 */
export function isImage(obj) {
    if (!obj || !React.isValidElement(obj)) {
        return false
    }
    return obj.props.name == 'img' || obj.props.className == 'gatsby-resp-image-wrapper'
}

/**
 * @param obj - The object to check.
 * @returns {boolean} - Whether the object is empty.