diff --git a/website/docs/images/sense2vec.jpg b/website/docs/images/sense2vec.jpg new file mode 100644 index 000000000..3a1772582 Binary files /dev/null and b/website/docs/images/sense2vec.jpg differ diff --git a/website/docs/usage/101/_vectors-similarity.md b/website/docs/usage/101/_vectors-similarity.md index a04c96236..92df1b331 100644 --- a/website/docs/usage/101/_vectors-similarity.md +++ b/website/docs/usage/101/_vectors-similarity.md @@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one. Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and [`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity) method that lets you compare it with another object, and determine the -similarity. Of course similarity is always subjective – whether "dog" and "cat" -are similar really depends on how you're looking at it. spaCy's similarity model -usually assumes a pretty general-purpose definition of similarity. +similarity. Of course similarity is always subjective – whether two words, spans +or documents are similar really depends on how you're looking at it. spaCy's +similarity model usually assumes a pretty general-purpose definition of +similarity. - +> #### 📝 Things to try +> +> 1. Compare two different tokens and try to find the two most _dissimilar_ +> tokens in the texts with the lowest similarity score (according to the +> vectors). +> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in +> the vocabulary. You can get a lexeme via the `.lex` attribute of a token. +> You should see that the similarity results are identical to the token +> similarity. ```python ### {executable="true"} import spacy nlp = spacy.load("en_core_web_md") # make sure to use larger model! 
-tokens = nlp("dog cat banana") +doc1 = nlp("I like salty fries and hamburgers.") +doc2 = nlp("Fast food tastes very good.") -for token1 in tokens: - for token2 in tokens: - print(token1.text, token2.text, token1.similarity(token2)) +# Similarity of two documents +print(doc1, "<->", doc2, doc1.similarity(doc2)) +# Similarity of tokens and spans +french_fries = doc1[2:4] +burgers = doc1[5] +print(french_fries, "<->", burgers, french_fries.similarity(burgers)) ``` -In this case, the model's predictions are pretty on point. A dog is very similar -to a cat, whereas a banana is not very similar to either of them. Identical -tokens are obviously 100% similar to each other (just not always exactly `1.0`, -because of vector math and floating point imprecisions). +### What to expect from similarity results {#similarity-expectations} + +Computing similarity scores can be helpful in many situations, but it's also +important to maintain **realistic expectations** about what information it can +provide. Words can be related to each other in many ways, so a single +"similarity" score will always be a **mix of different signals**, and vectors +trained on different data can produce very different results that may not be +useful for your purpose. Here are some important considerations to keep in mind: + +- There's no objective definition of similarity. Whether "I like burgers" and "I + like pasta" are similar **depends on your application**. Both talk about food + preferences, which makes them very similar – but if you're analyzing mentions + of food, those sentences are pretty dissimilar, because they talk about very + different foods. +- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults + to the **average** of the token vectors. This means that the vector for "fast + food" is the average of the vectors for "fast" and "food", which isn't + necessarily representative of the phrase "fast food". 
+- Vector averaging means that the vector of multiple tokens is **insensitive to + the order** of the words. Two documents expressing the same meaning with + dissimilar wording will return a lower similarity score than two documents + that happen to contain the same words while expressing different meanings. + + + +[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec) + +[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by +us that builds on top of spaCy and lets you train and query more interesting and +detailed word vectors. It combines noun phrases like "fast food" or "fair game" +and includes the part-of-speech tags and entity labels. The library also +includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy) +that let you evaluate vector models and create terminology lists. For more +details, check out +[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the +semantic similarities across all Reddit comments of 2015 and 2019, see the +[interactive demo](https://explosion.ai/demos/sense2vec). + + diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 10efcf875..3aa0df7b4 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md' - - -Computing similarity scores can be helpful in many situations, but it's also -important to maintain **realistic expectations** about what information it can -provide. Words can be related to each over in many ways, so a single -"similarity" score will always be a **mix of different signals**, and vectors -trained on different data can produce very different results that may not be -useful for your purpose. - -Also note that the similarity of `Doc` or `Span` objects defaults to the -**average** of the token vectors. This means it's insensitive to the order of -the words. 
Two documents expressing the same meaning with dissimilar wording -will return a lower similarity score than two documents that happen to contain -the same words while expressing different meanings. - - - ### Adding word vectors {#adding-vectors} Custom word vectors can be trained using a number of open-source libraries, such diff --git a/website/src/components/link.js b/website/src/components/link.js index 3644479c5..acded7d0d 100644 --- a/website/src/components/link.js +++ b/website/src/components/link.js @@ -6,7 +6,7 @@ import classNames from 'classnames' import Icon from './icon' import classes from '../styles/link.module.sass' -import { isString } from './util' +import { isString, isImage } from './util' const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi @@ -39,7 +39,7 @@ export default function Link({ const dest = to || href const external = forceExternal || /(http(s?)):\/\//gi.test(dest) const icon = getIcon(dest) - const withIcon = !hidden && !hideIcon && !!icon + const withIcon = !hidden && !hideIcon && !!icon && !isImage(children) const sourceWithText = withIcon && isString(children) const linkClassNames = classNames(classes.root, className, { [classes.hidden]: hidden, diff --git a/website/src/components/util.js b/website/src/components/util.js index 844f2c133..a9c6efcf5 100644 --- a/website/src/components/util.js +++ b/website/src/components/util.js @@ -46,6 +46,17 @@ export function isString(obj) { return typeof obj === 'string' || obj instanceof String } +/** + * @param obj - The object to check. + * @returns {boolean} - Whether the object is an image. + */ +export function isImage(obj) { + if (!obj || !React.isValidElement(obj)) { + return false + } + return obj.props.name === 'img' || obj.props.className === 'gatsby-resp-image-wrapper' +} + /** * @param obj - The object to check. * @returns {boolean} - Whether the object is empty.
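The docs section added above states that the similarity of `Doc` and `Span` objects defaults to the **average** of the token vectors, which makes the result insensitive to word order. A minimal pure-Python sketch of that averaging behavior, using made-up 3-dimensional vectors (real spaCy vectors come from the model's vector table and are typically 300-dimensional):

```python
import math

# Toy "word vectors", invented purely for illustration – not real spaCy data.
VECTORS = {
    "fast": [0.1, 0.9, 0.2],
    "food": [0.8, 0.1, 0.3],
    "tastes": [0.2, 0.4, 0.7],
    "good": [0.5, 0.5, 0.1],
}

def doc_vector(words):
    """Average the word vectors, like Doc.vector does by default."""
    dim = len(next(iter(VECTORS.values())))
    totals = [0.0] * dim
    for word in words:
        for i, value in enumerate(VECTORS[word]):
            totals[i] += value
    return [total / len(words) for total in totals]

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Shuffling the words leaves the averaged vector (numerically) unchanged,
# so the similarity score cannot distinguish the two orderings.
v1 = doc_vector(["fast", "food", "tastes", "good"])
v2 = doc_vector(["good", "tastes", "food", "fast"])
print(round(cosine(v1, v2), 6))  # → 1.0
```

This is why two documents with the same words in a different order score as near-identical, while a paraphrase with different vocabulary can score lower.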