mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 18:36:36 +03:00
52 lines
2.3 KiB
Plaintext
//- 💫 DOCS > USAGE > SPACY 101 > SIMILARITY

p
    | spaCy is able to compare two objects and predict
    | #[strong how similar they are]. Predicting similarity is useful for
    | building recommendation systems or flagging duplicates. For example, you
    | can suggest content to a user that's similar to what they're currently
    | looking at, or label a support ticket as a duplicate if it's very
    | similar to an existing one.

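p
    | The duplicate-flagging idea above can be sketched roughly as follows.
    | This is only an illustration: #[code find_duplicate],
    | #[code word_overlap] and the #[code 0.9] threshold are hypothetical
    | and not part of spaCy, and the word-overlap score is a toy stand-in
    | for a real similarity model.

```python
def find_duplicate(new_ticket, existing_tickets, similarity, threshold=0.9):
    """Return the most similar existing ticket if its score reaches
    the threshold, otherwise None. `similarity` is any callable that
    takes two texts and returns a float (hypothetical helper)."""
    best_ticket, best_score = None, 0.0
    for ticket in existing_tickets:
        score = similarity(new_ticket, ticket)
        if score > best_score:
            best_ticket, best_score = ticket, score
    return best_ticket if best_score >= threshold else None


def word_overlap(a, b):
    """Toy stand-in for a similarity model: Jaccard overlap of the
    lowercased words in both texts (not how spaCy computes similarity)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


tickets = ["Cannot log in to my account", "Invoice shows the wrong amount"]
duplicate = find_duplicate("cannot log in to my account", tickets, word_overlap)
unrelated = find_duplicate("Where is my parcel", tickets, word_overlap)
print(duplicate)
print(unrelated)
```

p
    | With a loaded model, the #[code similarity] argument could instead be
    | something like #[code lambda a, b: nlp(a).similarity(nlp(b))]. A
    | sensible threshold depends on the model and the data, so the value
    | used here is an assumption.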
p
    | Each #[code Doc], #[code Span] and #[code Token] comes with a
    | #[+api("token#similarity") #[code .similarity()]] method that lets you
    | compare it with another object and determine the similarity. Of course,
    | similarity is always subjective – whether "dog" and "cat" are similar
    | really depends on how you're looking at it. spaCy's similarity model
    | usually assumes a pretty general-purpose definition of similarity.

+code-exec.
    import spacy

    nlp = spacy.load('en_core_web_md')  # make sure to use a larger model with word vectors!
    tokens = nlp(u'dog cat banana')

    # compare every token with every token, including itself
    for token1 in tokens:
        for token2 in tokens:
            print(token1.text, token2.text, token1.similarity(token2))

+aside
    | #[strong #[+procon("neutral", "identical", false, 16)] similarity:] identical#[br]
    | #[strong #[+procon("yes", "similar", false, 16)] similarity:] similar (higher is more similar) #[br]
    | #[strong #[+procon("no", "dissimilar", false, 16)] similarity:] dissimilar (lower is less similar)

+table
    +row("head")
        for column in ["", "dog", "cat", "banana"]
            +head-cell.u-text-center=column

    each cells, label in {"dog": [1, 0.8, 0.24], "cat": [0.8, 1, 0.28], "banana": [0.24, 0.28, 1]}
        +row
            +cell.u-text-label.u-color-theme=label
            for cell in cells
                +cell.u-text-center
                    - var result = cell < 0.5 ? ["no", "dissimilar"] : cell != 1 ? ["yes", "similar"] : ["neutral", "identical"]
                    | #[code=cell.toFixed(2)] #[+procon(...result)]

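p
    | The icons in the table are assigned by a simple rule in this page's
    | template, which can be mirrored in Python as a rough sketch.
    | #[code label_similarity] is a hypothetical helper, and the
    | #[code 0.5] cut-off is the display threshold used by the table above,
    | not a rule built into spaCy.

```python
def label_similarity(score, threshold=0.5):
    """Map a similarity score to the label shown in the table.
    Hypothetical helper mirroring the table's display rule."""
    if score == 1:
        return "identical"
    if score < threshold:
        return "dissimilar"
    return "similar"


# Rounded scores copied from the table above
scores = {
    "dog": [1, 0.8, 0.24],
    "cat": [0.8, 1, 0.28],
    "banana": [0.24, 0.28, 1],
}

labels = {word: [label_similarity(s) for s in row] for word, row in scores.items()}
for word, row in labels.items():
    print(word, row)
```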
p
    | In this case, the model's predictions are pretty on point. A dog is very
    | similar to a cat, whereas a banana is not very similar to either of them.
    | Identical tokens are obviously 100% similar to each other (just not always
    | exactly #[code 1.0], because of vector math and floating-point
    | imprecision).
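
p
    | To make the "vector math" remark more concrete, here is a minimal
    | pure-Python sketch of cosine similarity, which is the default
    | estimate behind #[code .similarity()]. The function name and the
    | four-dimensional vectors are made up for illustration; real word
    | vectors have far more dimensions, and this is not spaCy's actual
    | implementation.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No direction to compare; this sketch simply returns 0.0
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only
dog = [0.12, -0.47, 0.33, 0.81]
banana = [-0.58, 0.21, 0.64, -0.09]

self_score = cosine_similarity(dog, dog)
other_score = cosine_similarity(dog, banana)

print(self_score)                     # very close to 1.0, possibly not exact
print(math.isclose(self_score, 1.0))  # compare with a tolerance instead of ==
print(other_score)
```

p
    | Because the square roots and the division are rounded, comparing a
    | score against #[code 1.0] with a tolerance, as
    | #[code math.isclose] does, is safer than testing for exact equality.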