spaCy/website/usage/_spacy-101/_vocab.jade
Ines Montani 49cee4af92
💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)
2018-04-29 02:06:46 +02:00


//- 💫 DOCS > USAGE > SPACY 101 > VOCAB & STRINGSTORE
p
| Whenever possible, spaCy tries to store data in a vocabulary, the
| #[+api("vocab") #[code Vocab]], that will be
| #[strong shared by multiple documents]. To save memory, spaCy also
| encodes all strings to #[strong hash values]. In this case, for example,
| "coffee" has the hash #[code 3197928453018144401]. Entity labels like
| "ORG" and part-of-speech tags like "VERB" are also encoded. Internally,
| spaCy only "speaks" in hash values.
+aside
| #[strong Token]: A word, punctuation mark etc. #[em in context], including
| its attributes, tags and dependencies.#[br]
| #[strong Lexeme]: A "word type" with no context. Includes the word shape
| and flags, e.g. if it's lowercase, a digit or punctuation.#[br]
| #[strong Doc]: A processed container of tokens in context.#[br]
| #[strong Vocab]: The collection of lexemes.#[br]
| #[strong StringStore]: The dictionary mapping hash values to strings, for
| example #[code 3197928453018144401] → "coffee".
+graphic("/assets/img/vocab_stringstore.svg")
include ../../assets/img/vocab_stringstore.svg
p
| If you process lots of documents containing the word "coffee" in all
| kinds of different contexts, storing the exact string "coffee" every time
| would take up way too much space. So instead, spaCy hashes the string
| and stores it in the #[+api("stringstore") #[code StringStore]]. You can
| think of the #[code StringStore] as a
| #[strong lookup table that works in both directions]: you can look up a
| string to get its hash, or a hash to get its string:
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee']) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee'
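p
| The idea behind the two-way lookup can be sketched in a few lines of
| plain Python. This is #[strong not] spaCy's implementation (spaCy uses
| a 64-bit MurmurHash and a Cython #[code StringStore]); it only
| illustrates how a deterministic hash lets one table serve both
| directions:

```python
# Toy sketch of a bidirectional string store. NOT spaCy's implementation:
# spaCy uses a 64-bit MurmurHash; here we derive a stable 64-bit value
# from a hashlib digest so results are reproducible across runs.
import hashlib

class ToyStringStore:
    def __init__(self):
        self._by_hash = {}  # hash value -> original string

    @staticmethod
    def hash_string(string):
        # Deterministic 64-bit value computed from the string alone,
        # so the same string always maps to the same hash.
        digest = hashlib.md5(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "little")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, string_or_hash):
        if isinstance(string_or_hash, int):
            return self._by_hash[string_or_hash]  # hash -> string
        return self.hash_string(string_or_hash)   # string -> hash

store = ToyStringStore()
key = store.add("coffee")
print(store["coffee"] == key)  # True: string -> hash
print(store[key])              # 'coffee': hash -> string
```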
+aside("What does 'L' at the end of a hash mean?")
| If you return a hash value in the #[strong Python 2 interpreter], it'll
| show up as #[code 3197928453018144401L]. The #[code L] just means "long
| integer". It's #[strong not] actually a part of the hash value.
p
| Now that all strings are encoded, the entries in the vocabulary
| #[strong don't need to include the word text] themselves. Instead,
| they can look it up in the #[code StringStore] via its hash value. Each
| entry in the vocabulary, also called #[+api("lexeme") #[code Lexeme]],
| contains the #[strong context-independent] information about a word.
| For example, no matter if "love" is used as a verb or a noun in some
| context, its spelling and whether it consists of alphabetic characters
| won't ever change. Its hash value will also always be the same.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
+aside
| #[strong Text]: The original text of the lexeme.#[br]
| #[strong Orth]: The hash value of the lexeme.#[br]
| #[strong Shape]: The abstract word shape of the lexeme.#[br]
| #[strong Prefix]: By default, the first letter of the word string.#[br]
| #[strong Suffix]: By default, the last three letters of the word string.#[br]
| #[strong is alpha]: Does the lexeme consist of alphabetic characters?#[br]
| #[strong is digit]: Does the lexeme consist of digits?#[br]
+table(["text", "orth", "shape", "prefix", "suffix", "is_alpha", "is_digit"])
- var style = [0, 1, 1, 0, 0, 1, 1]
+annotation-row(["I", "4690420944186131903", "X", "I", "I", true, false], style)
+annotation-row(["love", "3702023516439754181", "xxxx", "l", "ove", true, false], style)
+annotation-row(["coffee", "3197928453018144401", "xxxx", "c", "fee", true, false], style)
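p
| To make the attributes in the table concrete, here's a rough pure-Python
| approximation of three of them. It mirrors the defaults described above
| (first letter, last three letters, and a word shape that truncates runs
| of four or more identical shape characters); spaCy's own implementation
| lives in Cython and differs in detail:

```python
# Rough re-implementation of a few lexeme attributes, for illustration
# only. spaCy computes these internally; this mirrors the documented
# defaults.
def prefix(word, length=1):
    return word[:length]    # by default, the first letter

def suffix(word, length=3):
    return word[-length:]   # by default, the last three letters

def word_shape(word):
    # Map letters to x/X and digits to d, truncating runs of the same
    # shape character after four occurrences (so "coffee" -> "xxxx").
    shape = []
    last = ""
    run = 0
    for char in word:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:
            shape.append(symbol)
    return "".join(shape)

print(prefix("coffee"), suffix("coffee"), word_shape("coffee"))
# c fee xxxx
```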
p
| The mapping of words to hashes doesn't depend on any state. To make sure
| each value is unique, spaCy uses a
| #[+a("https://en.wikipedia.org/wiki/Hash_function") hash function] to
| calculate the hash #[strong based on the word string]. This also means
| that the hash for "coffee" will always be the same, no matter which model
| you're using or how you've configured spaCy.
p
| However, hashes #[strong cannot be reversed] and there's no way to
| resolve #[code 3197928453018144401] back to "coffee". All spaCy can do
| is look it up in the vocabulary. That's why you always need to make
| sure all objects you create have access to the same vocabulary. If they
| don't, spaCy might not be able to find the strings it needs.
+code-exec.
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee') # original Doc
print(doc.vocab.strings[u'coffee']) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
empty_doc = Doc(Vocab()) # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(
empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash
print(empty_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
new_doc = Doc(doc.vocab) # create new doc with first doc's vocab
print(new_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
p
| If the vocabulary doesn't contain a string for #[code 3197928453018144401],
| spaCy will raise an error. You can re-add "coffee" manually, but this
| only works if you actually #[em know] that the document contains that
| word. To prevent this problem, spaCy will also export the #[code Vocab]
| when you save a #[code Doc] or #[code nlp] object. This will give you
| the object and its encoded annotations, plus the "key" to decode it.
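p
| The "object plus key" idea can be illustrated without spaCy at all.
| In this toy sketch (using the hash values from the table above, and
| JSON purely for illustration), the encoded annotations are useless on
| their own; only shipping the hash-to-string table alongside them makes
| decoding possible, which is why spaCy exports the #[code Vocab] with a
| saved #[code Doc]:

```python
# Toy illustration of why the string table (the "key") must travel with
# the encoded annotations. Hashes cannot be reversed by computation;
# they can only be looked up.
import json

# Encoded annotations: hash values for "I love coffee" (from the table
# in this section).
annotations = [4690420944186131903, 3702023516439754181, 3197928453018144401]
string_table = {
    4690420944186131903: "I",
    3702023516439754181: "love",
    3197928453018144401: "coffee",
}

# Export both together, the way spaCy bundles the Vocab with a saved Doc.
# (JSON keys must be strings, hence the str() conversion.)
payload = json.dumps({"hashes": annotations,
                      "strings": {str(k): v for k, v in string_table.items()}})

# On load, the string table is the key that decodes the hashes again.
data = json.loads(payload)
decoded = [data["strings"][str(h)] for h in data["hashes"]]
print(decoded)  # ['I', 'love', 'coffee']
```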