//- 💫 DOCS > USAGE > SPACY 101 > VOCAB & STRINGSTORE p | Whenever possible, spaCy tries to store data in a vocabulary, the | #[+api("vocab") #[code Vocab]], that will be | #[strong shared by multiple documents]. To save memory, spaCy also | encodes all strings to #[strong hash values] – in this case for example, | "coffee" has the hash #[code 3197928453018144401]. Entity labels like | "ORG" and part-of-speech tags like "VERB" are also encoded. Internally, | spaCy only "speaks" in hash values. +aside | #[strong Token]: A word, punctuation mark etc. #[em in context], including | its attributes, tags and dependencies.#[br] | #[strong Lexeme]: A "word type" with no context. Includes the word shape | and flags, e.g. if it's lowercase, a digit or punctuation.#[br] | #[strong Doc]: A processed container of tokens in context.#[br] | #[strong Vocab]: The collection of lexemes.#[br] | #[strong StringStore]: The dictionary mapping hash values to strings, for | example #[code 3197928453018144401] → "coffee". +image include ../../../assets/img/docs/vocab_stringstore.svg .u-text-right +button("/assets/img/docs/vocab_stringstore.svg", false, "secondary").u-text-tag View large graphic p | If you process lots of documents containing the word "coffee" in all | kinds of different contexts, storing the exact string "coffee" every time | would take up way too much space. So instead, spaCy hashes the string | and stores it in the #[+api("stringstore") #[code StringStore]]. You can | think of the #[code StringStore] as a | #[strong lookup table that works in both directions] – you can look up a | string to get its hash, or a hash to get its string: +code. doc = nlp(u'I like coffee') assert doc.vocab.strings[u'coffee'] == 3197928453018144401 assert doc.vocab.strings[3197928453018144401] == u'coffee' +aside("What does 'L' at the end of a hash mean?") | If you return a hash value in the #[strong Python 2 interpreter], it'll | show up as #[code 3197928453018144401L]. The #[code L] just means "long | integer" – it's #[strong not] actually a part of the hash value. p | Now that all strings are encoded, the entries in the vocabulary | #[strong don't need to include the word text] themselves. Instead, | they can look it up in the #[code StringStore] via its hash value. Each | entry in the vocabulary, also called #[+api("lexeme") #[code Lexeme]], | contains the #[strong context-independent] information about a word. | For example, no matter if "love" is used as a verb or a noun in some | context, its spelling and whether it consists of alphabetic characters | won't ever change. Its hash value will also always be the same. +code. for word in doc: lexeme = doc.vocab[word.text] print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_, lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_) +aside | #[strong Text]: The original text of the lexeme.#[br] | #[strong Orth]: The hash value of the lexeme.#[br] | #[strong Shape]: The abstract word shape of the lexeme.#[br] | #[strong Prefix]: By default, the first letter of the word string.#[br] | #[strong Suffix]: By default, the last three letters of the word string.#[br] | #[strong is alpha]: Does the lexeme consist of alphabetic characters?#[br] | #[strong is digit]: Does the lexeme consist of digits?#[br] +table(["text", "orth", "shape", "prefix", "suffix", "is_alpha", "is_digit"]) - var style = [0, 1, 1, 0, 0, 1, 1] +annotation-row(["I", "4690420944186131903", "X", "I", "I", true, false], style) +annotation-row(["love", "3702023516439754181", "xxxx", "l", "ove", true, false], style) +annotation-row(["coffee", "3197928453018144401", "xxxx", "c", "ffe", true, false], style) p | The mapping of words to hashes doesn't depend on any state. To make sure | each value is unique, spaCy uses a | #[+a("https://en.wikipedia.org/wiki/Hash_function") hash function] to | calculate the hash #[strong based on the word string]. This also means | that the hash for "coffee" will always be the same, no matter which model | you're using or how you've configured spaCy. p | However, hashes #[strong cannot be reversed] and there's no way to | resolve #[code 3197928453018144401] back to "coffee". All spaCy can do | is look it up in the vocabulary. That's why you always need to make | sure all objects you create have access to the same vocabulary. If they | don't, spaCy might not be able to find the strings it needs. +code. from spacy.tokens import Doc from spacy.vocab import Vocab doc = nlp(u'I like coffee') # original Doc assert doc.vocab.strings[u'coffee'] == 3197928453018144401 # get hash assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍 empty_doc = Doc(Vocab()) # new Doc with empty Vocab # doc.vocab.strings[3197928453018144401] will raise an error :( empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍 new_doc = Doc(doc.vocab) # create new doc with first doc's vocab assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍 p | If the vocabulary doesn't contain a hash for "coffee", spaCy will | throw an error. So you either need to add it manually, or initialise the | new #[code Doc] with the shared vocabulary. To prevent this problem, | spaCy will also export the #[code Vocab] when you save a | #[code Doc] or #[code nlp] object.