spaCy/website/usage/_spacy-101/_vocab.jade
2017-10-03 14:26:20 +02:00

114 lines
5.7 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > SPACY 101 > VOCAB & STRINGSTORE
p
| Whenever possible, spaCy tries to store data in a vocabulary, the
| #[+api("vocab") #[code Vocab]], that will be
| #[strong shared by multiple documents]. To save memory, spaCy also
| encodes all strings to #[strong hash values] in this case for example,
| "coffee" has the hash #[code 3197928453018144401]. Entity labels like
| "ORG" and part-of-speech tags like "VERB" are also encoded. Internally,
| spaCy only "speaks" in hash values.
+aside
| #[strong Token]: A word, punctuation mark etc. #[em in context], including
| its attributes, tags and dependencies.#[br]
| #[strong Lexeme]: A "word type" with no context. Includes the word shape
| and flags, e.g. if it's lowercase, a digit or punctuation.#[br]
| #[strong Doc]: A processed container of tokens in context.#[br]
| #[strong Vocab]: The collection of lexemes.#[br]
| #[strong StringStore]: The dictionary mapping hash values to strings, for
| example #[code 3197928453018144401] → "coffee".
+graphic("/assets/img/vocab_stringstore.svg")
include ../../assets/img/vocab_stringstore.svg
p
| If you process lots of documents containing the word "coffee" in all
| kinds of different contexts, storing the exact string "coffee" every time
| would take up way too much space. So instead, spaCy hashes the string
| and stores it in the #[+api("stringstore") #[code StringStore]]. You can
| think of the #[code StringStore] as a
| #[strong lookup table that works in both directions] you can look up a
| string to get its hash, or a hash to get its string:
+code.
doc = nlp(u'I like coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
+aside("What does 'L' at the end of a hash mean?")
| If you return a hash value in the #[strong Python 2 interpreter], it'll
| show up as #[code 3197928453018144401L]. The #[code L] just means "long
| integer" it's #[strong not] actually a part of the hash value.
p
| Now that all strings are encoded, the entries in the vocabulary
| #[strong don't need to include the word text] themselves. Instead,
| they can look it up in the #[code StringStore] via its hash value. Each
| entry in the vocabulary, also called #[+api("lexeme") #[code Lexeme]],
| contains the #[strong context-independent] information about a word.
| For example, no matter if "love" is used as a verb or a noun in some
| context, its spelling and whether it consists of alphabetic characters
| won't ever change. Its hash value will also always be the same.
+code.
for word in doc:
lexeme = doc.vocab[word.text]
print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
+aside
| #[strong Text]: The original text of the lexeme.#[br]
| #[strong Orth]: The hash value of the lexeme.#[br]
| #[strong Shape]: The abstract word shape of the lexeme.#[br]
| #[strong Prefix]: By default, the first letter of the word string.#[br]
| #[strong Suffix]: By default, the last three letters of the word string.#[br]
| #[strong is alpha]: Does the lexeme consist of alphabetic characters?#[br]
| #[strong is digit]: Does the lexeme consist of digits?#[br]
+table(["text", "orth", "shape", "prefix", "suffix", "is_alpha", "is_digit"])
- var style = [0, 1, 1, 0, 0, 1, 1]
+annotation-row(["I", "4690420944186131903", "X", "I", "I", true, false], style)
+annotation-row(["love", "3702023516439754181", "xxxx", "l", "ove", true, false], style)
+annotation-row(["coffee", "3197928453018144401", "xxxx", "c", "ffe", true, false], style)
p
| The mapping of words to hashes doesn't depend on any state. To make sure
| each value is unique, spaCy uses a
| #[+a("https://en.wikipedia.org/wiki/Hash_function") hash function] to
| calculate the hash #[strong based on the word string]. This also means
| that the hash for "coffee" will always be the same, no matter which model
| you're using or how you've configured spaCy.
p
| However, hashes #[strong cannot be reversed] and there's no way to
| resolve #[code 3197928453018144401] back to "coffee". All spaCy can do
| is look it up in the vocabulary. That's why you always need to make
| sure all objects you create have access to the same vocabulary. If they
| don't, spaCy might not be able to find the strings it needs.
+code.
from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = nlp(u'I like coffee') # original Doc
assert doc.vocab.strings[u'coffee'] == 3197928453018144401 # get hash
assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
empty_doc = Doc(Vocab()) # new Doc with empty Vocab
# doc.vocab.strings[3197928453018144401] will raise an error :(
empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash
assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
new_doc = Doc(doc.vocab) # create new doc with first doc's vocab
assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
p
| If the vocabulary doesn't contain a string for #[code 3197928453018144401],
| spaCy will raise an error. You can re-add "coffee" manually, but this
| only works if you actually #[em know] that the document contains that
| word. To prevent this problem, spaCy will also export the #[code Vocab]
| when you save a #[code Doc] or #[code nlp] object. This will give you
| the object and its encoded annotations, plus the "key" to decode it.