spaCy/extra/DEVELOPER_DOCS/StringStore-Vocab.md
Paul O'Leary McCann 9ceb8f413c
StringStore/Vocab dev docs (#9142)
2021-09-16 12:50:22 +09:00

StringStore & Vocab

Reference: spacy/strings.pyx
Reference: spacy/vocab.pyx

Overview

spaCy represents most strings internally using a uint64 in Cython which corresponds to a hash. The magic required to make this largely transparent is handled by the StringStore, which is integrated into pipelines via the Vocab, which also connects it to some other information.

These are mostly internal details that average library users should never have to think about. On the other hand, when developing a component it's normal to interact with the Vocab for lexeme data or word vectors, and it's not unusual to add labels to the StringStore.

StringStore

Overview

The StringStore is a cdef class that looks a bit like a two-way dictionary, though it is not a subclass of anything in particular.

The main functionality of the StringStore is that __getitem__ converts hashes into strings or strings into hashes.

The full details of the conversion are complicated. Normally you shouldn't have to worry about them; the first applicable case below determines the return value:

  1. 0 and the empty string are special cased to each other
  2. internal symbols use a lookup table (SYMBOLS_BY_STR)
  3. normal strings or bytes are hashed
  4. internal symbol IDs in SYMBOLS_BY_INT are handled
  5. anything not yet handled is used as a hash to lookup a string

For the symbol enums, see symbols.pxd.
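The case ordering above can be sketched as a toy model in pure Python. Note that the symbol tables and the hash function here are illustrative stand-ins, not spaCy's real implementations (spaCy uses murmurhash and the tables from symbols.pxd):

```python
# Toy model of StringStore.__getitem__'s case ordering.
# The symbol tables and hash function are stand-ins for the sketch.
import hashlib

SYMBOLS_BY_STR = {"ORTH": 65, "LEMMA": 73}          # stand-in symbol table
SYMBOLS_BY_INT = {v: k for k, v in SYMBOLS_BY_STR.items()}

def toy_hash(s: str) -> int:
    # spaCy uses murmurhash; md5 is just a stand-in here
    return int.from_bytes(hashlib.md5(s.encode("utf8")).digest()[:8], "little")

class ToyStringStore:
    def __init__(self):
        self._by_hash = {}  # hash -> string, filled by add()

    def add(self, string: str) -> int:
        key = toy_hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, string_or_id):
        # 1. 0 and the empty string are special-cased to each other
        if string_or_id == "":
            return 0
        if string_or_id == 0:
            return ""
        # 2. internal symbols use a lookup table
        if isinstance(string_or_id, str) and string_or_id in SYMBOLS_BY_STR:
            return SYMBOLS_BY_STR[string_or_id]
        # 3. normal strings (or bytes) are hashed
        if isinstance(string_or_id, str):
            return toy_hash(string_or_id)
        # 4. internal symbol IDs are handled
        if string_or_id in SYMBOLS_BY_INT:
            return SYMBOLS_BY_INT[string_or_id]
        # 5. anything else is treated as a hash to look up a string
        return self._by_hash[string_or_id]

ss = ToyStringStore()
assert ss[""] == 0 and ss[0] == ""
assert ss["ORTH"] == 65 and ss[65] == "ORTH"
key = ss.add("spacy")
assert ss["spacy"] == key and ss[key] == "spacy"
```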

Almost all strings in spaCy are stored in the StringStore. This naturally includes tokens, but also includes things like labels (not just NER/POS/dep, but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One of the main results of this is that tokens can be represented by a compact C struct (LexemeC/TokenC) that mostly consists of string hashes. This also means that converting input for the models is straightforward, and there's not a token mapping step like in many machine learning frameworks. Additionally, because the token IDs in spaCy are based on hashes, they are consistent across environments or models.

One pattern you'll see a lot in spaCy APIs is that something.value returns an int and something.value_ returns a string. That's implemented using the StringStore. Typically the int is stored in a C struct and the string is generated via a property that calls into the StringStore with the int.
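A minimal sketch of this pattern in pure Python (the class and attribute names are illustrative, not spaCy's actual types):

```python
# Sketch of the `.value` / `.value_` pattern: the struct-like object holds
# an int, and a property resolves it to a string via the store.
class Store:
    def __init__(self):
        self._by_hash = {}

    def add(self, s):
        key = hash(s)          # stand-in for murmurhash
        self._by_hash[key] = s
        return key

    def __getitem__(self, key):
        return self._by_hash[key]

class Token:
    def __init__(self, store, text):
        self.store = store
        self.lemma = store.add(text)   # the int, as stored in the C struct

    @property
    def lemma_(self):
        # the string is generated on demand via the store
        return self.store[self.lemma]

store = Store()
tok = Token(store, "run")
assert isinstance(tok.lemma, int)
assert tok.lemma_ == "run"
```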

Besides __getitem__, the StringStore has functions to return specifically a string or specifically a hash, regardless of whether the input was a string or hash to begin with, though these are only used occasionally.

Implementation Details: Hashes and Allocations

Hashes are 64-bit and are computed using murmurhash on UTF-8 bytes. There is no mechanism for detecting and avoiding collisions. To date there has never been a reproducible collision or user report about any related issues.

The empty string is not hashed; it's simply converted to/from 0.

A small number of strings use indices into a lookup table (so low integers) rather than hashes. This is mostly Universal Dependencies labels or other strings considered "core" in spaCy. This was critical in v1, which hadn't introduced hashing yet. Since v2 it's important for items in spacy.attrs, especially lexeme flags, but is otherwise only maintained for backwards compatibility.

You can call strings["mystring"] with a string the StringStore has never seen before and it will return a hash. But in order to do the reverse operation, you need to call strings.add("mystring") first. Without a call to add the string will not be interned.

Example:

from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"] # 10639093010105930009
try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy" # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)

In normal use nothing is ever removed from the StringStore. In theory this means that if you do something like iterate through all hex values of a certain length you can have explosive memory usage. In practice this has never been an issue. (Note that this is also different from using sys.intern to intern Python strings, which does not guarantee they won't be garbage collected later.)

Strings are stored in the StringStore in a peculiar way: each string uses a union that is either an eight-byte char[] or a char*. Short strings are stored directly in the char[], while longer strings are stored in allocated memory and prefixed with their length. This is a strategy to reduce indirection and memory fragmentation. See decode_Utf8Str and _allocate in strings.pyx for the implementation.
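The short-vs-long decision can be sketched in Python. This is only a rough model of the idea described above; the exact byte layout of the real C union in strings.pyx differs:

```python
# Toy model of the short-string union: strings whose UTF-8 form fits in
# the small inline buffer are stored directly; longer ones go to
# "allocated" memory prefixed with their length.
INLINE_SIZE = 8  # size of the inline char[] in the sketch

def store_string(s: str):
    raw = s.encode("utf8")
    if len(raw) < INLINE_SIZE:
        # fits in the inline buffer (one byte reserved for the length)
        return ("inline", bytes([len(raw)]) + raw)
    # longer strings are "allocated" separately, prefixed by their length
    return ("heap", len(raw).to_bytes(4, "little") + raw)

kind, _ = store_string("cat")
assert kind == "inline"
kind, _ = store_string("antidisestablishmentarianism")
assert kind == "heap"
```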

When to Use the StringStore?

While you can ignore the StringStore in many cases, there are situations where you should make use of it to avoid errors.

Any time you introduce a string that may be set on a Doc field that has a hash, you should add the string to the StringStore. This mainly happens when adding labels in components, but there are some other cases:

  • syntax iterators, mainly get_noun_chunks
  • external data used in components, like the KnowledgeBase in the entity_linker
  • labels used in tests
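For example, a component introducing a new label can intern it up front. A minimal sketch using the public API; the label name here is made up:

```python
from spacy.vocab import Vocab

vocab = Vocab()
label = "MY_CUSTOM_LABEL"        # hypothetical label a component might add
key = vocab.strings.add(label)   # intern it so the hash resolves back
assert vocab.strings[key] == label
```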

Vocab

The Vocab is a core component of a Language pipeline. Its main function is to manage Lexemes, which are structs that contain information about a token that depends only on its surface form, without context. Lexemes store much of the data associated with Tokens. As a side effect of this the Vocab also manages the StringStore for a pipeline and a grab-bag of other data.

These are things stored in the vocab:

  • Lexemes
  • StringStore
  • Morphology: manages info used in MorphAnalysis objects
  • vectors: basically a dict for word vectors
  • lookups: language specific data like lemmas
  • writing_system: language specific metadata
  • get_noun_chunks: a syntax iterator
  • lex attribute getters: functions like is_punct, set in language defaults
  • cfg: not the pipeline config, this is mostly unused
  • _unused_object: Formerly an unused object, kept around until v4 for compatibility

Some of these, like the Morphology and Vectors, are complex enough that they need their own explanations. Here we'll just look at Vocab-specific items.

Lexemes

A Lexeme is a type that mainly wraps a LexemeC, a struct consisting of ints that identify various context-free token attributes. Lexemes are the core data of the Vocab, and can be accessed using __getitem__ on the Vocab. The memory for storing LexemeC objects is managed by a pool that belongs to the Vocab.

Note that __getitem__ on the Vocab works much like the StringStore, in that it accepts a hash or id, with one important difference: if you do a lookup using a string, that value is added to the StringStore automatically.
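A quick illustration of that side effect (assuming spaCy is installed):

```python
from spacy.vocab import Vocab

vocab = Vocab()
lex = vocab["apple"]               # looked up by string...
assert "apple" in vocab.strings    # ...so the string is interned as a side effect
assert vocab[lex.orth].orth_ == "apple"  # and the Lexeme is reachable by hash
```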

The attributes stored in a LexemeC are:

  • orth (the raw text)
  • lower
  • norm
  • shape
  • prefix
  • suffix

Most of these are straightforward. All of them can be customized, and (except orth) probably should be, since the defaults are based on English; in practice, though, this is rarely done at present.
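A quick look at these attributes through the Python API (assuming spaCy is installed; spacy.blank("en") provides the English defaults mentioned above):

```python
import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["Apple"]
# The struct stores ints; the trailing-underscore properties resolve
# them through the StringStore.
assert lex.lower_ == "apple"   # default: lowercase form
assert lex.prefix_ == "A"      # default: first character
assert lex.suffix_ == "ple"    # default: last three characters
```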

Lookups

This is basically a dict of dicts, implemented using a Table for each sub-dict, that stores lemmas and other language-specific lookup data.

A Table is a subclass of OrderedDict used for string-to-string data. It uses Bloom filters to speed up misses and has some extra serialization features. Tables are not used outside of the lookups.
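The idea can be sketched in pure Python: an ordered dict fronted by an approximate membership filter, so most misses are rejected without a full lookup. The single-integer bit set here is a stand-in for the real Bloom filter:

```python
# Toy model of a Table: a string-to-string OrderedDict fronted by a
# tiny Bloom-style filter so misses can usually be rejected cheaply.
from collections import OrderedDict

class ToyTable(OrderedDict):
    FILTER_BITS = 1024

    def __init__(self, *args, **kwargs):
        self._filter = 0
        super().__init__(*args, **kwargs)
        for key in self:          # cover keys set during construction
            self._mark(key)

    def _mark(self, key):
        self._filter |= 1 << (hash(key) % self.FILTER_BITS)

    def _maybe_contains(self, key):
        return bool(self._filter & (1 << (hash(key) % self.FILTER_BITS)))

    def __setitem__(self, key, value):
        self._mark(key)
        super().__setitem__(key, value)

    def __contains__(self, key):
        # a clear filter bit proves absence; a set bit may be a false
        # positive, so fall back to the real dict lookup
        if not self._maybe_contains(key):
            return False
        return super().__contains__(key)

lemmas = ToyTable()
lemmas["dogs"] = "dog"
assert "dogs" in lemmas
assert "cats" not in lemmas
```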

Lex Attribute Getters

Lexical Attribute Getters like is_punct are defined on a per-language basis, much like lookups, but take the form of functions rather than string-to-string dicts, so they're stored separately.

Writing System

This is a dict with three attributes:

  • direction: ltr or rtl (default ltr)
  • has_case: bool (default True)
  • has_letters: bool (default True, False only for CJK for now)

Currently these are not used much; the main use is that direction is consulted in the visualizers, though rtl doesn't quite work (see #4854). In the future they could be used when choosing hyperparameters for subwords, controlling word shape generation, and similar tasks.

Other Vocab Members

The Vocab is kind of the default place to store things from Language.defaults that don't belong to the Tokenizer. The following properties are in the Vocab just because they don't have anywhere else to go.

  • get_noun_chunks
  • cfg: This is a dict that just stores oov_prob (hardcoded to -20)
  • _unused_object: Leftover C member, should be removed in next major version