spaCy/extra/DEVELOPER_DOCS/StringStore-Vocab.md
Paul O'Leary McCann 9ceb8f413c
StringStore/Vocab dev docs (#9142)
* First take at StringStore/Vocab docs

Things to check:

1. The mysterious vocab members
2. How to make table of contents? Is it autogenerated?
3. Anything I missed / needs more detail?

* Update docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updates based on review feedback

* Minor fix

* Move example code down

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-16 12:50:22 +09:00

217 lines
8.9 KiB
Markdown

# StringStore & Vocab
> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`
## Overview
spaCy represents mosts strings internally using a `uint64` in Cython which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.
These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.
## StringStore
### Overview
The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.
The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.
The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case here is used to get the
return value:
1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to lookup a string
For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).
Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's not a token mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.
One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.
### Implementation Details: Hashes and Allocations
Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.
[murmurhash]: https://github.com/explosion/murmurhash
The empty string is not hashed, it's just converted to/from 0.
A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.
You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.
Example:
```
from spacy.strings import StringStore
ss = StringStore()
hashval = ss["spacy"] # 10639093010105930009
try:
# this won't work
ss[hashval]
except KeyError:
print(f"key {hashval} unknown in the StringStore.")
ss.add("spacy")
assert ss[hashval] == "spacy" # it works now
# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
print(key)
```
In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)
Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.
### When to Use the StringStore?
While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.
Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases:
- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests
## Vocab
The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.
These are things stored in the vocab:
- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatability
Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
### Lexemes
A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.
Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts a hash or id, with one important difference: if you do a lookup
using a string, that value is added to the `StringStore` automatically.
The attributes stored in a `LexemeC` are:
- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix
Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.
### Lookups
This is basically a dict of dicts, implemented using a `Table` for each
sub-dict, that stores lemmas and other language-specific lookup data.
A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.
### Lex Attribute Getters
Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.
### Writing System
This is a dict with three attributes:
- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)
Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.
### Other Vocab Members
The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.
- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in next major version