spaCy/website/docs/api/stringstore.md
Madeesh Kannan 446a3ecf34
StringStore refactoring (#11344)
2022-10-06 10:51:06 +02:00

| title | tag | source |
| --- | --- | --- |
| StringStore | class | spacy/strings.pyx |

Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values instead of integer IDs. This ensures that strings always map to the same ID, even from different StringStores.
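The difference between counter-based IDs and hash-based IDs can be illustrated with a small standalone sketch. It does not use spaCy: `toy_hash`, `CounterStore` and `HashStore` are hypothetical names, and `hashlib.blake2b` is only a stand-in for a 64-bit hash, so the values it produces will not match spaCy's hashes.

```python
import hashlib


def toy_hash(string: str) -> int:
    """Stand-in 64-bit hash (an assumption for this sketch, not spaCy's hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class CounterStore:
    """Hands out IDs in insertion order, so an ID depends on the store's history."""

    def __init__(self) -> None:
        self.ids = {}

    def add(self, string: str) -> int:
        # Reuse the existing ID, or assign the next free integer.
        return self.ids.setdefault(string, len(self.ids))


class HashStore:
    """Uses the string's hash as its ID, so an ID depends only on the string."""

    def __init__(self) -> None:
        self.strings = {}

    def add(self, string: str) -> int:
        key = toy_hash(string)
        self.strings[key] = string
        return key


# Two counter-based stores filled in a different order disagree on "apple".
a, b = CounterStore(), CounterStore()
a.add("apple")
a.add("orange")
b.add("orange")
b.add("apple")
counter_ids_agree = a.add("apple") == b.add("apple")

# Two hash-based stores agree regardless of insertion order.
c, d = HashStore(), HashStore()
c.add("apple")
c.add("orange")
d.add("orange")
d.add("apple")
hash_ids_agree = c.add("apple") == d.add("apple")
```

In this sketch `counter_ids_agree` is `False` and `hash_ids_agree` is `True`, which is the property the paragraph above describes.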

StringStore.__init__

Create the StringStore.

Example

from spacy.strings import StringStore
stringstore = StringStore(["apple", "orange"])
| Name | Description | Type |
| --- | --- | --- |
| `strings` | A sequence of strings to add to the store. | `Optional[Iterable[str]]` |

StringStore.__len__

Get the number of strings in the store.

Example

stringstore = StringStore(["apple", "orange"])
assert len(stringstore) == 2
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | The number of strings in the store. | `int` |

StringStore.__getitem__

Retrieve a string from a given hash. If a string is passed as the input, add it to the store and return its hash.

Example

stringstore = StringStore(["apple", "orange"])
apple_hash = stringstore["apple"]
assert apple_hash == 8566208034543834098
assert stringstore[apple_hash] == "apple"
| Name | Description | Type |
| --- | --- | --- |
| `string_or_hash` | The hash value to look up or the string to store. | `Union[str, int]` |
| RETURNS | The stored string or the hash of the newly added string. | `Union[str, int]` |
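The two directions of this lookup can be sketched as a type dispatch inside `__getitem__`. The class below is a hypothetical toy, not spaCy's implementation: `ToyStringStore` and `toy_hash` are made-up names, the hash is a `hashlib` stand-in, and raising `KeyError` for an unknown hash and `TypeError` for other input types are choices made for this sketch.

```python
import hashlib
from typing import Dict, Union


def toy_hash(string: str) -> int:
    """Stand-in 64-bit hash (an assumption for this sketch, not spaCy's hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical store that maps hash -> string."""

    def __init__(self) -> None:
        self._by_hash: Dict[int, str] = {}

    def __getitem__(self, string_or_hash: Union[str, int]) -> Union[str, int]:
        if isinstance(string_or_hash, str):
            # A string is added to the store and its hash is returned.
            key = toy_hash(string_or_hash)
            self._by_hash[key] = string_or_hash
            return key
        if isinstance(string_or_hash, int):
            # A hash is looked up; an unknown hash raises KeyError here.
            return self._by_hash[string_or_hash]
        raise TypeError(f"expected str or int, got {type(string_or_hash).__name__}")


store = ToyStringStore()
apple_hash = store["apple"]      # string in, hash out (and "apple" is stored)
round_trip = store[apple_hash]   # hash in, string out
```

Keeping both directions behind one operator is convenient, but it also means that indexing with a string has a side effect: the string ends up in the store.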

StringStore.__contains__

Check whether a string or a hash is in the store.

Example

stringstore = StringStore(["apple", "orange"])
assert "apple" in stringstore
assert "cherry" not in stringstore
| Name | Description | Type |
| --- | --- | --- |
| `string_or_hash` | The string or hash to check. | `Union[str, int]` |
| RETURNS | Whether the store contains the string or hash. | `bool` |

StringStore.__iter__

Iterate over the stored strings in insertion order.

Example

stringstore = StringStore(["apple", "orange"])
all_strings = [s for s in stringstore]
assert all_strings == ["apple", "orange"]
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | A string in the store. | `str` |

StringStore.items

Get the stored string-hash pairs as a list, in insertion order.

Example

stringstore = StringStore(["apple", "orange"])
all_strings_and_hashes = stringstore.items()
assert all_strings_and_hashes == [("apple", 8566208034543834098), ("orange", 2208928596161743350)]
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | A list of string-hash pairs. | `List[Tuple[str, int]]` |
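Because each pair is ordered as `(string, hash)`, a hash-to-string lookup table has to swap the two elements. The helper below is a minimal sketch that works on a plain list shaped like the documented return value; `hash_to_string_map` is a hypothetical name, and the sample pairs are copied from the example above rather than produced by calling spaCy.

```python
from typing import Dict, List, Tuple


def hash_to_string_map(pairs: List[Tuple[str, int]]) -> Dict[int, str]:
    """Invert (string, hash) pairs into a hash -> string dictionary."""
    return {hash_value: string for string, hash_value in pairs}


# Sample data in the shape documented for StringStore.items().
pairs = [("apple", 8566208034543834098), ("orange", 2208928596161743350)]
lookup = hash_to_string_map(pairs)

# Dictionaries preserve insertion order, so the original order survives.
strings_in_order = list(lookup.values())
```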

StringStore.keys

Get the stored strings as a list, in insertion order.

Example

stringstore = StringStore(["apple", "orange"])
all_strings = stringstore.keys()
assert all_strings == ["apple", "orange"]
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | A list of strings. | `List[str]` |

StringStore.values

Get the stored string hashes as a list, in insertion order.

Example

stringstore = StringStore(["apple", "orange"])
all_hashes = stringstore.values()
assert all_hashes == [8566208034543834098, 2208928596161743350]
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | A list of string hashes. | `List[int]` |

StringStore.add

Add a string to the StringStore.

Example

stringstore = StringStore(["apple", "orange"])
banana_hash = stringstore.add("banana")
assert len(stringstore) == 3
assert banana_hash == 2525716904149915114
assert stringstore[banana_hash] == "banana"
assert stringstore["banana"] == banana_hash
| Name | Description | Type |
| --- | --- | --- |
| `string` | The string to add. | `str` |
| RETURNS | The string's hash value. | `int` |

StringStore.to_disk

Save the current state to a directory.

Example

stringstore = StringStore(["apple", "orange"])
stringstore.to_disk("/path/to/strings")
| Name | Description | Type |
| --- | --- | --- |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | `Union[str, Path]` |

StringStore.from_disk

Load state from a directory. Modifies the object in place and returns it.

Example

from spacy.strings import StringStore
stringstore = StringStore().from_disk("/path/to/strings")
Name Description
path A path to a directory. Paths may be either strings or Path-like objects. Union[str, Path]
RETURNS The modified StringStore object. StringStore

StringStore.to_bytes

Serialize the current state to a binary string.

Example

stringstore = StringStore(["apple", "orange"])
store_bytes = stringstore.to_bytes()
| Name | Description | Type |
| --- | --- | --- |
| RETURNS | The serialized form of the StringStore object. | `bytes` |

StringStore.from_bytes

Load state from a binary string.

Example

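from spacy.strings import StringStore
# Create a store to serialize, then load its bytes into a fresh store
stringstore = StringStore(["apple", "orange"])
store_bytes = stringstore.to_bytes()
new_store = StringStore().from_bytes(store_bytes)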
| Name | Description | Type |
| --- | --- | --- |
| `bytes_data` | The data to load from. | `bytes` |
| RETURNS | The StringStore object. | `StringStore` |

Utilities

strings.hash_string

Get a 64-bit hash for a given string.

Example

from spacy.strings import hash_string
assert hash_string("apple") == 8566208034543834098
| Name | Description | Type |
| --- | --- | --- |
| `string` | The string to hash. | `str` |
| RETURNS | The hash. | `int` |