spaCy/website/docs/api/stringstore.md

---
title: StringStore
tag: class
source: spacy/strings.pyx
---
Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values instead of
integer IDs. This ensures that strings always map to the same ID, even from
different `StringStores`.
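
The difference between insertion-order IDs and content-derived hashes can be sketched in plain Python. This is a minimal illustration only: `stable_hash64` and `sequential_ids` are hypothetical helpers, and the `blake2b`-based hash is a stand-in, so its values will not match the hashes spaCy produces.

```python
import hashlib


def stable_hash64(text: str) -> int:
    """Illustrative 64-bit hash (an assumption, not spaCy's hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def sequential_ids(strings):
    """Assign IDs by insertion order, as a plain integer-ID table would."""
    return {s: i for i, s in enumerate(strings)}


# Two tables that saw the same strings in a different order.
store_a = sequential_ids(["apple", "orange"])
store_b = sequential_ids(["orange", "apple"])
assert store_a["apple"] != store_b["apple"]  # insertion-order IDs disagree

# A hash derived from the string's content is the same in every table.
hashes_a = {s: stable_hash64(s) for s in ["apple", "orange"]}
hashes_b = {s: stable_hash64(s) for s in ["orange", "apple"]}
assert hashes_a["apple"] == hashes_b["apple"]
assert 0 <= hashes_a["apple"] < 2**64  # fits in 64 bits
```
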
## StringStore.\_\_init\_\_ {#init tag="method"}
Create the `StringStore`.
> #### Example
>
> ```python
> from spacy.strings import StringStore
> stringstore = StringStore(["apple", "orange"])
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| --------- | ---------------------------------------------------------------------- |
| `strings` | A sequence of strings to add to the store. ~~Optional[Iterable[str]]~~ |
## StringStore.\_\_len\_\_ {#len tag="method"}
Get the number of strings in the store.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> assert len(stringstore) == 2
> ```
| Name | Description |
| ----------- | ------------------------------------------- |
| **RETURNS** | The number of strings in the store. ~~int~~ |
## StringStore.\_\_getitem\_\_ {#getitem tag="method"}
`StringStore` refactoring (#11344) * `strings`: Remove unused `hash32_utf8` function * `strings`: Make `hash_utf8` and `decode_Utf8Str` private * `strings`: Reorganize private functions * 'strings': Raise error when non-string/-int types are passed to functions that don't accept them * `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods * `Morphology`: Use `StringStore.items()` to enumerate features when pickling * `test_stringstore`: Update pre-Python 3 tests * Update `StringStore` docs * Fix `get_string_id` imports * Replace redundant test with tests for type checking * Rename `_retrieve_interned_str`, remove `.get` default arg * Add `get_string_id` to `strings.pyi` Remove `mypy` ignore directives from imports of the above * `strings.pyi`: Replace functions that consume `Union`-typed params with overloads * `strings.pyi`: Revert some function signatures * Update `SYMBOLS_BY_INT` lookups and error codes post-merge * Revert clobbered change introduced in a previous merge * Remove unnecessary type hint * Invert tuple order in `StringStore.items()` * Add test for `StringStore.items()` * Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling" This reverts commit 1af9510ceb6b08cfdcfbf26df6896f26709fac0d. * Rename `keys` and `key_map` * Add `keys()` and `values()` * Add comment about the inverted key-value semantics in the API * Fix type hints * Implement `keys()`, `values()`, `items()` without generators * Fix type hints, remove unnecessary boxing * Update docs * Simplify `keys/values/items()` impl * `mypy` fix * Fix error message, doc fixes
2022-10-06 11:51:06 +03:00
Retrieve a string from a given hash. If a string is passed as the input, add it
to the store and return its hash.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> apple_hash = stringstore["apple"]
> assert apple_hash == 8566208034543834098
> assert stringstore[apple_hash] == "apple"
> ```
| Name | Description |
| ---------------- | ---------------------------------------------------------------------------- |
| `string_or_hash` | The hash value to look up or the string to store. ~~Union[str, int]~~ |
| **RETURNS** | The stored string or the hash of the newly added string. ~~Union[str, int]~~ |
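
The two-way lookup described above amounts to dispatching on the argument's type. The following is a rough sketch under stated assumptions: `ToyStringStore` is a hypothetical stand-in written for illustration, its hash is a placeholder, and spaCy's actual implementation and error types may differ.

```python
import hashlib
from typing import Dict, Union


class ToyStringStore:
    """Hypothetical stand-in for illustration; not spaCy's implementation."""

    def __init__(self) -> None:
        self._by_hash: Dict[int, str] = {}

    @staticmethod
    def _hash(text: str) -> int:
        # Placeholder 64-bit hash (an assumption); real spaCy hashes differ.
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def __getitem__(self, string_or_hash: Union[str, int]) -> Union[str, int]:
        if isinstance(string_or_hash, str):
            # String input: store it and hand back its hash.
            key = self._hash(string_or_hash)
            self._by_hash[key] = string_or_hash
            return key
        if isinstance(string_or_hash, int):
            # Hash input: succeeds only if the string was stored earlier.
            return self._by_hash[string_or_hash]
        raise TypeError(f"expected str or int, got {type(string_or_hash).__name__}")


store = ToyStringStore()
cherry_hash = store["cherry"]          # adds "cherry" and returns its hash
assert store[cherry_hash] == "cherry"  # the hash maps back to the string
```

A hash can only be resolved back to a string if that string was added to the same store first, which is why the sketch looks the string up before the hash.
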
## StringStore.\_\_contains\_\_ {#contains tag="method"}
Check whether a string or a hash is in the store.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> assert "apple" in stringstore
> assert "cherry" not in stringstore
> ```
| Name | Description |
| ---------------- | ------------------------------------------------------- |
| `string_or_hash` | The string or hash to check. ~~Union[str, int]~~ |
| **RETURNS** | Whether the store contains the string or hash. ~~bool~~ |
## StringStore.\_\_iter\_\_ {#iter tag="method"}
Iterate over the stored strings in insertion order.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> all_strings = [s for s in stringstore]
> assert all_strings == ["apple", "orange"]
> ```
| Name | Description |
| ----------- | ------------------------------ |
| **YIELDS** | A string in the store. ~~str~~ |
## StringStore.items {#items tag="method" new="4"}
Return the stored string-hash pairs as a list, in insertion order.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> all_strings_and_hashes = stringstore.items()
> assert all_strings_and_hashes == [("apple", 8566208034543834098), ("orange", 2208928596161743350)]
> ```
| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | A list of string-hash pairs. ~~List[Tuple[str, int]]~~ |
## StringStore.keys {#keys tag="method" new="4"}
Return the stored strings as a list, in insertion order.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> all_strings = stringstore.keys()
> assert all_strings == ["apple", "orange"]
> ```
| Name | Description |
| ----------- | -------------------------------- |
| **RETURNS** | A list of strings. ~~List[str]~~ |
## StringStore.values {#values tag="method" new="4"}
Return the stored string hashes as a list, in insertion order.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> all_hashes = stringstore.values()
> assert all_hashes == [8566208034543834098, 2208928596161743350]
> ```
| Name | Description |
| ----------- | -------------------------------------- |
| **RETURNS** | A list of string hashes. ~~List[int]~~ |
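
Since `items()` returns `(string, hash)` tuples, building a hash-to-string table means swapping each pair. The sketch below works on the literal pairs from the `items()` example above, so it needs no spaCy import. It assumes that `keys()` and `values()` line up positionally with `items()`, which the shared insertion order suggests but which is not stated explicitly here.

```python
# Literal pairs copied from the items() example above.
pairs = [("apple", 8566208034543834098), ("orange", 2208928596161743350)]

# Under the assumption above, keys() and values() are the two columns of items().
strings = [s for s, _ in pairs]
hashes = [h for _, h in pairs]
assert list(zip(strings, hashes)) == pairs

# Swap each tuple to get a hash -> string lookup table.
hash_to_string = {h: s for s, h in pairs}
assert hash_to_string[8566208034543834098] == "apple"
```
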
## StringStore.add {#add tag="method"}
Add a string to the `StringStore`.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> banana_hash = stringstore.add("banana")
> assert len(stringstore) == 3
> assert banana_hash == 2525716904149915114
> assert stringstore[banana_hash] == "banana"
> assert stringstore["banana"] == banana_hash
> ```
| Name | Description |
| ----------- | -------------------------------- |
| `string` | The string to add. ~~str~~ |
| **RETURNS** | The string's hash value. ~~int~~ |
## StringStore.to_disk {#to_disk tag="method"}
Save the current state to a directory.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> stringstore.to_disk("/path/to/strings")
> ```
| Name | Description |
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
## StringStore.from_disk {#from_disk tag="method" new="2"}
Load state from a directory. Modifies the object in place and returns it.
> #### Example
>
> ```python
> from spacy.strings import StringStore
> stringstore = StringStore().from_disk("/path/to/strings")
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| **RETURNS** | The modified `StringStore` object. ~~StringStore~~ |
## StringStore.to_bytes {#to_bytes tag="method"}
Serialize the current state to a binary string.
> #### Example
>
> ```python
> stringstore = StringStore(["apple", "orange"])
> store_bytes = stringstore.to_bytes()
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The serialized form of the `StringStore` object. ~~bytes~~ |
## StringStore.from_bytes {#from_bytes tag="method"}
Load state from a binary string.
> #### Example
>
> ```python
> from spacy.strings import StringStore
> stringstore = StringStore(["apple", "orange"])
> store_bytes = stringstore.to_bytes()
> new_store = StringStore().from_bytes(store_bytes)
> ```
| Name | Description |
| ------------ | ----------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| **RETURNS** | The `StringStore` object. ~~StringStore~~ |
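
Because a hash is derived from the string alone, a byte round trip only needs to carry the strings; the hashes can be recomputed on load. The sketch below shows one plausible scheme in plain Python. It is an assumption made for illustration, not a description of spaCy's byte format: `toy_hash`, `toy_to_bytes` and `toy_from_bytes` are hypothetical names, and the hash is a placeholder.

```python
import hashlib
import json
from typing import Dict, Iterable


def toy_hash(text: str) -> int:
    """Placeholder 64-bit hash (an assumption, not spaCy's hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def toy_to_bytes(strings: Iterable[str]) -> bytes:
    """Serialize only the strings; hashes are not written out."""
    return json.dumps(list(strings)).encode("utf8")


def toy_from_bytes(bytes_data: bytes) -> Dict[int, str]:
    """Rebuild the hash -> string table by rehashing each loaded string."""
    return {toy_hash(s): s for s in json.loads(bytes_data.decode("utf8"))}


original = {toy_hash(s): s for s in ["apple", "orange"]}
restored = toy_from_bytes(toy_to_bytes(original.values()))
assert restored == original  # same strings, same recomputed hashes
```
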
## Utilities {#util}
### strings.hash_string {#hash_string tag="function"}
Get a 64-bit hash for a given string.
> #### Example
>
> ```python
> from spacy.strings import hash_string
> assert hash_string("apple") == 8566208034543834098
> ```
| Name | Description |
| ----------- | --------------------------- |
| `string` | The string to hash. ~~str~~ |
| **RETURNS** | The hash. ~~int~~ |