spaCy/spacy/strings.pxd

from cymem.cymem cimport Pool
from libc.stdint cimport int64_t, uint32_t
from libcpp.set cimport set
from libcpp.vector cimport vector
from murmurhash.mrmr cimport hash64
from preshed.maps cimport PreshMap

from .typedefs cimport attr_t, hash_t

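# Compact string representation: short strings are stored inline in `s`,
# longer ones live in pool-allocated memory referenced by `p`.
ctypedef union Utf8Str:
    unsigned char[8] s
    unsigned char* p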


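cdef class StringStore:
    cdef Pool mem
    cdef vector[hash_t] _keys
    # Keys of strings interned inside a memory zone; dropped when the zone expires.
    cdef vector[hash_t] _transient_keys
    cdef PreshMap _map
    # Lookup table for strings interned inside a memory zone.
    cdef PreshMap _transient_map
    cdef Pool _non_temp_mem

    # `transient` marks strings that belong to the current memory zone.
    cdef hash_t _intern_str(self, str string, bint transient)
    cdef Utf8Str* _allocate_str_repr(self, const unsigned char* chars, uint32_t length) except *
    cdef str _decode_str_repr(self, const Utf8Str* string)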


cpdef hash_t hash_string(object string) except -1
cpdef hash_t get_string_id(object string_or_hash) except -1