spaCy/spacy/structs.pxd

from libc.stdint cimport int32_t, int64_t, uint8_t, uint32_t, uint64_t
from libcpp.unordered_map cimport unordered_map
from libcpp.unordered_set cimport unordered_set
from libcpp.vector cimport vector

from .parts_of_speech cimport univ_pos_t
from .typedefs cimport attr_t, flags_t, hash_t


cdef struct LexemeC:
    flags_t flags

    attr_t lang

    attr_t id
    attr_t length

    attr_t orth
    attr_t lower
    attr_t norm
    attr_t shape
    attr_t prefix
    attr_t suffix


cdef struct SpanC:
    hash_t id
    int start
    int end
    int start_char
    int end_char
    attr_t label
    attr_t kb_id


cdef struct TokenC:
    const LexemeC* lex
    uint64_t morph
    univ_pos_t pos
    bint spacy
    attr_t tag
    int idx
    attr_t lemma
    attr_t norm
    int head
    attr_t dep

    uint32_t l_kids
    uint32_t r_kids
    uint32_t l_edge
    uint32_t r_edge

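    int sent_start  # Ternary: 1 = sentence start, -1 = not a sentence start, 0 = unknown
    int ent_iob  # IOB code: 0 = unset, 1 = I (inside), 2 = O (outside), 3 = B (begin)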
    attr_t ent_type  # TODO: Is there a better way to do this? Multiple sources of truth.
    attr_t ent_kb_id
    hash_t ent_id


cdef struct MorphAnalysisC:
    hash_t key
    int length

    attr_t* fields
    attr_t* features


# Internal struct, for storage and disambiguation of entities.
cdef struct KBEntryC:

    # The hash of this entry's unique ID/name in the KB
    hash_t entity_hash

    # Allows retrieval of the entity vector, as an index into a vectors table of the KB.
    # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
    int32_t vector_index

    # Allows retrieval of a struct of non-vector features.
    # This is currently not implemented and set to -1 for the common case where there are no features.
    int32_t feats_row

    # log probability of entity, based on corpus frequency
    float freq


# Each alias struct stores a list of candidate entry indices, together with
# their prior probabilities for this specific mention/alias.
cdef struct AliasC:

    # All entry candidates for this alias
    vector[int64_t] entry_indices

    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs


cdef struct EdgeC:
    hash_t label
    int32_t head
    int32_t tail


cdef struct GraphC:
    vector[vector[int32_t]] nodes
    vector[EdgeC] edges
    vector[float] weights
    vector[int] n_heads
    vector[int] n_tails
    vector[int] first_head
    vector[int] first_tail
    unordered_set[int]* roots
    unordered_map[hash_t, int]* node_map
    unordered_map[hash_t, int]* edge_map