spaCy/spacy/structs.pxd

from libc.stdint cimport uint8_t, uint32_t, int32_t, int64_t, uint64_t
from libcpp.vector cimport vector

from .typedefs cimport flags_t, attr_t, hash_t
from .parts_of_speech cimport univ_pos_t


cdef struct LexemeC:
    flags_t flags

    attr_t lang

    attr_t id
    attr_t length

    attr_t orth
    attr_t lower
    attr_t norm
    attr_t shape
    attr_t prefix
    attr_t suffix

    attr_t cluster

    float prob
    float sentiment


cdef struct SerializedLexemeC:
    unsigned char[8 + 8*10 + 4 + 4] data
    #    sizeof(flags_t)  # flags
    #    + sizeof(attr_t) # lang
    #    + sizeof(attr_t) # id
    #    + sizeof(attr_t) # length
    #    + sizeof(attr_t) # orth
    #    + sizeof(attr_t) # lower
    #    + sizeof(attr_t) # norm
    #    + sizeof(attr_t) # shape
    #    + sizeof(attr_t) # prefix
    #    + sizeof(attr_t) # suffix
    #    + sizeof(attr_t) # cluster
    #    + sizeof(float)  # prob
    #    + sizeof(float)  # sentiment
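
The byte count in the `SerializedLexemeC` comment can be sanity-checked outside Cython. The following is a minimal Python sketch, not spaCy's serialization code. It assumes `flags_t` and `attr_t` are unsigned 64-bit integers (as the `8 + 8*10 + 4 + 4` sizing implies) and uses little-endian packing without padding purely for illustration. `LEXEME_FIELDS`, `LEXEME_FORMAT`, `SERIALIZED_LEXEME_SIZE`, and `pack_lexeme` are hypothetical names.

```python
import struct

# Field order mirrors the LexemeC struct. The type codes are assumptions:
# "Q" = unsigned 64-bit integer (flags_t, attr_t), "f" = 32-bit float.
LEXEME_FIELDS = [
    ("flags", "Q"),
    ("lang", "Q"),
    ("id", "Q"),
    ("length", "Q"),
    ("orth", "Q"),
    ("lower", "Q"),
    ("norm", "Q"),
    ("shape", "Q"),
    ("prefix", "Q"),
    ("suffix", "Q"),
    ("cluster", "Q"),
    ("prob", "f"),
    ("sentiment", "f"),
]

# "<" selects little-endian with no alignment padding (illustrative choice).
LEXEME_FORMAT = "<" + "".join(code for _, code in LEXEME_FIELDS)
SERIALIZED_LEXEME_SIZE = struct.calcsize(LEXEME_FORMAT)


def pack_lexeme(values):
    """Pack a dict of field values into a fixed-size byte string."""
    return struct.pack(LEXEME_FORMAT, *(values[name] for name, _ in LEXEME_FIELDS))
```

Under these assumptions the computed size equals the length of the `data` array: one `flags_t`, ten `attr_t` fields, and two floats.
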


cdef struct SpanC:
    hash_t id
    int start
    int end
    int start_char
    int end_char
    attr_t label
    attr_t kb_id


cdef struct TokenC:
    const LexemeC* lex
    uint64_t morph
    univ_pos_t pos
    bint spacy
    attr_t tag
    int idx
    attr_t lemma
    attr_t norm
    int head
    attr_t dep

    uint32_t l_kids
    uint32_t r_kids
    uint32_t l_edge
    uint32_t r_edge

    int sent_start
    int ent_iob
    attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth..
    attr_t ent_kb_id
    hash_t ent_id


cdef struct MorphAnalysisC:
    hash_t key
    int length
    attr_t* fields
    attr_t* features


# Internal struct, for storage and disambiguation of entities.
cdef struct KBEntryC:

    # The hash of this entry's unique ID/name in the KB
    hash_t entity_hash

    # Allows retrieval of the entity vector, as an index into a vectors table of the KB.
    # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
    int32_t vector_index

    # Allows retrieval of a struct of non-vector features.
    # This is currently not implemented and set to -1 for the common case where there are no features.
    int32_t feats_row

    # log probability of entity, based on corpus frequency
    float freq


# Each alias struct stores a list of entry indices with their prior probabilities
# for this specific mention/alias.
cdef struct AliasC:

    # All entry candidates for this alias
    vector[int64_t] entry_indices

    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs
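
The invariants stated in the `AliasC` comments (two parallel vectors, and prior probabilities summing to at most 1) can be illustrated with a small Python sketch. This is not spaCy's implementation. `Alias`, `validate_alias`, `n_entries`, and `tol` are hypothetical names, and the check that each index falls within an entry table of size `n_entries` is an assumption about how `entry_indices` is used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alias:
    """Plain-Python stand-in for AliasC: two parallel lists."""
    entry_indices: List[int] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)


def validate_alias(alias: Alias, n_entries: int, tol: float = 1e-6) -> None:
    """Raise ValueError if the alias violates the documented invariants."""
    if len(alias.entry_indices) != len(alias.probs):
        raise ValueError("entry_indices and probs must have the same length")
    for idx in alias.entry_indices:
        # Assumption: indices refer to rows of an entry table of size n_entries.
        if not 0 <= idx < n_entries:
            raise ValueError(f"entry index {idx} is out of range")
    for p in alias.probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"prior probability {p} is outside [0, 1]")
    # P(entity|alias) over all candidates should sum to at most 1;
    # tol absorbs floating-point rounding.
    if sum(alias.probs) > 1.0 + tol:
        raise ValueError("prior probabilities sum to more than 1")
```

For example, `validate_alias(Alias([1, 2], [0.6, 0.3]), n_entries=3)` passes silently, while priors of `[0.7, 0.4]` raise a `ValueError`.
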