spaCy/spacy/structs.pxd

from libc.stdint cimport uint8_t, uint32_t, int32_t, int64_t, uint64_t
from libcpp.vector cimport vector

from .typedefs cimport flags_t, attr_t, hash_t
from .parts_of_speech cimport univ_pos_t


cdef struct LexemeC:
    flags_t flags

    attr_t lang

    attr_t id
    attr_t length

    attr_t orth
    attr_t lower
    attr_t norm
    attr_t shape
    attr_t prefix
    attr_t suffix


cdef struct SpanC:
    hash_t id
    int start
    int end
    int start_char
    int end_char
    attr_t label
    attr_t kb_id


cdef struct TokenC:
    const LexemeC* lex
    uint64_t morph
    univ_pos_t pos
    bint spacy
    attr_t tag
    int idx
    attr_t lemma
    attr_t norm
    int head
    attr_t dep

    uint32_t l_kids
    uint32_t r_kids
    uint32_t l_edge
    uint32_t r_edge

    int sent_start
    int ent_iob
    attr_t ent_type  # TODO: Is there a better way to do this? Multiple sources of truth.
    attr_t ent_kb_id
    hash_t ent_id


cdef struct MorphAnalysisC:
    hash_t key
    int length

    attr_t abbr
    attr_t adp_type
    attr_t adv_type
    attr_t animacy
    attr_t aspect
    attr_t case
    attr_t conj_type
    attr_t connegative
    attr_t definite
    attr_t degree
    attr_t derivation
    attr_t echo
    attr_t foreign
    attr_t gender
    attr_t hyph
    attr_t inf_form
    attr_t mood
    attr_t negative
    attr_t number
    attr_t name_type
    attr_t noun_type
    attr_t num_form
    attr_t num_type
    attr_t num_value
    attr_t part_form
    attr_t part_type
    attr_t person
    attr_t polite
    attr_t polarity
    attr_t poss
    attr_t prefix
    attr_t prep_case
    attr_t pron_type
    attr_t punct_side
    attr_t punct_type
    attr_t reflex
    attr_t style
    attr_t style_variant
    attr_t tense
    attr_t typo
    attr_t verb_form
    attr_t voice
    attr_t verb_type
    attr_t* fields
    attr_t* features


# Internal struct, for storage and disambiguation of entities.
cdef struct KBEntryC:

    # The hash of this entry's unique ID/name in the KB
    hash_t entity_hash

    # Allows retrieval of the entity vector, as an index into a vectors table of the KB.
    # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
    int32_t vector_index

    # Allows retrieval of a struct of non-vector features.
    # This is currently not implemented and set to -1 for the common case where there are no features.
    int32_t feats_row

    # Log probability of the entity, based on corpus frequency
    float freq


# Each alias struct stores a list of entry indices with their prior probabilities
# for this specific mention/alias.
cdef struct AliasC:

    # All entry candidates for this alias
    vector[int64_t] entry_indices

    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs
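
# Usage sketch (illustrative only, not part of spaCy): assuming the two vectors
# are kept parallel, candidate i pairs entry_indices[i] with probs[i]. The
# index and probability values below are made up for the example.
#
#     cdef AliasC alias
#     alias.entry_indices.push_back(3)   # hypothetical index of a KB entry
#     alias.probs.push_back(0.6)         # P(entity|alias) for that entry
#     alias.entry_indices.push_back(7)
#     alias.probs.push_back(0.3)         # priors sum to 0.9, i.e. at most 1
#
# Under that assumption, entry_indices.size() == probs.size() always holds.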