spaCy/spacy/morphology.pxd

from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap, PreshMapArray
from libc.stdint cimport uint64_t
from murmurhash cimport mrmr
cimport numpy as np

from .structs cimport TokenC, MorphAnalysisC
from .strings cimport StringStore
from .typedefs cimport hash_t, attr_t, flags_t
from .parts_of_speech cimport univ_pos_t
from . cimport symbols


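cdef class Morphology:
    # Memory pool and string table, readable (but not assignable) from Python.
    cdef readonly Pool mem
    cdef readonly StringStore strings
    cdef PreshMap tags  # Keyed by hash; each value is a pointer to a tag

    # Python-level state. Only `lemmatizer` can be reassigned from Python.
    cdef public object lemmatizer
    cdef readonly object tag_map
    cdef readonly object tag_names
    cdef readonly object reverse_index
    cdef readonly object _exc  # Exceptions in the internal format with tuple keys
    cdef readonly PreshMapArray _cache
    cdef readonly int n_tags

    # `except *`: a struct return value cannot signal an error, so Cython
    # checks for a raised exception after every call.
    cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *
    # `except -1`: a return value of -1 signals a Python exception.
    cdef int insert(self, MorphAnalysisC tag) except -1

    cdef int assign_untagged(self, TokenC* token) except -1
    cdef int assign_tag(self, TokenC* token, tag) except -1
    cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1

    cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1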


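# Module-level helpers that operate on a raw MorphAnalysisC pointer.
# The `nogil` functions can be called without holding the GIL; the others
# return Python objects and therefore need it.
cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil
cdef list list_features(const MorphAnalysisC* morph)
cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field)
cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil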
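
The helper declarations above only give signatures. As a rough illustration of what `check_feature` and `get_n_by_field` might do, here is a pure-Python sketch. It assumes that an analysis stores two parallel arrays, one of field ids and one of `Field=Value` feature ids. That layout, the `ToyStringStore` and `ToyMorphAnalysis` classes, and `make_analysis` are hypothetical names introduced for illustration only; they are not spaCy's definitions, and the real helpers work on C pointers rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class ToyStringStore:
    """Hypothetical stand-in for StringStore: interns strings as integer ids."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def add(self, text: str) -> int:
        # Ids start at 1; an already-seen string keeps its existing id.
        return self._ids.setdefault(text, len(self._ids) + 1)


@dataclass
class ToyMorphAnalysis:
    """Assumed layout: parallel lists, one slot per Field=Value feature."""
    fields: List[int] = field(default_factory=list)
    features: List[int] = field(default_factory=list)


def make_analysis(strings: ToyStringStore, feats: List[str]) -> ToyMorphAnalysis:
    """Build an analysis from strings such as "Case=dat"."""
    morph = ToyMorphAnalysis()
    for feat in feats:
        field_name = feat.split("=", 1)[0]
        morph.fields.append(strings.add(field_name))
        morph.features.append(strings.add(feat))
    return morph


def check_feature(morph: ToyMorphAnalysis, feature: int) -> bool:
    """Return True if the analysis contains the given feature id."""
    return feature in morph.features


def get_n_by_field(results: List[int], morph: ToyMorphAnalysis, field_id: int) -> int:
    """Write every feature of `field_id` into `results`; return how many."""
    n = 0
    for current_field, feature in zip(morph.fields, morph.features):
        if current_field == field_id:
            results[n] = feature
            n += 1
    return n


strings = ToyStringStore()
morph = make_analysis(strings, ["Case=dat", "Case=gen", "Number=sing"])
results = [0] * len(morph.features)  # caller-provided output buffer
n = get_n_by_field(results, morph, strings.add("Case"))
```

The caller-provided `results` buffer mirrors the `attr_t* results` parameter in the declaration: a function that only fills a preallocated buffer and returns a count has no need to create Python objects, which is consistent with it being declared `nogil`.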

Blame history for spaCy/spacy/morphology.pxd (oldest commit first; each entry lists the lines it last touched):

2015-08-26 20:17:35 +03:00  Comment out enums in Morphology for now
    `cdef class Morphology:`

2015-08-27 10:16:11 +03:00  Tagger training now working. Still need to test load/save of model. Morphology still broken.
    `from .strings cimport StringStore`
    `cdef public object lemmatizer`

2015-08-28 03:02:33 +03:00  More work on language-generic parsing
    `from cymem.cymem cimport Pool`
    `from libc.stdint cimport uint64_t`
    `from .parts_of_speech cimport univ_pos_t`
    `cdef readonly Pool mem`
    `cdef int assign_tag(self, TokenC* token, tag) except -1`

2015-08-28 04:44:54 +03:00  More work on language independent parsing
    `cdef readonly StringStore strings`

2015-10-10 14:10:58 +03:00  Add morphology symbols to morphology. May need to remove these as an enum.
    `from . cimport symbols`

2015-10-12 07:27:47 +03:00  Ensure Morphology can be pickled, to address Issue #125.
    `cdef readonly object tag_map`

2016-11-04 21:19:09 +03:00  Fix morphology tagger
    `cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1`

2017-10-11 04:22:49 +03:00  Add assign_untagged method in Morphology
    `cdef int assign_untagged(self, TokenC* token) except -1`

2018-09-25 00:57:41 +03:00  WIP on supporting morphology features
    `from murmurhash cimport mrmr`
    `from .typedefs cimport hash_t, attr_t, flags_t`
    `cdef PreshMap tags # Keyed by hash, value is pointer to tag`

2018-09-25 01:35:59 +03:00  Restore previous morphology stuff
    `from preshed.maps cimport PreshMap, PreshMapArray`
    `cdef readonly object tag_names`
    `cdef readonly object reverse_index`
    `cdef readonly int n_tags`

2018-09-25 11:57:33 +03:00  Fix morphology
    `cdef readonly PreshMapArray _cache`

2018-09-25 16:18:21 +03:00  Fix exception loading
    `cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1`

2019-03-07 16:03:07 +03:00  Add MorphAnalysisC struct
    `from .structs cimport TokenC, MorphAnalysisC`

2019-03-07 20:33:06 +03:00  Export helper functions for morphology
    (blank line only)

2020-01-24 00:01:54 +03:00  Modify morphology to support arbitrary features (#4932)
    `cimport numpy as np`
    `cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *`
    `cdef int insert(self, MorphAnalysisC tag) except -1`
    `cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil`
    `cdef list list_features(const MorphAnalysisC* morph)`
    `cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field)`
    `cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil`

    Full commit message:

    * Restructure tag maps for MorphAnalysis changes

      Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features.

      * Use default tag map rather than duplicating for ca / uk / vi
      * Import tag map into defaults for ga
      * Modify tag maps so all morphological fields and features are strings
        * Move features from `"Other"` to the top level
        * Rewrite tuples as strings separated by `","`
      * Rewrite morph symbols for fr lemmatizer as strings
      * Export MorphAnalysis under spacy.tokens

    * Modify morphology to support arbitrary features

      Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported.

      * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain:
        * key: hash of UD FEATS string of morphological features
        * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to:
          * find features by field
          * represent multiple values for a given field
      * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field.
      * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`.
      * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string.
      * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where:
        * Each field has one entry in the dict
        * Multiple values remain separated by a separator in the value string
      * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict.

    * Modify get_by_field to use np.ndarray

      Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields.

    * Rewrite without MorphFeatureC

    * Add shortcut for existing feats strings as keys

      Add shortcut for existing feats strings as keys in `Morphology.add()`.

    * Check for '_' as empty analysis when adding morphs

    * Extend helper converters in Morphology

      Add and extend helper converters that convert and normalize between:

      * UD FEATS strings (`"Case=dat,gen|Number=sing"`)
      * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`)
      * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`)

      All converters sort fields and values where applicable.

2020-03-02 13:48:10 +03:00  Tidy up compiler flags and imports (#5071)
    (blank lines only)

2020-07-16 22:16:49 +03:00  Update Morphology to load exceptions as MORPH_RULES
    `cdef readonly object _exc`

    Full commit message:

    Update `Morphology` to load exceptions in `Morphology.__init__` and `Morphology.load_morph_exceptions` from the format used in `MORPH_RULES` rather than the internal format with tuple keys.

    * Rename `Morphology.exc` to `Morphology._exc` for internal use with tuple keys
    * Add `Morphology.exc` as a property that converts the internal `_exc` back to `MORPH_RULES` format, primarily for serialization
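
The #4932 commit message describes converters between three representations of the same analysis: a UD FEATS string, a per-field dict, and a list of individual features. The following is a minimal Python sketch of such converters, assuming the separators shown in the commit message's examples (`|`, `=`, `,`) and `_` for an empty analysis. The function names `dict_to_list` and `list_to_feats`, the constants, and the use of plain `sorted()` are assumptions for illustration; this is not spaCy's implementation, and only `feats_to_dict` is a name that appears in the commit message.

```python
from typing import Dict, List

FEATURE_SEP = "|"  # separates Field=Value groups in a UD FEATS string
FIELD_SEP = "="    # separates a field from its value(s)
VALUE_SEP = ","    # separates multiple values of one field
EMPTY_MORPH = "_"  # empty analysis


def feats_to_dict(feats: str) -> Dict[str, str]:
    """UD FEATS string -> dict with one entry per field.

    Multiple values stay joined by VALUE_SEP in the value string.
    """
    if not feats or feats == EMPTY_MORPH:
        return {}
    result = {}
    for group in feats.split(FEATURE_SEP):
        name, values = group.split(FIELD_SEP, 1)
        result[name] = VALUE_SEP.join(sorted(values.split(VALUE_SEP)))
    # Rebuild the dict so that fields are in sorted order.
    return dict(sorted(result.items()))


def dict_to_list(feats_dict: Dict[str, str]) -> List[str]:
    """Per-field dict -> sorted list of individual Field=Value features."""
    return [
        f"{name}{FIELD_SEP}{value}"
        for name in sorted(feats_dict)
        for value in sorted(feats_dict[name].split(VALUE_SEP))
    ]


def list_to_feats(features: List[str]) -> str:
    """List of individual features -> normalized UD FEATS string."""
    if not features:
        return EMPTY_MORPH
    grouped: Dict[str, set] = {}
    for feat in features:
        name, value = feat.split(FIELD_SEP, 1)
        grouped.setdefault(name, set()).add(value)
    return FEATURE_SEP.join(
        f"{name}{FIELD_SEP}{VALUE_SEP.join(sorted(values))}"
        for name, values in sorted(grouped.items())
    )


as_dict = feats_to_dict("Number=sing|Case=gen,dat")
as_list = dict_to_list(as_dict)
as_feats = list_to_feats(as_list)
```

Because every converter sorts fields and values, differently ordered inputs normalize to the same FEATS string, so that string can serve as a stable key for an analysis.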