spaCy/spacy/tests/doc/test_morphanalysis.py

import pytest
from spacy.symbols import POS, PRON, VERB


@pytest.fixture
def i_has(en_tokenizer):
    """Tokenize "I has" and tag both tokens using a small custom tag map."""
    doc = en_tokenizer("I has")
    tag_map = {
        "PRP": {POS: PRON, "PronType": "prs"},
        "VBZ": {
            POS: VERB,
            "VerbForm": "fin",
            "Tense": "pres",
            "Number": "sing",
            "Person": "three",
        },
    }
    en_tokenizer.vocab.morphology.load_tag_map(tag_map)
    doc[0].tag_ = "PRP"
    doc[1].tag_ = "VBZ"
    return doc


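def test_token_morph_eq(i_has):
    # Each access to Token.morph yields a distinct object, so identity differs
    # while equality still holds for the same analysis.
    assert i_has[0].morph is not i_has[0].morph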
    assert i_has[0].morph == i_has[0].morph
    assert i_has[0].morph != i_has[1].morph


def test_token_morph_key(i_has):
    assert i_has[0].morph.key != 0
    assert i_has[1].morph.key != 0
    assert i_has[0].morph.key == i_has[0].morph.key
    assert i_has[0].morph.key != i_has[1].morph.key


def test_morph_props(i_has):
    assert i_has[0].morph.get("PronType") == ["PronType=prs"]
    assert i_has[1].morph.get("PronType") == []


def test_morph_iter(i_has):
    assert set(i_has[0].morph) == {"PronType=prs"}
    assert set(i_has[1].morph) == {
        "Number=sing",
        "Person=three",
        "Tense=pres",
        "VerbForm=fin",
    }


def test_morph_get(i_has):
    assert i_has[0].morph.get("PronType") == ["PronType=prs"]


def test_morph_set(i_has):
    assert i_has[0].morph.get("PronType") == ["PronType=prs"]
    # set by string
    i_has[0].morph_ = "PronType=unk"
    assert i_has[0].morph.get("PronType") == ["PronType=unk"]
    # set by string, fields are alphabetized
    i_has[0].morph_ = "PronType=123|NounType=unk"
    assert i_has[0].morph_ == "NounType=unk|PronType=123"
    # set by dict
    i_has[0].morph_ = {"AType": "123", "BType": "unk", "POS": "ADJ"}
    assert i_has[0].morph_ == "AType=123|BType=unk|POS=ADJ"
    # set by string with multiple values, fields and values are alphabetized
    i_has[0].morph_ = "BType=c|AType=b,a"
    assert i_has[0].morph_ == "AType=a,b|BType=c"
    # set by dict with multiple values, fields and values are alphabetized
    i_has[0].morph_ = {"AType": "b,a", "BType": "c"}
    assert i_has[0].morph_ == "AType=a,b|BType=c"


def test_morph_str(i_has):
    assert str(i_has[0].morph) == "PronType=prs"
    assert str(i_has[1].morph) == "Number=sing|Person=three|Tense=pres|VerbForm=fin"