spaCy/website/docs/api/morphology.md

255 lines
9.5 KiB
Markdown
Raw Normal View History

---
title: Morphology
tag: class
source: spacy/morphology.pyx
---
2020-07-29 12:36:42 +03:00
Store the possible morphological analyses for a language, and index them by
hash. To save space on each token, tokens only know the hash of their
morphological analysis, so queries of morphological attributes are delegated to
2020-08-21 14:22:59 +03:00
this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the
2020-08-17 17:45:24 +03:00
container storing a single morphological analysis.
## Morphology.\_\_init\_\_ {#init tag="method"}
Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs
2020-08-07 16:27:13 +03:00
Create a Morphology object.
> #### Example
>
> ```python
> from spacy.morphology import Morphology
>
Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs
2020-08-07 16:27:13 +03:00
> morphology = Morphology(strings)
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| --------- | --------------------------------- |
| `strings` | The string store. ~~StringStore~~ |
## Morphology.add {#add tag="method"}
2020-07-29 12:36:42 +03:00
Insert a morphological analysis in the morphology table, if not already present.
2020-08-17 17:45:24 +03:00
The morphological analysis may be provided in the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
format as a string or in the tag map dictionary format. Returns the hash of the
new analysis.
> #### Example
>
> ```python
> feats = "Feat1=Val1|Feat2=Val2"
> hash = nlp.vocab.morphology.add(feats)
> assert hash == nlp.vocab.strings[feats]
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ---------- | ------------------------------------------------ |
| `features` | The morphological features. ~~Union[Dict, str]~~ |
## Morphology.get {#get tag="method"}
> #### Example
>
> ```python
> feats = "Feat1=Val1|Feat2=Val2"
> hash = nlp.vocab.morphology.add(feats)
> assert nlp.vocab.morphology.get(hash) == feats
> ```
2020-08-17 17:45:24 +03:00
Get the
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
string for the hash of the morphological analysis.
2020-08-17 17:45:24 +03:00
| Name | Description |
| ------- | ----------------------------------------------- |
| `morph` | The hash of the morphological analysis. ~~int~~ |
## Morphology.feats_to_dict {#feats_to_dict tag="staticmethod"}
2020-08-17 17:45:24 +03:00
Convert a string
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
representation to a dictionary of features and values in the same format as the
tag map.
> #### Example
>
> ```python
> from spacy.morphology import Morphology
> d = Morphology.feats_to_dict("Feat1=Val1|Feat2=Val2")
> assert d == {"Feat1": "Val1", "Feat2": "Val2"}
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `feats` | The morphological features in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| **RETURNS** | The morphological features as a dictionary. ~~Dict[str, str]~~ |
## Morphology.dict_to_feats {#dict_to_feats tag="staticmethod"}
2020-08-17 17:45:24 +03:00
Convert a dictionary of features and values to a string
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
representation.
> #### Example
>
> ```python
> from spacy.morphology import Morphology
> f = Morphology.dict_to_feats({"Feat1": "Val1", "Feat2": "Val2"})
> assert f == "Feat1=Val1|Feat2=Val2"
> ```
2020-08-17 17:45:24 +03:00
| Name | Description |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `feats_dict` | The morphological features as a dictionary. ~~Dict[str, str]~~ |
| **RETURNS** | The morphological features as in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
## Attributes {#attributes}
2020-08-17 17:45:24 +03:00
| Name | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"}
Stores a single morphological analysis.
### MorphAnalysis.\_\_init\_\_ {#morphanalysis-init tag="method"}
Initialize a MorphAnalysis object from a Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
string or a dictionary of morphological features.
> #### Example
>
> ```python
> from spacy.tokens import MorphAnalysis
>
> feats = "Feat1=Val1|Feat2=Val2"
> m = MorphAnalysis(nlp.vocab, feats)
> ```
| Name | Description |
| ---------- | ---------------------------------------------------------- |
| `vocab` | The vocab. ~~Vocab~~ |
| `features` | The morphological features. ~~Union[Dict[str, str], str]~~ |
### MorphAnalysis.\_\_contains\_\_ {#morphanalysis-contains tag="method"}
Whether a feature/value pair is in the analysis.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val2|Feat2=Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert "Feat1=Val1" in morph
> ```
| Name | Description |
| ----------- | --------------------------------------------- |
| **RETURNS** | A feature/value pair in the analysis. ~~str~~ |
### MorphAnalysis.\_\_iter\_\_ {#morphanalysis-iter tag="method"}
Iterate over the feature/value pairs in the analysis.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val3|Feat2=Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert list(morph) == ["Feat1=Va1", "Feat1=Val3", "Feat2=Val2"]
> ```
| Name | Description |
| ---------- | --------------------------------------------- |
| **YIELDS** | A feature/value pair in the analysis. ~~str~~ |
### MorphAnalysis.\_\_len\_\_ {#morphanalysis-len tag="method"}
Returns the number of features in the analysis.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val2|Feat2=Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert len(morph) == 3
> ```
| Name | Description |
| ----------- | ----------------------------------------------- |
| **RETURNS** | The number of features in the analysis. ~~int~~ |
### MorphAnalysis.\_\_str\_\_ {#morphanalysis-str tag="method"}
Returns the morphological analysis in the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
string format.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val2|Feat2=Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert str(morph) == feats
> ```
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| **RETURNS** | The analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
### MorphAnalysis.get {#morphanalysis-get tag="method"}
Retrieve values for a feature by field.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert morph.get("Feat1") == ["Val1", "Val2"]
> ```
| Name | Description |
| ----------- | ------------------------------------------------ |
| `field` | The field to retrieve. ~~str~~ |
| **RETURNS** | A list of the individual features. ~~List[str]~~ |
### MorphAnalysis.to_dict {#morphanalysis-to_dict tag="method"}
Produce a dict representation of the analysis, in the same format as the tag
map.
> #### Example
>
> ```python
> feats = "Feat1=Val1,Val2|Feat2=Val2"
> morph = MorphAnalysis(nlp.vocab, feats)
> assert morph.to_dict() == {"Feat1": "Val1,Val2", "Feat2": "Val2"}
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------- |
| **RETURNS** | The dict representation of the analysis. ~~Dict[str, str]~~ |
### MorphAnalysis.from_id {#morphanalysis-from_id tag="classmethod"}
Create a morphological analysis from a given hash ID.
> #### Example
>
> ```python
> feats = "Feat1=Val1|Feat2=Val2"
> hash = nlp.vocab.strings[feats]
> morph = MorphAnalysis.from_id(nlp.vocab, hash)
> assert str(morph) == feats
> ```
| Name | Description |
| ------- | ---------------------------------------- |
| `vocab` | The vocab. ~~Vocab~~ |
| `key` | The hash of the features string. ~~int~~ |