spaCy/morphology.md at 46bc513a4e85d75a41f31bc07d7207bdd403595c

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 10:56:31 +03:00

Add Lemmatizer and simplify related components (#5848)

* Add Lemmatizer and simplify related components

* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests

Differences:

* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map

* Fix test

* Initial fix for Lemmatizer config/serialization

* Adjust init test to be more generic

* Adjust init test to force empty Lookups

* Add simple cache to rule-based lemmatizer

* Convert language-specific lemmatizers

Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.

* Fix French and Polish lemmatizers

* Remove outdated UPOS conversions

* Update Russian lemmatizer init in tests

* Add minimal init/run tests for custom lemmatizers

* Add option to overwrite existing lemmas

* Update mode setting, lookup loading, and caching

* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods

* Implement strict when lang is not found in lookups

* Fix tables/lookups in make_lemmatizer

* Reallow provided lookups and allow for stricter checks

* Add lookups asset to all Lemmatizer pipe tests

* Rename lookups in lemmatizer init test

* Clean up merge

* Refactor lookup table loading

* Add helper `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.

Additional slight refactor of lookups:

* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`

* Move registry assets into test methods

* Refactor lookups tables config

Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.

* Add pipe and score to lemmatizer

* Simplify Tagger.score

* Add missing import

* Clean up imports and auto-format

* Remove unused kwarg

* Tidy up and auto-format

* Update docstrings for Lemmatizer

Update docstrings for Lemmatizer.

Additionally modify `is_base_form` API to take `Token` instead of
individual features.

* Update docstrings

* Remove tag map values from Tagger.add_label

* Update API docs

* Fix relative link in Lemmatizer API docs

2020-08-07 15:27:13 +02:00

3.5 KiB


title	tag	source
Morphology	class	spacy/morphology.pyx

Store the possible morphological analyses for a language, and index them by hash. To save space on each token, tokens only know the hash of their morphological analysis, so queries of morphological attributes are delegated to this class.
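The hash-indexing idea can be illustrated with a small standalone sketch. It is not spaCy's implementation: `ToyMorphologyTable` and `stable_hash` are hypothetical names, and the hash function is an arbitrary stand-in. It only shows why storing one integer per token is enough when a shared table can resolve that integer back to the full analysis.

```python
import hashlib
from typing import Dict


def stable_hash(text: str) -> int:
    """Hypothetical 64-bit string hash; spaCy uses its own hash function."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyMorphologyTable:
    """Toy stand-in for a table of analyses indexed by hash."""

    def __init__(self) -> None:
        self._analyses: Dict[int, str] = {}

    def add(self, feats: str) -> int:
        # Insert the analysis only if it is not already present,
        # and return its hash either way.
        key = stable_hash(feats)
        self._analyses.setdefault(key, feats)
        return key

    def get(self, key: int) -> str:
        # Resolve a hash back to the FEATS string it stands for.
        return self._analyses[key]


table = ToyMorphologyTable()
# Two "tokens" with the same analysis hold the same integer,
# so the string itself is stored only once, in the table.
token_a = table.add("Number=Sing|Person=3")
token_b = table.add("Number=Sing|Person=3")
```

Any query about a token's features then goes through the table (`table.get(token_a)`), which mirrors how attribute queries are delegated to the `Morphology` class.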

Morphology.__init__

Create a Morphology object.

Example

from spacy.morphology import Morphology
from spacy.strings import StringStore

strings = StringStore()
morphology = Morphology(strings)

Name	Type	Description
`strings`	`StringStore`	The string store.

Morphology.add

Insert a morphological analysis in the morphology table, if not already present. The morphological analysis may be provided in the UD FEATS format as a string or in the tag map dictionary format. Returns the hash of the new analysis.

Example

feats = "Feat1=Val1|Feat2=Val2"
hash = nlp.vocab.morphology.add(feats)
assert hash == nlp.vocab.strings[feats]

Name	Type	Description
`features`	`Union[Dict, str]`	The morphological features.
RETURNS	int	The hash of the morphological analysis.

Morphology.get

Get the FEATS string for the hash of the morphological analysis.

Example

feats = "Feat1=Val1|Feat2=Val2"
hash = nlp.vocab.morphology.add(feats)
assert nlp.vocab.morphology.get(hash) == feats

Name	Type	Description
`morph`	int	The hash of the morphological analysis.
RETURNS	str	The FEATS string for the morphological analysis.

Morphology.feats_to_dict

Convert a string FEATS representation to a dictionary of features and values in the same format as the tag map.

Example

from spacy.morphology import Morphology
d = Morphology.feats_to_dict("Feat1=Val1|Feat2=Val2")
assert d == {"Feat1": "Val1", "Feat2": "Val2"}

Name	Type	Description
`feats`	str	The morphological features in Universal Dependencies FEATS format.
RETURNS	dict	The morphological features as a dictionary.

Morphology.dict_to_feats

Convert a dictionary of features and values to a string FEATS representation.

Example

from spacy.morphology import Morphology
f = Morphology.dict_to_feats({"Feat1": "Val1", "Feat2": "Val2"})
assert f == "Feat1=Val1|Feat2=Val2"

Name	Type	Description
`feats_dict`	`Dict[str, str]`	The morphological features as a dictionary.
RETURNS	str	The morphological features in Universal Dependencies FEATS format.

Attributes

Name	Type	Description
`FEATURE_SEP`	`str`	The FEATS feature separator. Default is `|`.
`FIELD_SEP`	`str`	The FEATS field separator. Default is `=`.
`VALUE_SEP`	`str`	The FEATS value separator. Default is `,`.
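To show how the three separators fit together, here is a minimal pure-Python sketch. The constants mirror the defaults in the table above; `split_feats` and `join_feats` are hypothetical helpers written for illustration and are not part of spaCy. The sample FEATS string with two `Case` values is likewise an assumed example.

```python
from typing import Dict, List

FEATURE_SEP = "|"  # separates one feature from the next
FIELD_SEP = "="    # separates a feature name from its value(s)
VALUE_SEP = ","    # separates multiple values of a single feature


def split_feats(feats: str) -> Dict[str, List[str]]:
    """Split a FEATS string into {feature name: list of values}."""
    parsed: Dict[str, List[str]] = {}
    for feature in feats.split(FEATURE_SEP):
        name, values = feature.split(FIELD_SEP, 1)
        parsed[name] = values.split(VALUE_SEP)
    return parsed


def join_feats(parsed: Dict[str, List[str]]) -> str:
    """Rebuild a FEATS string, with features sorted by name."""
    return FEATURE_SEP.join(
        name + FIELD_SEP + VALUE_SEP.join(values)
        for name, values in sorted(parsed.items())
    )


parsed = split_feats("Case=Acc,Dat|Number=Sing")
rebuilt = join_feats(parsed)
```

Splitting on `FEATURE_SEP` first, then `FIELD_SEP`, then `VALUE_SEP` works because each separator only has meaning inside the unit produced by the previous split.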
