| title | teaser | tag | source |
|---|---|---|---|
| Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py |
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
Lemmatizer.__init__
Initialize a Lemmatizer. Typically, this happens under the hood within spaCy
when a Language subclass and its Vocab is initialized.
Example

    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
    lemmatizer = Lemmatizer(lookups)

For examples of the data format, see the
spacy-lookups-data repo.
| Name | Type | Description |
|---|---|---|
| lookups (new in v2.2) | Lookups | The lookups object containing the (optional) tables "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". |
| RETURNS | Lemmatizer | The newly created object. |
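
To make the roles of the four tables more concrete, here is a minimal pure-Python sketch of how rules, an index, exceptions and a lookup table could be combined. It is an illustration under stated assumptions, not spaCy's implementation: the `toy_tables` contents, their shapes, and the `toy_lemmatize` function are hypothetical, and the real data format should be checked against spacy-lookups-data.

```python
# Hypothetical miniature tables; shapes are assumed for illustration only.
toy_tables = {
    "lemma_rules": {"noun": [["s", ""]]},        # POS -> [old_suffix, new_suffix] rules
    "lemma_index": {"noun": ["duck", "mouse"]},  # POS -> known base forms
    "lemma_exc": {"noun": {"mice": ["mouse"]}},  # POS -> irregular form -> lemmas
    "lemma_lookup": {"going": "go"},             # POS-independent fallback
}


def toy_lemmatize(string, univ_pos, tables):
    """Return candidate lemmas for `string` (simplified sketch)."""
    pos = univ_pos.lower()
    word = string.lower()
    rules = tables.get("lemma_rules", {}).get(pos, [])
    index = tables.get("lemma_index", {}).get(pos, [])
    exceptions = tables.get("lemma_exc", {}).get(pos, {})

    # Irregular forms are resolved first in this sketch.
    if word in exceptions:
        return list(exceptions[word])

    known, unknown = [], []
    for old, new in rules:
        if word.endswith(old):
            form = word[: len(word) - len(old)] + new
            if form:
                # Prefer candidates that appear in the index of base forms.
                (known if form in index else unknown).append(form)
    if known or unknown:
        return known or unknown

    # No rule applied: fall back to the lookup table, then to the input.
    return [tables.get("lemma_lookup", {}).get(string, string)]
```

With these toy tables, `toy_lemmatize("ducks", "NOUN", toy_tables)` yields `["duck"]` via the suffix rule, while `"mice"` is resolved through the exceptions table.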
Lemmatizer.__call__
Lemmatize a string.
Example

    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
    lemmatizer = Lemmatizer(lookups)
    lemmas = lemmatizer("ducks", "NOUN")
    assert lemmas == ["duck"]

| Name | Type | Description |
|---|---|---|
| string | str | The string to lemmatize, e.g. the token text. |
| univ_pos | str / int | The token's universal part-of-speech tag. |
| morphology | dict / None | Morphological features following the Universal Dependencies scheme. |
| RETURNS | list | The available lemmas for the string. |
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a
lookup table via the Lookups.
Example
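
    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_lookup", {"going": "go"})
    # Create the lemmatizer from the lookups so the table is actually used.
    lemmatizer = Lemmatizer(lookups)
    assert lemmatizer.lookup("going") == "go"
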
| Name | Type | Description |
|---|---|---|
| string | str | The string to look up. |
| orth | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to None. |
| RETURNS | str | The lemma if the string was found, otherwise the original string. |
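
The fallback behaviour and the optional `orth` argument can be sketched with a small hash-keyed table. This is a simplified illustration, not spaCy's code: `toy_hash` is a stand-in hash function and `ToyLookupTable` is a hypothetical class introduced only for this example.

```python
import hashlib


def toy_hash(string):
    """Stand-in 64-bit string hash (an assumption, not spaCy's hash function)."""
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyLookupTable:
    """Hypothetical lookup table keyed by string hashes."""

    def __init__(self, data):
        self._data = {toy_hash(key): value for key, value in data.items()}

    def lookup(self, string, orth=None):
        # Use the precomputed hash if given, otherwise hash the string.
        key = toy_hash(string) if orth is None else orth
        # Fall back to the original string when no lemma is stored.
        return self._data.get(key, string)


table = ToyLookupTable({"going": "go"})
```

Passing a precomputed hash, as in `table.lookup("going", orth=toy_hash("going"))`, avoids hashing the string a second time; an unknown string such as `"walking"` is returned unchanged.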
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example

    pos = "verb"
    morph = {"VerbForm": "inf"}
    is_base_form = lemmatizer.is_base_form(pos, morph)
    assert is_base_form

| Name | Type | Description |
|---|---|---|
| univ_pos | str / int | The token's universal part-of-speech tag. |
| morphology | dict | The token's morphological features. |
| RETURNS | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
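
As a rough illustration of the idea, a base-form check can be written as a few comparisons on the part-of-speech tag and morphological features. The function below is a hypothetical sketch covering only a handful of assumed cases; it is not the library's full logic, and the lowercase feature values simply follow the example above.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Return True if the tag and features suggest an uninflected form.

    Illustrative subset of checks only; the cases are assumptions.
    """
    morphology = morphology or {}
    pos = univ_pos.lower()
    if pos == "noun" and morphology.get("Number") == "sing":
        return True  # singular noun
    if pos == "verb" and morphology.get("VerbForm") == "inf":
        return True  # infinitive verb
    if pos == "adj" and morphology.get("Degree") == "pos":
        return True  # positive-degree adjective
    return False
```

A caller could use such a check to return the lowercased input directly and skip the rule tables whenever it evaluates to `True`.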
Attributes
| Name | Type | Description |
|---|---|---|
| lookups (new in v2.2) | Lookups | The lookups object containing the rules and data, if available. |