| title | teaser | tag | source |
|---|---|---|---|
| Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py |
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
Lemmatizer.__init__
Initialize a Lemmatizer. Typically, this happens under the hood within spaCy
when a Language subclass and its Vocab is initialized.
Example
```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
```

For examples of the data format, see the spacy-lookups-data repo.
| Name | Type | Description |
|---|---|---|
| lookups (v2.2) | Lookups | The lookups object containing the (optional) tables "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". |
| RETURNS | Lemmatizer | The newly created object. |
As of v2.2, the lemmatizer is initialized with a Lookups
object containing tables for the different components. This makes it easier for
spaCy to share and serialize rules and lookup tables via the Vocab, and allows
users to modify lemmatizer data at runtime by updating nlp.vocab.lookups.
- lemmatizer = Lemmatizer(rules=lemma_rules)
+ lemmatizer = Lemmatizer(lookups)
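To illustrate why keeping the tables in one shared container is convenient, here is a minimal sketch in plain Python. It does not use spaCy: `TableRegistry` and `LookupLemmatizer` are hypothetical names invented for this example, and the only point is that a component reading from a shared registry on each call picks up runtime changes to a table without being re-created.

```python
class TableRegistry:
    """Hypothetical stand-in for a container of named lookup tables."""

    def __init__(self):
        self._tables = {}

    def add_table(self, name, data=None):
        # Store a copy so later edits go through the registry's own table.
        self._tables[name] = dict(data or {})
        return self._tables[name]

    def get_table(self, name, default=None):
        return self._tables.get(name, default)


class LookupLemmatizer:
    """Hypothetical component that reads the shared table on every call."""

    def __init__(self, registry):
        self.registry = registry

    def lookup(self, string):
        table = self.registry.get_table("lemma_lookup", {})
        return table.get(string, string)


registry = TableRegistry()
registry.add_table("lemma_lookup", {"going": "go"})
lemmatizer = LookupLemmatizer(registry)

before = lemmatizer.lookup("went")  # no entry yet, string comes back unchanged
registry.get_table("lemma_lookup")["went"] = "go"  # runtime update
after = lemmatizer.lookup("went")  # same object now sees the new entry
print(before, after)
```

Because the component holds a reference to the registry rather than a copy of its data, the update is visible immediately, which is the behaviour the paragraph above describes for `nlp.vocab.lookups`.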
Lemmatizer.__call__
Lemmatize a string.
Example
```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]
```
| Name | Type | Description |
|---|---|---|
| string | str | The string to lemmatize, e.g. the token text. |
| univ_pos | str / int | The token's universal part-of-speech tag. |
| morphology | dict / None | Morphological features following the Universal Dependencies scheme. |
| RETURNS | list | The available lemmas for the string. |
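As a rough illustration of how part-of-speech-sensitive suffix rules can yield a list of candidate lemmas, here is a minimal pure-Python sketch. It is not spaCy's implementation: the helper name `apply_suffix_rules` and the fallback to the lowercased input are assumptions made for this example. Only the rule format, a table keyed by lowercase POS name holding `[old_suffix, new_suffix]` pairs, is taken from the "lemma_rules" example above.

```python
def apply_suffix_rules(string, univ_pos, rules_table):
    """Return candidate lemmas for `string` using the suffix rules for its POS.

    Hypothetical helper for illustration; not part of spaCy.
    """
    string = string.lower()
    # Rules are keyed by lowercase POS name, as in the "lemma_rules" example.
    rules = rules_table.get(univ_pos.lower(), [])
    lemmas = []
    for old_suffix, new_suffix in rules:
        if string.endswith(old_suffix):
            candidate = string[: len(string) - len(old_suffix)] + new_suffix
            # Skip empty results and duplicates, keeping rule order.
            if candidate and candidate not in lemmas:
                lemmas.append(candidate)
    # Assumption: with no matching rule, fall back to the lowercased input.
    return lemmas or [string]


rules_table = {"noun": [["s", ""]]}
print(apply_suffix_rules("ducks", "NOUN", rules_table))
print(apply_suffix_rules("ducks", "VERB", rules_table))
```

The first call strips the plural "s" because a noun rule matches. The second returns the input unchanged because the table holds no verb rules, which shows why the result depends on the POS tag and why the return type is a list.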
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a
lookup table via the Lookups.
Example
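```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go"})
# Build the lemmatizer from the lookups so the table is actually used
lemmatizer = Lemmatizer(lookups)
assert lemmatizer.lookup("going") == "go"
```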
| Name | Type | Description |
|---|---|---|
| string | str | The string to look up. |
| orth | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to None. |
| RETURNS | str | The lemma if the string was found, otherwise the original string. |
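The description of `orth` implies that entries can be found by a string's hash as well as by the string itself. A minimal sketch of that idea in plain Python, assuming a hash-keyed dictionary: `hash_string` is a stand-in built on `hashlib` (spaCy uses its own string hashing), and `lookup_lemma` is a hypothetical helper, not the library's method.

```python
import hashlib


def hash_string(string):
    """Stand-in 64-bit string hash for illustration (not spaCy's hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def lookup_lemma(table, string, orth=None):
    """Return the lemma stored under the string's hash, else the string."""
    # If no precomputed hash is passed in, hash the string now.
    key = hash_string(string) if orth is None else orth
    return table.get(key, string)


# Build a hash-keyed table from a readable string pair.
table = {hash_string("going"): "go"}

print(lookup_lemma(table, "going"))
print(lookup_lemma(table, "going", orth=hash_string("going")))
print(lookup_lemma(table, "table"))
```

Passing a precomputed hash avoids hashing the same string twice when the caller already has it, for example from a token's stored ID. A string with no entry comes back unchanged.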
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example
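```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

# Assumes the v2.2 API, where an empty Lookups object is enough for this check
lemmatizer = Lemmatizer(Lookups())
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form == True
```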
| Name | Type | Description |
|---|---|---|
| univ_pos | str / int | The token's universal part-of-speech tag. |
| morphology | dict | The token's morphological features. |
| RETURNS | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
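The idea behind this check can be sketched in a few lines of plain Python. The names `looks_like_base_form` and `lemmatize_or_skip` and the rule set are assumptions for illustration: only the infinitive-verb case comes from the example above, the singular-noun case is an assumed rule of the same kind, and a real implementation may cover more or different cases.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Return True if POS and features already describe an uninflected form.

    Hypothetical helper; the rule set is illustrative, not exhaustive.
    """
    morphology = morphology or {}
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True  # infinitive verb, as in the example above
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True  # assumed rule: a singular noun is already a base form
    return False


def lemmatize_or_skip(string, univ_pos, morphology=None):
    """Skip suffix stripping entirely when the form is already a base form."""
    string = string.lower()
    if looks_like_base_form(univ_pos, morphology):
        return [string]
    # Placeholder for real rule application: strip a trailing "s" only.
    if string.endswith("s"):
        return [string[:-1]]
    return [string]


print(lemmatize_or_skip("glass", "noun", {"Number": "sing"}))
print(lemmatize_or_skip("glass", "noun"))
```

With the morphological features present, "glass" is recognized as singular and returned untouched. Without them, the naive suffix rule strips the final "s", which is the kind of over-lemmatization the check is meant to prevent.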
Attributes
| Name | Type | Description |
|---|---|---|
| lookups (v2.2) | Lookups | The lookups object containing the rules and data, if available. |