---
title: Lemmatizer
teaser: Assign the base forms of words
tag: class
source: spacy/lemmatizer.py
---
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
## Lemmatizer.__init__

Initialize a `Lemmatizer`. Typically, this happens under the hood within spaCy
when a `Language` subclass and its `Vocab` are initialized.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
```

For examples of the data format, see the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) repo.
| Name | Type | Description |
| --- | --- | --- |
| `lookups` (v2.2) | `Lookups` | The lookups object containing the (optional) tables `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. |
| **RETURNS** | `Lemmatizer` | The newly created object. |
As of v2.2, the lemmatizer is initialized with a `Lookups` object containing
tables for the different components. This makes it easier for spaCy to share
and serialize rules and lookup tables via the `Vocab`, and allows users to
modify lemmatizer data at runtime by updating `nlp.vocab.lookups`.
```diff
- lemmatizer = Lemmatizer(rules=lemma_rules)
+ lemmatizer = Lemmatizer(lookups)
```
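As a minimal sketch of what this enables (the model name `"en_core_web_sm"` is
an assumption for illustration, not part of this API), lookup data can be
modified at runtime through the shared vocab:

```python
import spacy

# Assumed model name, for illustration; any pipeline with lemma tables works.
nlp = spacy.load("en_core_web_sm")

# The lemmatizer reads its tables from the shared Lookups object on the
# vocab, so edits here take effect immediately.
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
lemma_lookup["aardwolves"] = "aardwolf"  # add a custom lookup entry
```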
## Lemmatizer.__call__
Lemmatize a string.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]
```
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to lemmatize, e.g. the token text. |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict / `None` | Morphological features following the Universal Dependencies scheme. |
| **RETURNS** | list | The available lemmas for the string. |
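Since `univ_pos` also accepts integer tags, here is a short sketch using
spaCy's POS symbols (continuing the `lemmatizer` from the example above):

```python
from spacy.symbols import NOUN

# univ_pos may be given as one of spaCy's integer POS symbols instead of
# a string; both forms resolve to the same rule group.
lemmas = lemmatizer("ducks", NOUN)
assert lemmas == ["duck"]
```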
## Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a lookup table via the
`Lookups`.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go"})
lemmatizer = Lemmatizer(lookups)
assert lemmatizer.lookup("going") == "go"
```
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to look up. |
| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |
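A brief sketch of the `orth` argument, assuming the `get_string_id` helper
from `spacy.strings` (an assumption about this spaCy version, not stated on
this page):

```python
from spacy.strings import get_string_id

# Precompute the hash once and pass it via orth, so lookup doesn't have
# to rehash the string on every call.
orth = get_string_id("going")
assert lemmatizer.lookup("going", orth=orth) == "go"
```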
## Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example

```python
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form == True
```
| Name | Type | Description |
| --- | --- | --- |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict | The token's morphological features. |
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
## Attributes
| Name | Type | Description |
| --- | --- | --- |
| `lookups` (v2.2) | `Lookups` | The lookups object containing the rules and data, if available. |
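A small sketch, continuing the `__init__` example above:

```python
# The Lookups object passed to Lemmatizer.__init__ is exposed as the
# lookups attribute, so its tables can be checked or fetched later on.
assert lemmatizer.lookups.has_table("lemma_rules")
rules = lemmatizer.lookups.get_table("lemma_rules")
```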