title | teaser | tag | source |
---|---|---|---|
Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py |
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.
Lemmatizer.__init__
Create a Lemmatizer.
Example
from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()
Name | Type | Description |
---|---|---|
index | dict / None | Inventory of lemmas in the language. |
exceptions | dict / None | Mapping of string forms to lemmas that bypass the rules. |
rules | dict / None | List of suffix rewrite rules. |
lookup | dict / None | Lookup table mapping strings to their lemmas. |
RETURNS | Lemmatizer | The newly created object. |
Lemmatizer.__call__
Lemmatize a string.
Example
from spacy.lemmatizer import Lemmatizer
rules = {"noun": [["s", ""]]}
lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]
Name | Type | Description |
---|---|---|
string | unicode | The string to lemmatize, e.g. the token text. |
univ_pos | unicode / int | The token's universal part-of-speech tag. |
morphology | dict / None | Morphological features following the Universal Dependencies scheme. |
RETURNS | list | The available lemmas for the string. |
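How exceptions, suffix rules and the index might interact can be sketched in plain Python. This is a minimal illustration, not spaCy's implementation: the helper name `apply_suffix_rules` is hypothetical, and the shapes of `index` and `exceptions` (dicts keyed by lowercase part-of-speech) are assumptions made for the example. Only the shape of `rules` is taken from the example above.

```python
def apply_suffix_rules(string, pos, index, exceptions, rules):
    """Hypothetical helper: return candidate lemmas for `string`."""
    string = string.lower()
    pos = pos.lower()
    # Exceptions bypass the suffix rules entirely.
    forms = list(exceptions.get(pos, {}).get(string, []))
    if forms:
        return forms
    known = index.get(pos, set())
    in_index, out_of_index = [], []
    for old, new in rules.get(pos, []):
        if old and string.endswith(old):
            # Replace the matched suffix with its rewrite.
            candidate = string[: len(string) - len(old)] + new
            if not candidate:
                continue
            target = in_index if candidate in known else out_of_index
            if candidate not in target:
                target.append(candidate)
    # Prefer candidates listed in the index, then any other candidates,
    # and finally fall back to the (lowercased) string itself.
    return in_index or out_of_index or [string]


rules = {"noun": [["s", ""]]}
exceptions = {"noun": {"mice": ["mouse"]}}  # assumed shape
index = {"noun": {"duck"}}  # assumed shape

assert apply_suffix_rules("ducks", "NOUN", index, exceptions, rules) == ["duck"]
assert apply_suffix_rules("mice", "NOUN", index, exceptions, rules) == ["mouse"]
assert apply_suffix_rules("sheep", "NOUN", index, exceptions, rules) == ["sheep"]
```

The sketch assumes the part-of-speech is passed as a string; the real method also accepts an integer ID.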
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the resources, set on the individual Language class.
Example
from spacy.lemmatizer import Lemmatizer
lookup = {"going": "go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup("going") == "go"
Name | Type | Description |
---|---|---|
string | unicode | The string to look up. |
orth | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to None. |
RETURNS | unicode | The lemma if the string was found, otherwise the original string. |
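The fallback behaviour and the optional orth argument can be illustrated with a small self-contained sketch. Everything here is hypothetical: `LookupSketch` is not a spaCy class, and `string_id` uses CRC32 purely as a stand-in for spaCy's own string hashing.

```python
import zlib


def string_id(string):
    """Stand-in hash for this sketch only; spaCy uses its own string hashing."""
    return zlib.crc32(string.encode("utf8"))


class LookupSketch:
    """Hypothetical lookup table keyed by string hashes."""

    def __init__(self, data):
        self._table = {string_id(key): value for key, value in data.items()}

    def lookup(self, string, orth=None):
        # Use the precomputed hash if one is given, otherwise hash the string.
        key = orth if orth is not None else string_id(string)
        # Fall back to the original string when no lemma is stored.
        return self._table.get(key, string)


table = LookupSketch({"going": "go"})
assert table.lookup("going") == "go"
assert table.lookup("going", orth=string_id("going")) == "go"
assert table.lookup("walked") == "walked"
```

Passing a precomputed hash only saves the hashing step; the result is the same as looking up the string directly.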
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form == True
Name | Type | Description |
---|---|---|
univ_pos | unicode / int | The token's universal part-of-speech tag. |
morphology | dict | The token's morphological features. |
RETURNS | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
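As a rough illustration of the idea, a base-form check could compare the part-of-speech tag against a few morphological features. The function `is_base_form_sketch` is hypothetical, and apart from the infinitive verb case shown in the example above, the feature combinations are assumptions chosen for illustration rather than the rules spaCy applies.

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Hypothetical check; the feature combinations are illustrative assumptions."""
    morphology = morphology or {}
    pos = univ_pos.lower()  # this sketch assumes a string tag, not an integer ID
    if pos == "verb" and morphology.get("VerbForm") == "inf":
        return True  # infinitive verb, as in the example above
    if pos == "noun" and morphology.get("Number") == "sing":
        return True  # assumed: singular nouns are already base forms
    if pos == "adj" and morphology.get("Degree") == "pos":
        return True  # assumed: positive-degree adjectives are already base forms
    return False


assert is_base_form_sketch("verb", {"VerbForm": "inf"})
assert not is_base_form_sketch("noun", {"Number": "plur"})
assert not is_base_form_sketch("verb")
```

When such a check returns True, a caller could return the lowercased string directly and skip the suffix rules.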
Attributes
Name | Type | Description |
---|---|---|
index | dict / None | Inventory of lemmas in the language. |
exc | dict / None | Mapping of string forms to lemmas that bypass the rules. |
rules | dict / None | List of suffix rewrite rules. |
lookup_table | dict / None | The lemma lookup table, if available. |