Mirror of https://github.com/explosion/spaCy.git, synced 2025-11-04 01:48:04 +03:00
| title | teaser | tag | source | 
|---|---|---|---|
| Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py | 
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
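The way the two mechanisms combine can be sketched in plain Python. The `lemmatize` function, its arguments, and the rule format below are illustrative assumptions for the sketch, not spaCy's internal code:

```python
# Illustrative sketch only, NOT spaCy's actual implementation: a lookup
# table maps full strings to lemmas, while suffix rules are tried per
# part-of-speech as [old_suffix, new_suffix] pairs.
def lemmatize(string, univ_pos, rules, lookup_table):
    # Direct hit in the lookup table wins.
    if string in lookup_table:
        return lookup_table[string]
    # Otherwise try each suffix rule registered for this POS.
    for old, new in rules.get(univ_pos, []):
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    # No rule matched: fall back to the original string.
    return string

rules = {"noun": [["s", ""]]}
lookup_table = {"going": "go"}
print(lemmatize("ducks", "noun", rules, lookup_table))  # duck
print(lemmatize("going", "verb", rules, lookup_table))  # go
```

The sketch mirrors the documented behavior: a lookup hit short-circuits, a matching suffix rule rewrites the ending, and an unmatched string comes back unchanged.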
## Lemmatizer.__init__

Create a `Lemmatizer`.

Example:

```python
from spacy.lemmatizer import Lemmatizer

lemmatizer = Lemmatizer()
```
| Name | Type | Description |
| --- | --- | --- |
| `index` | dict / `None` | Inventory of lemmas in the language. |
| `exceptions` | dict / `None` | Mapping of string forms to lemmas that bypass the rules. |
| `rules` | dict / `None` | List of suffix rewrite rules. |
| `lookup` | dict / `None` | Lookup table mapping strings to their lemmas. |
| **RETURNS** | `Lemmatizer` | The newly created object. |
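The table above says that `exceptions` bypass the rules; that precedence can be sketched in plain Python. The `apply` helper and its data shapes are hypothetical illustrations, not the constructor's actual logic:

```python
# Illustrative precedence sketch, not spaCy's code: exceptions are
# consulted before any suffix rule is applied.
def apply(string, univ_pos, exceptions, rules):
    # Exceptions map irregular forms straight to their lemmas.
    if string in exceptions.get(univ_pos, {}):
        return exceptions[univ_pos][string]
    for old, new in rules.get(univ_pos, []):
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    return string

rules = {"noun": [["s", ""], ["e", ""]]}
exceptions = {"noun": {"geese": "goose"}}
# The exception wins, so the ["e", ""] rule never produces "gees":
print(apply("geese", "noun", exceptions, rules))  # goose
# Regular forms still go through the rules:
print(apply("ducks", "noun", exceptions, rules))  # duck
```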
## Lemmatizer.__call__

Lemmatize a string.

Example:

```python
from spacy.lemmatizer import Lemmatizer

rules = {"noun": [["s", ""]]}
lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]
```
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to lemmatize, e.g. the token text. |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict / `None` | Morphological features following the Universal Dependencies scheme. |
| **RETURNS** | list | The available lemmas for the string. |
## Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the `resources`, set on the individual `Language` class.

Example:

```python
from spacy.lemmatizer import Lemmatizer

lookup = {"going": "go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup("going") == "go"
```
| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to look up. |
| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |
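The fallback described above amounts to a defaulting dictionary lookup; a minimal plain-Python sketch of that behavior (illustrative, not spaCy's code):

```python
# Illustrative: return the table entry when present, otherwise the
# original string unchanged.
def lookup(table, string):
    return table.get(string, string)

table = {"going": "go"}
print(lookup(table, "going"))  # go
print(lookup(table, "ducks"))  # ducks (not in the table, returned as-is)
```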
## Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example:

```python
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form is True
```
| Name | Type | Description |
| --- | --- | --- |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict | The token's morphological features. |
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
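To make the idea of a base-form check concrete, here is a hypothetical sketch covering two plausible cases (an infinitive verb and a singular noun). The real per-language implementation covers many more part-of-speech and feature combinations, so treat this purely as an illustration:

```python
# Hypothetical base-form check, NOT the actual implementation.
def is_base_form(univ_pos, morphology=None):
    morphology = morphology or {}
    # An infinitive verb is already its own lemma.
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    # Treat a singular noun as uninflected for this sketch.
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    return False

print(is_base_form("verb", {"VerbForm": "inf"}))                 # True
print(is_base_form("verb", {"VerbForm": "fin", "Tense": "past"}))  # False
```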
## Attributes

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict / `None` | Inventory of lemmas in the language. |
| `exc` | dict / `None` | Mapping of string forms to lemmas that bypass the rules. |
| `rules` | dict / `None` | List of suffix rewrite rules. |
| `lookup_table` | dict / `None` | The lemma lookup table, if available. |