mirror of https://github.com/explosion/spaCy.git, synced 2025-10-31 07:57:35 +03:00

| title | teaser | tag | source | 
|---|---|---|---|
| Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py | 
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
Lemmatizer.__init__
Initialize a Lemmatizer. Typically, this happens under the hood within spaCy
when a Language subclass and its Vocab is initialized.
Example

    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
    lemmatizer = Lemmatizer(lookups)

For examples of the data format, see the spacy-lookups-data repo.
| Name | Type | Description | 
|---|---|---|
| lookups (new in v2.2) | Lookups | The lookups object containing the (optional) tables "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". |
Lemmatizer.__call__
Lemmatize a string.
Example

    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
    lemmatizer = Lemmatizer(lookups)
    lemmas = lemmatizer("ducks", "NOUN")
    assert lemmas == ["duck"]

| Name | Type | Description | 
|---|---|---|
| string | str | The string to lemmatize, e.g. the token text. | 
| univ_pos | str / int | The token's universal part-of-speech tag. | 
| morphology | dict / None | Morphological features following the Universal Dependencies scheme. | 
| RETURNS | list | The available lemmas for the string. | 
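
To make the interplay of suffix rules, the lemma index and the exceptions table more concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `lemmatize_with_rules`, its parameters and the priority order (exceptions first, then rule candidates filtered by the index) are assumptions made for illustration only.

```python
def lemmatize_with_rules(string, rules, index=(), exceptions=None):
    """Return candidate lemmas for a string (illustrative sketch only)."""
    exceptions = exceptions or {}
    string = string.lower()
    # Assumption: irregular forms listed as exceptions win over suffix rules.
    if string in exceptions:
        return list(exceptions[string])
    candidates = []
    for old, new in rules:
        # Each rule is an [old_suffix, new_suffix] pair, as in "lemma_rules".
        if old and string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form and form not in candidates:
                candidates.append(form)
    # Prefer candidates that appear in the index of known lemmas, if any.
    known = [form for form in candidates if form in index]
    if known:
        return known
    # Otherwise return every candidate, or the input itself if no rule fired.
    return candidates or [string]


noun_rules = [["ies", "y"], ["s", ""]]
print(lemmatize_with_rules("ducks", noun_rules))
print(lemmatize_with_rules("ponies", noun_rules, index={"pony"}))
print(lemmatize_with_rules("geese", noun_rules, exceptions={"geese": ["goose"]}))
```

The index matters when several rules match: without it, "ponies" would yield both "pony" and "ponie" as candidates.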
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a lookup table via the
Lookups object.
Example
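
    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_lookup", {"going": "go"})
    lemmatizer = Lemmatizer(lookups)
    assert lemmatizer.lookup("going") == "go"
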
| Name | Type | Description | 
|---|---|---|
| string | str | The string to look up. | 
| orth | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to None. | 
| RETURNS | str | The lemma if the string was found, otherwise the original string. | 
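
The fallback behaviour described above can be sketched with a plain dictionary. This is only an approximation under stated assumptions: `lookup_lemma` is a hypothetical helper, the table is keyed by plain strings rather than hashes, and the `key` argument merely stands in for the optional `orth` value.

```python
def lookup_lemma(table, string, key=None):
    """Return the lemma stored for `string`, or `string` itself if absent."""
    # Assumption: the table is keyed by strings; `key` plays the role of an
    # optional precomputed key and defaults to the string itself.
    if key is None:
        key = string
    return table.get(key, string)


lemma_lookup = {"going": "go", "mice": "mouse"}
print(lookup_lemma(lemma_lookup, "going"))
print(lookup_lemma(lemma_lookup, "tree"))
```

Because the original string is returned on a miss, callers always get a usable value back and do not need to check for `None`.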
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example

    # Assumes `lemmatizer` was created as in the examples above.
    pos = "verb"
    morph = {"VerbForm": "inf"}
    is_base_form = lemmatizer.is_base_form(pos, morph)
    assert is_base_form == True

| Name | Type | Description | 
|---|---|---|
| univ_pos | str / int | The token's universal part-of-speech tag. | 
| morphology | dict | The token's morphological features. | 
| RETURNS | bool | Whether the token's part-of-speech tag and morphological features describe a base form. | 
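
As a rough illustration of the idea, a base-form check can be written as a mapping from part-of-speech tags to the one morphological feature value that marks an uninflected form. The function `looks_like_base_form` and the specific feature table below are assumptions chosen for this sketch, not spaCy's actual rules, and the sketch only handles string tags.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Guess whether a POS tag plus features describe an uninflected form."""
    morphology = morphology or {}
    # Hypothetical table: one Universal Dependencies feature per POS that
    # signals a base form (infinitive verb, singular noun, positive adjective).
    base_features = {
        "verb": ("VerbForm", "inf"),
        "noun": ("Number", "sing"),
        "adj": ("Degree", "pos"),
    }
    feature = base_features.get(univ_pos.lower())
    if feature is None:
        return False
    name, value = feature
    return morphology.get(name) == value


print(looks_like_base_form("verb", {"VerbForm": "inf"}))
print(looks_like_base_form("noun", {"Number": "plur"}))
```

When such a check succeeds, the lemmatizer can return the string unchanged and skip the rule and index lookups.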
Attributes
| Name | Type | Description | 
|---|---|---|
| lookups (new in v2.2) | Lookups | The lookups object containing the rules and data, if available. |