Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-31 07:57:35 +03:00)

---
title: Lemmatizer
teaser: Assign the base forms of words
tag: class
source: spacy/lemmatizer.py
---

The `Lemmatizer` supports simple part-of-speech-sensitive suffix rules and
lookup tables.

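The interplay between these two mechanisms can be sketched in plain Python. This is a simplified model for illustration only, not spaCy's actual implementation; the function and variable names are made up for this sketch:

```python
# Simplified model of the Lemmatizer's two mechanisms (illustration only):
# a direct lookup table and part-of-speech-keyed suffix rules.
def lemmatize(string, pos, rules, lookup):
    # 1. Lookup table: a direct string -> lemma mapping.
    if string in lookup:
        return [lookup[string]]
    # 2. Suffix rules: each rule is a [suffix, replacement] pair.
    lemmas = []
    for old, new in rules.get(pos, []):
        if string.endswith(old):
            lemmas.append(string[: len(string) - len(old)] + new)
    # Fall back to the original string if nothing matched.
    return lemmas or [string]

rules = {"noun": [["s", ""]]}
lookup = {"mice": "mouse"}
assert lemmatize("ducks", "noun", rules, lookup) == ["duck"]   # suffix rule
assert lemmatize("mice", "noun", rules, lookup) == ["mouse"]   # lookup table
```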
## Lemmatizer.\_\_init\_\_ {#init tag="method"}

Initialize a `Lemmatizer`. Typically, this happens under the hood within spaCy
when a `Language` subclass and its `Vocab` is initialized.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> from spacy.lookups import Lookups
> lookups = Lookups()
> lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
> lemmatizer = Lemmatizer(lookups)
> ```
>
> For examples of the data format, see the
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) repo.

| Name                                   | Type                      | Description                                                                                                               |
| -------------------------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `lookups` <Tag variant="new">2.2</Tag> | [`Lookups`](/api/lookups) | The lookups object containing the (optional) tables `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. |
| **RETURNS**                            | `Lemmatizer`              | The newly created object.                                                                                                 |

<Infobox title="Deprecation note" variant="danger">

As of v2.2, the lemmatizer is initialized with a [`Lookups`](/api/lookups)
object containing tables for the different components. This makes it easier for
spaCy to share and serialize rules and lookup tables via the `Vocab`, and allows
users to modify lemmatizer data at runtime by updating `nlp.vocab.lookups`.

```diff
- lemmatizer = Lemmatizer(rules=lemma_rules)
+ lemmatizer = Lemmatizer(lookups)
```

</Infobox>

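Because the lemmatizer holds a reference to the shared `Lookups` object and reads from its tables at call time, edits to the tables take effect on subsequent calls. A minimal pure-Python model of that behaviour (a sketch with made-up names, using a plain dict as a stand-in for `Lookups`; not spaCy's code):

```python
# Sketch: the lemmatizer keeps a reference to a shared, mutable table
# store and reads from it on every call, so runtime edits take effect.
class SketchLemmatizer:
    def __init__(self, lookups):
        self.lookups = lookups  # shared reference, not a copy

    def __call__(self, string, univ_pos):
        rules = self.lookups.get("lemma_rules", {}).get(univ_pos, [])
        for old, new in rules:
            if string.endswith(old):
                return [string[: len(string) - len(old)] + new]
        return [string]

lookups = {"lemma_rules": {"noun": []}}
lemmatizer = SketchLemmatizer(lookups)
assert lemmatizer("ducks", "noun") == ["ducks"]  # no rules yet
# Add a rule to the shared store at runtime:
lookups["lemma_rules"]["noun"].append(["s", ""])
assert lemmatizer("ducks", "noun") == ["duck"]   # picked up immediately
```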
## Lemmatizer.\_\_call\_\_ {#call tag="method"}

Lemmatize a string.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> from spacy.lookups import Lookups
> lookups = Lookups()
> lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
> lemmatizer = Lemmatizer(lookups)
> lemmas = lemmatizer("ducks", "NOUN")
> assert lemmas == ["duck"]
> ```

| Name         | Type          | Description                                                                                              |
| ------------ | ------------- | -------------------------------------------------------------------------------------------------------- |
| `string`     | unicode       | The string to lemmatize, e.g. the token text.                                                            |
| `univ_pos`   | unicode / int | The token's universal part-of-speech tag.                                                                |
| `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. |
| **RETURNS**  | list          | The available lemmas for the string.                                                                     |

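Since several suffix rules can match the same string, the method returns a list of candidate lemmas rather than a single string. A simplified sketch of that behaviour (illustration only; the helper name is made up and this is not spaCy's implementation):

```python
# Sketch: apply every matching [suffix, replacement] rule and collect
# the distinct candidate lemmas (simplified; not spaCy's implementation).
def apply_rules(string, rules):
    candidates = []
    for old, new in rules:
        if string.endswith(old):
            cand = string[: len(string) - len(old)] + new
            if cand not in candidates:
                candidates.append(cand)
    return candidates or [string]

noun_rules = [["ies", "y"], ["es", "e"], ["s", ""]]
# Multiple rules match "berries", so multiple candidates come back:
assert apply_rules("berries", noun_rules) == ["berry", "berrie"]
```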
## Lemmatizer.lookup {#lookup tag="method" new="2"}

Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.

> #### Example
>
> ```python
> lookups = Lookups()
> lookups.add_table("lemma_lookup", {"going": "go"})
> lemmatizer = Lemmatizer(lookups)
> assert lemmatizer.lookup("going") == "go"
> ```

| Name        | Type    | Description                                                                                                 |
| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- |
| `string`    | unicode | The string to look up.                                                                                      |
| `orth`      | int     | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string.                                           |

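The documented fallback (return the original string when no entry exists) amounts to a dictionary lookup with a default. A minimal sketch, with made-up names:

```python
# Sketch of the lookup fallback behaviour (simplified model):
def lookup(string, table):
    # Return the table entry if present, else the original string.
    return table.get(string, string)

table = {"going": "go"}
assert lookup("going", table) == "go"
assert lookup("went", table) == "went"  # not in the table -> unchanged
```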
## Lemmatizer.is_base_form {#is_base_form tag="method"}

Check whether we're dealing with an uninflected paradigm, so we can avoid
lemmatization entirely.

> #### Example
>
> ```python
> pos = "verb"
> morph = {"VerbForm": "inf"}
> is_base_form = lemmatizer.is_base_form(pos, morph)
> assert is_base_form == True
> ```

| Name         | Type          | Description                                                                             |
| ------------ | ------------- | --------------------------------------------------------------------------------------- |
| `univ_pos`   | unicode / int | The token's universal part-of-speech tag.                                               |
| `morphology` | dict          | The token's morphological features.                                                     |
| **RETURNS**  | bool          | Whether the token's part-of-speech tag and morphological features describe a base form. |

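A base-form check of this kind can be modelled with a couple of Universal Dependencies feature tests. The sketch below is a simplified model covering just two illustrative cases (infinitive verbs and singular nouns); the real method handles many more:

```python
# Sketch: decide whether a token is already a base form from its
# coarse POS tag and UD morphological features (simplified model).
def is_base_form(univ_pos, morphology=None):
    morphology = morphology or {}
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    return False

assert is_base_form("verb", {"VerbForm": "inf"}) is True
assert is_base_form("verb", {"VerbForm": "fin"}) is False
```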
## Attributes {#attributes}

| Name                                   | Type                      | Description                                                     |
| -------------------------------------- | ------------------------- | --------------------------------------------------------------- |
| `lookups` <Tag variant="new">2.2</Tag> | [`Lookups`](/api/lookups) | The lookups object containing the rules and data, if available. |