---
title: Lemmatizer
teaser: Assign the base forms of words
tag: class
source: spacy/lemmatizer.py
---

The `Lemmatizer` supports simple part-of-speech-sensitive suffix rules and
lookup tables.

## Lemmatizer.\_\_init\_\_ {#init tag="method"}

Create a `Lemmatizer`.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> lemmatizer = Lemmatizer()
> ```

| Name         | Type          | Description                                                         |
| ------------ | ------------- | ------------------------------------------------------------------- |
| `index`      | dict / `None` | Inventory of lemmas in the language.                                |
| `exceptions` | dict / `None` | Mapping of string forms to lemmas that bypass the `rules`.          |
| `rules`      | dict / `None` | Mapping of part-of-speech tags to lists of suffix rewrite rules.    |
| `lookup`     | dict / `None` | Lookup table mapping strings to their lemmas.                       |
| **RETURNS**  | `Lemmatizer`  | The newly created object.                                           |

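To make the four arguments more concrete, here is a small sketch of what the data might look like. The shapes of `rules` and `lookup` follow the examples on this page; the nesting of `index` and `exceptions` by part of speech is an assumption made for illustration, and `lemmatizer_kwargs` is a hypothetical name.

```python
# Illustrative data only. The nesting of `index` and `exceptions` by
# part of speech is an assumption, not something this page specifies.
index = {"noun": {"duck", "goose"}}          # known lemmas
exceptions = {"noun": {"geese": ["goose"]}}  # irregular forms that bypass the rules
rules = {"noun": [["s", ""]]}                # [old_suffix, new_suffix] pairs
lookup = {"going": "go"}                     # plain string-to-lemma table

# Every argument may also be None, so any subset can be supplied.
lemmatizer_kwargs = {
    "index": index,
    "exceptions": exceptions,
    "rules": rules,
    "lookup": lookup,
}
```

These dictionaries could then be passed as keyword arguments, for example `Lemmatizer(**lemmatizer_kwargs)`.
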
## Lemmatizer.\_\_call\_\_ {#call tag="method"}

Lemmatize a string.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> rules = {"noun": [["s", ""]]}
> lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
> lemmas = lemmatizer("ducks", "NOUN")
> assert lemmas == ["duck"]
> ```

| Name         | Type          | Description                                                                                              |
| ------------ | ------------- | -------------------------------------------------------------------------------------------------------- |
| `string`     | unicode       | The string to lemmatize, e.g. the token text.                                                            |
| `univ_pos`   | unicode / int | The token's universal part-of-speech tag.                                                                |
| `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. |
| **RETURNS**  | list          | The available lemmas for the string.                                                                     |

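The way suffix rules, exceptions and the index can interact is easier to see in a standalone sketch. The following is a simplified illustration in plain Python and not spaCy's actual implementation; `apply_suffix_rules` is a hypothetical helper, and preferring forms found in the index over unknown candidates is an assumption.

```python
def apply_suffix_rules(string, index, exceptions, rules):
    """Hypothetical helper: return candidate lemmas for `string`.

    index: set of known lemmas
    exceptions: dict mapping a form to a list of lemmas
    rules: list of [old_suffix, new_suffix] pairs
    """
    # Exceptions bypass the suffix rules entirely.
    if string in exceptions:
        return list(exceptions[string])
    known, unknown = [], []
    for old, new in rules:
        if old and string.endswith(old):
            form = string[: len(string) - len(old)] + new
            # Assumption: candidates found in the index are preferred.
            if form in index:
                known.append(form)
            else:
                unknown.append(form)
    # Fall back to unknown candidates, then to the unchanged input.
    return known or unknown or [string]


noun_rules = [["s", ""], ["es", ""]]
assert apply_suffix_rules("ducks", set(), {}, noun_rules) == ["duck"]
assert apply_suffix_rules("glasses", {"glass"}, {}, noun_rules) == ["glass"]
assert apply_suffix_rules("geese", set(), {"geese": ["goose"]}, noun_rules) == ["goose"]
```

Returning a list rather than a single string matches the return type in the table above, since more than one rule may produce a plausible lemma.
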
## Lemmatizer.lookup {#lookup tag="method" new="2"}

Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a
[lookup table](/usage/adding-languages#lemmatizer) via the `resources` set on
the individual `Language` class.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> lookup = {"going": "go"}
> lemmatizer = Lemmatizer(lookup=lookup)
> assert lemmatizer.lookup("going") == "go"
> ```

| Name        | Type    | Description                                                                                                 |
| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- |
| `string`    | unicode | The string to look up.                                                                                      |
| `orth`      | int     | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string.                                           |

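The fallback behaviour described above amounts to a dictionary lookup with the input as the default. A minimal sketch in plain Python, where `lookup_lemma` is a hypothetical stand-in and the optional `orth` hash is deliberately left out:

```python
def lookup_lemma(table, string):
    """Hypothetical helper: return the lemma for `string`, or `string` itself."""
    # A missing entry falls back to the original string.
    return table.get(string, string)


table = {"going": "go"}
assert lookup_lemma(table, "going") == "go"
assert lookup_lemma(table, "gone") == "gone"  # not in the table, returned unchanged
```
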
## Lemmatizer.is_base_form {#is_base_form tag="method"}

Check whether we're dealing with an uninflected paradigm, so we can avoid
lemmatization entirely.

> #### Example
>
> ```python
> pos = "verb"
> morph = {"VerbForm": "inf"}
> is_base_form = lemmatizer.is_base_form(pos, morph)
> assert is_base_form == True
> ```

| Name         | Type          | Description                                                                             |
| ------------ | ------------- | --------------------------------------------------------------------------------------- |
| `univ_pos`   | unicode / int | The token's universal part-of-speech tag.                                               |
| `morphology` | dict          | The token's morphological features.                                                     |
| **RETURNS**  | bool          | Whether the token's part-of-speech tag and morphological features describe a base form. |

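As a rough illustration of this kind of check, the sketch below treats a form as uninflected when its tag and features match a known base-form pattern. Only the infinitive verb case comes from the example above; the singular noun case is an assumption added for illustration, and `looks_like_base_form` is a hypothetical helper rather than the library's code.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Hypothetical helper: True if the features describe an uninflected form."""
    morphology = morphology or {}
    # Case taken from the example above: an infinitive verb.
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    # Assumed additional case: a singular noun.
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    return False


assert looks_like_base_form("verb", {"VerbForm": "inf"})
assert not looks_like_base_form("verb", {"VerbForm": "part"})
```

When such a check succeeds, the string can be returned as its own lemma without applying any suffix rules.
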
## Attributes {#attributes}

| Name                                      | Type          | Description                                                      |
| ----------------------------------------- | ------------- | ---------------------------------------------------------------- |
| `index`                                   | dict / `None` | Inventory of lemmas in the language.                             |
| `exc`                                     | dict / `None` | Mapping of string forms to lemmas that bypass the `rules`.       |
| `rules`                                   | dict / `None` | Mapping of part-of-speech tags to lists of suffix rewrite rules. |
| `lookup_table` <Tag variant="new">2</Tag> | dict / `None` | The lemma lookup table, if available.                            |