---
title: Lemmatizer
teaser: Assign the base forms of words
tag: class
source: spacy/lemmatizer.py
---

The `Lemmatizer` supports simple part-of-speech-sensitive suffix rules and
lookup tables.

## Lemmatizer.\_\_init\_\_ {#init tag="method"}

Initialize a `Lemmatizer`. Typically, this happens under the hood within spaCy
when a `Language` subclass and its `Vocab` are initialized.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> from spacy.lookups import Lookups
> lookups = Lookups()
> lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
> lemmatizer = Lemmatizer(lookups)
> ```
>
> For examples of the data format, see the
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) repo.

| Name | Type | Description |
| -------------------------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `lookups` <Tag variant="new">2.2</Tag> | [`Lookups`](/api/lookups) | The lookups object containing the (optional) tables `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. |
| **RETURNS** | `Lemmatizer` | The newly created object. |

<Infobox title="Deprecation note" variant="danger">

As of v2.2, the lemmatizer is initialized with a [`Lookups`](/api/lookups)
object containing tables for the different components. This makes it easier for
spaCy to share and serialize rules and lookup tables via the `Vocab`, and allows
users to modify lemmatizer data at runtime by updating `nlp.vocab.lookups`.

```diff
- lemmatizer = Lemmatizer(rules=lemma_rules)
+ lemmatizer = Lemmatizer(lookups)
```

</Infobox>

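To illustrate why runtime updates to the lookups can take effect, here is a minimal pure-Python sketch. It is not spaCy's implementation: it assumes a lemmatizer that keeps a reference to the table container and fetches the table on every call instead of caching it at construction time. `TableStore` and `SketchLemmatizer` are hypothetical names standing in for `Lookups` and `Lemmatizer`.

```python
class TableStore:
    """Hypothetical stand-in for a lookups container: named tables in a dict."""

    def __init__(self):
        self.tables = {}

    def add_table(self, name, data):
        # Adding a table under an existing name replaces the old one.
        self.tables[name] = dict(data)

    def get_table(self, name, default=None):
        return self.tables.get(name, default)


class SketchLemmatizer:
    """Hypothetical lemmatizer that reads its table at call time (assumption)."""

    def __init__(self, store):
        self.store = store  # keep the container, not a copy of one table

    def lookup(self, string):
        # Fetch the table on every call so replaced tables are picked up.
        table = self.store.get_table("lemma_lookup", {})
        return table.get(string, string)


store = TableStore()
store.add_table("lemma_lookup", {"going": "go"})
lemmatizer = SketchLemmatizer(store)
before = lemmatizer.lookup("went")

# Replace the table after the lemmatizer was created.
store.add_table("lemma_lookup", {"going": "go", "went": "go"})
after = lemmatizer.lookup("went")
```

In this sketch, `before` is the unchanged input because `"went"` is not yet in the table, while `after` reflects the replaced table without re-creating the lemmatizer.
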
## Lemmatizer.\_\_call\_\_ {#call tag="method"}

Lemmatize a string.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> from spacy.lookups import Lookups
> lookups = Lookups()
> lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
> lemmatizer = Lemmatizer(lookups)
> lemmas = lemmatizer("ducks", "NOUN")
> assert lemmas == ["duck"]
> ```

| Name | Type | Description |
| ------------ | ------------- | -------------------------------------------------------------------------------------------------------- |
| `string` | unicode | The string to lemmatize, e.g. the token text. |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. |
| **RETURNS** | list | The available lemmas for the string. |

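The following simplified sketch shows how suffix rules in the `[old_suffix, new_suffix]` format used above might be combined with an index of known lemmas and a table of exceptions. It is an illustration under stated assumptions, not spaCy's actual code: the function name `apply_suffix_rules`, the precedence of exceptions over rules, and the preference for forms found in the index are all assumptions made for this example.

```python
def apply_suffix_rules(string, rules, index=None, exceptions=None):
    """Return candidate lemmas for `string` (simplified illustration)."""
    index = index or set()
    exceptions = exceptions or {}
    string = string.lower()
    # Assumption: irregular forms listed as exceptions win outright.
    if string in exceptions:
        return list(exceptions[string])
    known, unknown = [], []
    for old, new in rules:
        if string.endswith(old):
            # Strip the old suffix and attach the new one.
            form = string[: len(string) - len(old)] + new
            if not form:
                continue
            # Assumption: forms present in the index are preferred.
            target = known if form in index else unknown
            if form not in target:
                target.append(form)
    # Fall back to unverified forms, then to the input itself.
    return known or unknown or [string]


rules = [["s", ""], ["ies", "y"]]
ducks = apply_suffix_rules("ducks", rules)
ponies = apply_suffix_rules("ponies", rules, index={"pony"})
mice = apply_suffix_rules("mice", rules, exceptions={"mice": ["mouse"]})
```

With these inputs, `"ducks"` has no index to check against and yields the rule-derived form, `"ponies"` yields only the form found in the index, and `"mice"` is resolved through the exceptions table.
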
## Lemmatizer.lookup {#lookup tag="method" new="2"}

Look up a lemma in the lookup table, if available. If no lemma is found, the
original string is returned. Languages can provide a
[lookup table](/usage/adding-languages#lemmatizer) via the `"lemma_lookup"`
table of the `Lookups`.

> #### Example
>
> ```python
> from spacy.lemmatizer import Lemmatizer
> from spacy.lookups import Lookups
> lookups = Lookups()
> lookups.add_table("lemma_lookup", {"going": "go"})
> lemmatizer = Lemmatizer(lookups)
> assert lemmatizer.lookup("going") == "go"
> ```

| Name | Type | Description |
| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- |
| `string` | unicode | The string to look up. |
| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |

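As a rough illustration of the fallback behaviour and the optional `orth` argument, the sketch below models a table whose keys are stored as hashes. The helper names (`hash_string`, `build_table`, `lookup`) and the use of `zlib.crc32` as the hash are assumptions for this example only; spaCy uses its own string hashing and table classes.

```python
import zlib


def hash_string(string):
    # Stand-in hash for illustration only; not spaCy's hash function.
    return zlib.crc32(string.encode("utf8"))


def build_table(data):
    # Store each entry under the hash of its key.
    return {hash_string(key): value for key, value in data.items()}


def lookup(table, string, orth=None):
    # Use the precomputed hash if given, otherwise hash the string.
    key = orth if orth is not None else hash_string(string)
    # Fall back to the original string when no lemma is stored.
    return table.get(key, string)


table = build_table({"going": "go"})
found = lookup(table, "going")
precomputed = lookup(table, "going", orth=hash_string("going"))
missing = lookup(table, "walked")
```

Passing a precomputed hash avoids hashing the same string again, which is the kind of saving the `orth` argument appears intended for.
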
## Lemmatizer.is_base_form {#is_base_form tag="method"}

Check whether we're dealing with an uninflected paradigm, so we can avoid
lemmatization entirely.

> #### Example
>
> ```python
> pos = "verb"
> morph = {"VerbForm": "inf"}
> is_base_form = lemmatizer.is_base_form(pos, morph)
> assert is_base_form == True
> ```

| Name | Type | Description |
| ------------ | ------------- | --------------------------------------------------------------------------------------- |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict | The token's morphological features. |
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |

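A check of this kind could look roughly like the sketch below. Only the verb/infinitive case is taken from the example above; the noun and adjective conditions, the function name `is_base_form_sketch`, and the restriction to string tags are assumptions chosen for illustration and may not match spaCy's actual rules.

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Guess whether a tag plus features describe an uninflected form."""
    morphology = morphology or {}
    pos = univ_pos.lower()  # this sketch handles string tags only
    if pos == "verb" and morphology.get("VerbForm") == "inf":
        return True  # infinitive, as in the documented example
    if pos == "noun" and morphology.get("Number") == "sing":
        return True  # assumed: singular nouns are already base forms
    if pos == "adj" and morphology.get("Degree") == "pos":
        return True  # assumed: positive-degree adjectives are base forms
    return False


infinitive = is_base_form_sketch("verb", {"VerbForm": "inf"})
plural = is_base_form_sketch("noun", {"Number": "plur"})
no_features = is_base_form_sketch("verb")
```

When such a check returns `True`, a caller can return the lowercased input directly and skip rule application.
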
## Attributes {#attributes}

| Name | Type | Description |
| -------------------------------------- | ------------------------- | --------------------------------------------------------------- |
| `lookups` <Tag variant="new">2.2</Tag> | [`Lookups`](/api/lookups) | The lookups object containing the rules and data, if available. |