spaCy/lemmatizer.md at aad66d9bb9ff0060085f61cd096885bbded77381

mirror of https://github.com/explosion/spaCy.git synced 2025-07-08 22:03:24 +03:00

💫 Adjust Table API and add docs (#4289)

* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format

2019-09-15 22:08:13 +02:00


title	teaser	tag	source
Lemmatizer	Assign the base forms of words	class	spacy/lemmatizer.py

The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.

Lemmatizer.__init__

Create a Lemmatizer.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exceptions`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Suffix rewrite rules, keyed by part-of-speech; each rule is an `[old_suffix, new_suffix]` pair.
`lookup`	dict / `None`	Lookup table mapping strings to their lemmas.
RETURNS	`Lemmatizer`	The newly created object.

Lemmatizer.__call__

Lemmatize a string.

Example

from spacy.lemmatizer import Lemmatizer
rules = {"noun": [["s", ""]]}
lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]

Name	Type	Description
`string`	unicode	The string to lemmatize, e.g. the token text.
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict / `None`	Morphological features following the Universal Dependencies scheme.
RETURNS	list	The available lemmas for the string.
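
To make the suffix-rule idea concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `lemmatize_with_rules` and its exact behavior (exceptions checked first, original string returned when nothing matches) are assumptions made for illustration. It takes the rule list for a single part-of-speech, e.g. `rules["noun"]` from the example above.

```python
def lemmatize_with_rules(string, rules, exceptions=None):
    """Hypothetical helper: return candidate lemmas for `string`.

    `rules` is a list of [old_suffix, new_suffix] pairs for one part-of-speech.
    `exceptions` maps a string form to a list of lemmas and bypasses the rules.
    """
    exceptions = exceptions or {}
    # Exceptions win over the rules (assumed behavior for this sketch).
    if string in exceptions:
        return list(exceptions[string])
    lemmas = []
    for old, new in rules:
        # Only non-empty suffixes that actually match are rewritten.
        if old and string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form and form not in lemmas:
                lemmas.append(form)
    # Fall back to the unchanged string if no rule applied.
    return lemmas or [string]


noun_rules = [["s", ""]]
print(lemmatize_with_rules("ducks", noun_rules))
print(lemmatize_with_rules("mice", noun_rules, {"mice": ["mouse"]}))
```

Returning a list rather than a single string mirrors the `RETURNS` row above, since more than one rule may apply to the same string.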

Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via their resources, which are set on the individual `Language` class.

Example

from spacy.lemmatizer import Lemmatizer
lookup = {"going": "go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup("going") == "go"

Name	Type	Description
`string`	unicode	The string to look up.
`orth`	int	Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`.
RETURNS	unicode	The lemma if the string was found, otherwise the original string.
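
The fallback behavior described above can be sketched with a plain dictionary. This is an illustration only: `lookup_lemma` is a hypothetical name, and the sketch does not model the optional `orth` hash.

```python
def lookup_lemma(table, string):
    """Hypothetical helper: return the lemma for `string`, or `string` itself."""
    # dict.get with a default gives the "return the original string" fallback.
    return table.get(string, string)


table = {"going": "go"}
print(lookup_lemma(table, "going"))  # found in the table
print(lookup_lemma(table, "went"))   # not in the table, returned unchanged
```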

Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form is True

Name	Type	Description
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict	The token's morphological features.
RETURNS	bool	Whether the token's part-of-speech tag and morphological features describe a base form.
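
One way to picture this check is a small table of part-of-speech and feature combinations that already denote a base form. The sketch below is not spaCy's code: `BASE_FORM_FEATURES` and `looks_like_base_form` are hypothetical names, and only the verb/infinitive pair is taken from the example above; the noun entry is an illustrative assumption.

```python
# Hypothetical rule table: part-of-speech -> features that mark a base form.
# Only the "verb" entry comes from the documented example; "noun" is assumed.
BASE_FORM_FEATURES = {
    "verb": {"VerbForm": "inf"},
    "noun": {"Number": "sing"},
}


def looks_like_base_form(univ_pos, morphology=None):
    """Return True if the features say the form is already uninflected."""
    morphology = morphology or {}
    required = BASE_FORM_FEATURES.get(univ_pos)
    if required is None:
        # Unknown part-of-speech: assume lemmatization is still needed.
        return False
    return all(morphology.get(key) == value for key, value in required.items())


print(looks_like_base_form("verb", {"VerbForm": "inf"}))
print(looks_like_base_form("verb", {"VerbForm": "part"}))
```

A caller could use such a check to return the string unchanged and skip the suffix rules entirely.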

Attributes

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exc`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Suffix rewrite rules, keyed by part-of-speech.
`lookup_table`	dict / `None`	The lemma lookup table, if available.
