spaCy/lexeme.md at 19dc42776af33ee209c256c739303f27aa458144

mirror of https://github.com/explosion/spaCy.git synced 2024-11-16 06:37:04 +03:00

Documentation updates for v2.3.0 (#5593)

Co-authored-by: Ines Montani <ines@ines.io>

2020-06-16 15:37:35 +02:00

title	teaser	tag	source
Lexeme	An entry in the vocabulary	class	spacy/lexeme.pyx

A Lexeme has no string context – it's a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
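The type/token distinction can be illustrated without spaCy. The following is a minimal sketch that assumes a naive whitespace split (the helper `split_tokens` is hypothetical and is not spaCy's tokenizer): every occurrence in the text is a token, while each distinct string corresponds to a single context-free entry, which is the role a `Lexeme` plays in the vocabulary.

```python
from collections import Counter


def split_tokens(text):
    # Hypothetical helper: naive whitespace split, for illustration only.
    return text.split()


tokens = split_tokens("the cat saw the other cat")
# One entry per distinct string, mapped to how often it occurs.
types = Counter(tokens)

n_tokens = len(tokens)  # occurrences in context
n_types = len(types)    # distinct, context-free word types
```

Here the two occurrences of "cat" are separate tokens, but they would share one vocabulary entry.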

Lexeme.\_\_init\_\_

Create a Lexeme object.

Name	Type	Description
`vocab`	`Vocab`	The parent vocabulary.
`orth`	int	The orth id of the lexeme.
RETURNS	`Lexeme`	The newly constructed object.

Lexeme.set_flag

Change the value of a boolean flag.

Example

COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
nlp.vocab["spaCy"].set_flag(COOL_FLAG, True)

Name	Type	Description
`flag_id`	int	The attribute ID of the flag to set.
`value`	bool	The new value of the flag.

Lexeme.check_flag

Check the value of a boolean flag.

Example

is_my_library = lambda text: text in ["spaCy", "Thinc"]
MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY)

Name	Type	Description
`flag_id`	int	The attribute ID of the flag to query.
RETURNS	bool	The value of the flag.
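To make the idea of boolean flags stored in a single integer more concrete (the `flags` attribute is documented below as an int container of binary flags), here is a rough sketch. It assumes each flag ID names a bit position; the functions `set_bit` and `check_bit` and the ID `5` are hypothetical and do not claim to reproduce spaCy's internal layout.

```python
def set_bit(flags, flag_id, value):
    """Return `flags` with the bit at position `flag_id` set to `value`."""
    if value:
        return flags | (1 << flag_id)
    return flags & ~(1 << flag_id)


def check_bit(flags, flag_id):
    """Return True if the bit at position `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


COOL_FLAG = 5  # hypothetical flag ID, chosen only for this sketch
flags = 0
flags = set_bit(flags, COOL_FLAG, True)
flag_is_set = check_bit(flags, COOL_FLAG)
flags_after_clear = set_bit(flags, COOL_FLAG, False)
```

Packing flags this way keeps many boolean attributes in one machine word, which is why setting and checking reduce to a single bitwise operation.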

Lexeme.similarity

Compute a semantic similarity estimate. Defaults to the cosine similarity of the two objects' vectors.

Example

apple = nlp.vocab["apple"]
orange = nlp.vocab["orange"]
apple_orange = apple.similarity(orange)
orange_apple = orange.similarity(apple)
assert apple_orange == orange_apple

Name	Type	Description
other	-	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects.
RETURNS	float	A scalar similarity score. Higher is more similar.
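For readers unfamiliar with cosine similarity, the default measure mentioned above can be sketched in plain Python. This is an illustration of the formula only, not spaCy's implementation; the function name `cosine_similarity` and the choice to return `0.0` for a zero vector are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat a zero vector as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the dot product and the product of the norms are both symmetric in their arguments, the score does not depend on argument order, which matches the assertion in the example above.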

Lexeme.has_vector

A boolean value indicating whether a word vector is associated with the lexeme.

Example

apple = nlp.vocab["apple"]
assert apple.has_vector

Name	Type	Description
RETURNS	bool	Whether the lexeme has vector data attached.

Lexeme.vector

A real-valued meaning representation.

Example

apple = nlp.vocab["apple"]
assert apple.vector.dtype == "float32"
assert apple.vector.shape == (300,)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=1, dtype='float32']`	A 1D numpy array representing the lexeme's semantics.

Lexeme.vector_norm

The L2 norm of the lexeme's vector representation.

Example

apple = nlp.vocab["apple"]
pasta = nlp.vocab["pasta"]
apple.vector_norm  # 7.1346845626831055
pasta.vector_norm  # 7.759851932525635
assert apple.vector_norm != pasta.vector_norm

Name	Type	Description
RETURNS	float	The L2 norm of the vector representation.
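As a reminder of what the L2 norm measures, the sketch below computes it for a plain Python list: the square root of the sum of squared components. The function name `l2_norm` is hypothetical, and the sketch says nothing about how spaCy computes or caches the value.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


norm = l2_norm([3.0, 4.0])
# Dividing each component by the norm gives a vector of length 1.
unit = [x / norm for x in [3.0, 4.0]]
unit_norm = l2_norm(unit)
```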

Attributes

Name	Type	Description
`vocab`	`Vocab`	The lexeme's vocabulary.
`text`	unicode	Verbatim text content.
`orth`	int	ID of the verbatim text content.
`orth_`	unicode	Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes.
`rank`	int	Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors.
`flags`	int	Container of the lexeme's binary flags.
`norm`	int	The lexeme's norm, i.e. a normalized form of the lexeme text.
`norm_`	unicode	The lexeme's norm, i.e. a normalized form of the lexeme text.
`lower`	int	Lowercase form of the word.
`lower_`	unicode	Lowercase form of the word.
`shape`	int	Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.
`shape_`	unicode	Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.
`prefix`	int	Length-N substring from the start of the word. Defaults to `N=1`.
`prefix_`	unicode	Length-N substring from the start of the word. Defaults to `N=1`.
`suffix`	int	Length-N substring from the end of the word. Defaults to `N=3`.
`suffix_`	unicode	Length-N substring from the end of the word. Defaults to `N=3`.
`is_alpha`	bool	Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`.
`is_ascii`	bool	Does the lexeme consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in lexeme.text)`.
`is_digit`	bool	Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`.
`is_lower`	bool	Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`.
`is_upper`	bool	Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`.
`is_title`	bool	Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`.
`is_punct`	bool	Is the lexeme punctuation?
`is_left_punct`	bool	Is the lexeme a left punctuation mark, e.g. `(`?
`is_right_punct`	bool	Is the lexeme a right punctuation mark, e.g. `)`?
`is_space`	bool	Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`.
`is_bracket`	bool	Is the lexeme a bracket?
`is_quote`	bool	Is the lexeme a quotation mark?
`is_currency` (new in v2.0.8)	bool	Is the lexeme a currency symbol?
`like_url`	bool	Does the lexeme resemble a URL?
`like_num`	bool	Does the lexeme represent a number, e.g. "10.9", "10" or "ten"?
`like_email`	bool	Does the lexeme resemble an email address?
`is_oov`	bool	Is the lexeme out-of-vocabulary, i.e. does it lack a word vector?
`is_stop`	bool	Is the lexeme part of a "stop list"?
`lang`	int	Language of the parent vocabulary.
`lang_`	unicode	Language of the parent vocabulary.
`prob`	float	Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary).
`cluster`	int	Brown cluster ID.
`sentiment`	float	A scalar value indicating the positivity or negativity of the lexeme.
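Several of the string attributes in the table are described by simple rules, which can be approximated in plain Python. The sketch below follows the wording of the table (shape mapping with runs truncated after length 4, a length-1 prefix, a length-3 suffix, and the ASCII check). The function names are hypothetical, and the results are not guaranteed to match spaCy's output in every edge case.

```python
def word_shape(text):
    """Map letters to x/X and digits to d, keeping at most 4 of any run."""
    shape = []
    run_char, run_len = "", 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation and other characters are kept as-is
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        if run_len <= 4:
            shape.append(mapped)
    return "".join(shape)


def prefix(text, n=1):
    """Length-n substring from the start of the word."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the word."""
    return text[-n:]


def is_ascii(text):
    """True if every character has a code point below 128."""
    return all(ord(c) < 128 for c in text)


shapes = [word_shape(w) for w in ["Apple", "spaCy", "12345"]]
```

With these definitions, `prefix("apple")` and `suffix("apple")` give the first one and last three characters respectively.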
