spaCy/website/docs/api/lexeme.md
2022-12-05 08:56:15 +01:00

title: Lexeme
teaser: An entry in the vocabulary
tag: class
source: spacy/lexeme.pyx

A Lexeme has no string context: it is a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
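
The type/token distinction can be illustrated with a toy model in plain Python. This is only a sketch, not spaCy's implementation: `ToyLexeme`, `ToyToken`, `ToyVocab` and `tokenize` are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent vocabulary entry (a word type)."""
    text: str


@dataclass(frozen=True)
class ToyToken:
    """A word in context: a position in a text plus a shared lexeme."""
    index: int
    lexeme: ToyLexeme


class ToyVocab:
    """Stores exactly one ToyLexeme per distinct string."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, text):
        # Create the entry on first lookup, then always return the same object.
        if text not in self._entries:
            self._entries[text] = ToyLexeme(text)
        return self._entries[text]


def tokenize(vocab, text):
    # Whitespace splitting is a simplification for this sketch.
    return [ToyToken(i, vocab[word]) for i, word in enumerate(text.split())]


vocab = ToyVocab()
tokens = tokenize(vocab, "the cat saw the dog")

# "the" occurs twice: two distinct tokens, but one shared lexeme.
first_the, second_the = tokens[0], tokens[3]
assert first_the != second_the
assert first_the.lexeme is second_the.lexeme
```

Anything that depends on position in a sentence would live on the token in this model, while the lexeme only carries what can be derived from the string alone.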

Lexeme.__init__

Create a Lexeme object.

Name Description
vocab The parent vocabulary. Vocab
orth The orth id of the lexeme. int

Lexeme.set_flag

Change the value of a boolean flag.

Example

COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
nlp.vocab["spaCy"].set_flag(COOL_FLAG, True)

Name Description
flag_id The attribute ID of the flag to set. int
value The new value of the flag. bool

Lexeme.check_flag

Check the value of a boolean flag.

Example

is_my_library = lambda text: text in ["spaCy", "Thinc"]
MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY)

Name Description
flag_id The attribute ID of the flag to query. int
RETURNS The value of the flag. bool
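
Since the flags of a lexeme are held in a single integer (see the flags attribute), one way to picture set_flag and check_flag is as bit operations on that integer. The following is a minimal sketch under that assumption, not the library's code; the functions and the flag IDs `IS_COOL` and `IS_LIBRARY` are made up for illustration.

```python
def set_flag(flags: int, flag_id: int, value: bool) -> int:
    """Return `flags` with bit `flag_id` switched on or off."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if bit `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


# Hypothetical flag IDs for this sketch.
IS_COOL = 3
IS_LIBRARY = 5

flags = 0
flags = set_flag(flags, IS_COOL, True)
flags = set_flag(flags, IS_LIBRARY, True)
flags = set_flag(flags, IS_COOL, False)  # switch one flag off again

assert check_flag(flags, IS_LIBRARY)
assert not check_flag(flags, IS_COOL)
```

Packing flags this way keeps every boolean attribute of an entry in one machine word, which makes a check a single mask operation.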

Lexeme.similarity

Compute a semantic similarity estimate. Defaults to cosine over vectors.

Example

apple = nlp.vocab["apple"]
orange = nlp.vocab["orange"]
apple_orange = apple.similarity(orange)
orange_apple = orange.similarity(apple)
assert apple_orange == orange_apple

Name Description
other The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. Union[Doc, Span, Token, Lexeme]
RETURNS A scalar similarity score. Higher is more similar. float
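
For reference, cosine similarity over two vectors can be written out in plain Python as below. This is a sketch of the general formula only; the vectors are made-up 3-dimensional values rather than real word vectors, and returning 0.0 for a zero vector is an assumption of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors, not real word vectors.
apple_vec = [1.0, 2.0, 0.5]
orange_vec = [0.8, 1.9, 1.0]
unrelated_vec = [-2.0, 1.0, 0.0]

# The measure is symmetric, matching the assertion in the example above.
assert cosine_similarity(apple_vec, orange_vec) == cosine_similarity(orange_vec, apple_vec)
# Orthogonal vectors score 0.0.
assert cosine_similarity(apple_vec, unrelated_vec) == 0.0
```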

Lexeme.has_vector

A boolean value indicating whether a word vector is associated with the lexeme.

Example

apple = nlp.vocab["apple"]
assert apple.has_vector

Name Description
RETURNS Whether the lexeme has vector data attached. bool

Lexeme.vector

A real-valued meaning representation.

Example

apple = nlp.vocab["apple"]
assert apple.vector.dtype == "float32"
assert apple.vector.shape == (300,)  # holds for pipelines with 300-dimensional vectors

Name Description
RETURNS A 1-dimensional array representing the lexeme's vector. numpy.ndarray[ndim=1, dtype=float32]

Lexeme.vector_norm

The L2 norm of the lexeme's vector representation.

Example

apple = nlp.vocab["apple"]
pasta = nlp.vocab["pasta"]
apple.vector_norm  # 7.1346845626831055
pasta.vector_norm  # 7.759851932525635
assert apple.vector_norm != pasta.vector_norm

Name Description
RETURNS The L2 norm of the vector representation. float
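
The L2 norm is the square root of the sum of the squared vector components. The sketch below spells this out in plain Python on a made-up two-dimensional vector; `l2_norm` and `normalize` are illustrative helpers, not part of the API described here.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


toy_vector = [3.0, 4.0]  # made-up values, not a real word vector
toy_norm = l2_norm(toy_vector)
unit = normalize(toy_vector)

assert toy_norm == 5.0
assert abs(l2_norm(unit) - 1.0) < 1e-12
```

Dividing a vector by its norm gives a unit-length vector, which is why having the norm available is convenient when computing cosine similarity.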

Attributes

Name Description
vocab The lexeme's vocabulary. Vocab
text Verbatim text content. str
orth ID of the verbatim text content. int
orth_ Verbatim text content (identical to Lexeme.text). Exists mostly for consistency with the other attributes. str
rank Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. int
flags Container of the lexeme's binary flags. int
norm The lexeme's norm, i.e. a normalized form of the lexeme text. int
norm_ The lexeme's norm, i.e. a normalized form of the lexeme text. str
lower Lowercase form of the word. int
lower_ Lowercase form of the word. str
shape Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". int
shape_ Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". str
prefix Length-N substring from the start of the word. Defaults to N=1. int
prefix_ Length-N substring from the start of the word. Defaults to N=1. str
suffix Length-N substring from the end of the word. Defaults to N=3. int
suffix_ Length-N substring from the end of the word. Defaults to N=3. str
is_alpha Does the lexeme consist of alphabetic characters? Equivalent to lexeme.text.isalpha(). bool
is_ascii Does the lexeme consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in lexeme.text). bool
is_digit Does the lexeme consist of digits? Equivalent to lexeme.text.isdigit(). bool
is_lower Is the lexeme in lowercase? Equivalent to lexeme.text.islower(). bool
is_upper Is the lexeme in uppercase? Equivalent to lexeme.text.isupper(). bool
is_title Is the lexeme in titlecase? Equivalent to lexeme.text.istitle(). bool
is_punct Is the lexeme punctuation? bool
is_left_punct Is the lexeme a left punctuation mark, e.g. "("? bool
is_right_punct Is the lexeme a right punctuation mark, e.g. ")"? bool
is_space Does the lexeme consist of whitespace characters? Equivalent to lexeme.text.isspace(). bool
is_bracket Is the lexeme a bracket? bool
is_quote Is the lexeme a quotation mark? bool
is_currency Is the lexeme a currency symbol? bool
like_url Does the lexeme resemble a URL? bool
like_num Does the lexeme represent a number, e.g. "10.9", "10" or "ten"? bool
like_email Does the lexeme resemble an email address? bool
is_oov Is the lexeme out-of-vocabulary (i.e. does it not have a word vector)? bool
is_stop Is the lexeme part of a "stop list"? bool
lang Language of the parent vocabulary. int
lang_ Language of the parent vocabulary. str
prob Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). float
cluster Brown cluster ID. int
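
Several of the string-valued attributes in the table above are simple functions of the lexeme text. The sketch below re-implements a few of them in plain Python, following the descriptions in the table rather than the library's source; the function names are hypothetical, and edge cases may be handled differently by the real attributes.

```python
def word_shape(text: str) -> str:
    """Map letters to x/X and digits to d, keeping at most 4 of each run."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation and other characters are kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


def prefix(text: str, n: int = 1) -> str:
    """Length-n substring from the start of the word."""
    return text[:n]


def suffix(text: str, n: int = 3) -> str:
    """Length-n substring from the end of the word (n must be positive)."""
    return text[-n:]


def is_ascii(text: str) -> bool:
    """True if every character is in the ASCII range."""
    return all(ord(c) < 128 for c in text)


assert word_shape("spaCy") == "xxxXx"
assert word_shape("Mississippi") == "Xxxxx"  # run of lowercase letters truncated to 4
assert word_shape("10") == "dd"
assert prefix("spaCy") == "s"
assert suffix("spaCy") == "aCy"
assert is_ascii("spaCy")
```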