mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-10 09:16:31 +03:00
title | teaser | tag | source |
---|---|---|---|
Lexeme | An entry in the vocabulary | class | spacy/lexeme.pyx |
A Lexeme has no string context; it's a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
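The type/token distinction can be illustrated without spaCy at all. The following is a minimal plain-Python sketch, assuming whitespace splitting as a stand-in for real tokenization; the sample sentence is made up for illustration.

```python
from collections import Counter

# Every occurrence in the text is a token; every distinct string is a type.
tokens = "the cat saw the dog".split()
type_counts = Counter(tokens)

print(len(tokens))         # 5 tokens
print(len(type_counts))    # 4 types
print(type_counts["the"])  # 2 tokens share the single type "the"
```

Both occurrences of "the" map to one vocabulary entry, which is why context-dependent annotations such as part-of-speech tags cannot live on that entry.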
Lexeme.__init__
Create a Lexeme object.
Name | Type | Description |
---|---|---|
vocab | Vocab | The parent vocabulary. |
orth | int | The orth id of the lexeme. |
RETURNS | Lexeme | The newly constructed object. |
Lexeme.set_flag
Change the value of a boolean flag.
Example
COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
nlp.vocab["spaCy"].set_flag(COOL_FLAG, True)
Name | Type | Description |
---|---|---|
flag_id | int | The attribute ID of the flag to set. |
value | bool | The new value of the flag. |
Lexeme.check_flag
Check the value of a boolean flag.
Example
is_my_library = lambda text: text in ["spaCy", "Thinc"]
MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY) == True
Name | Type | Description |
---|---|---|
flag_id | int | The attribute ID of the flag to query. |
RETURNS | bool | The value of the flag. |
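The `flags` attribute is described in this reference as an integer container of binary flags. One common way to pack boolean flags into an integer is a bit field. The sketch below shows that general technique only; it is an assumption for illustration, not spaCy's implementation, and `set_bit`, `check_bit` and the flag ID value are hypothetical.

```python
def set_bit(flags, flag_id, value):
    """Return a copy of `flags` with bit number `flag_id` set to `value`."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


def check_bit(flags, flag_id):
    """Return True if bit number `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


COOL_FLAG = 5  # hypothetical flag ID
flags = 0

flags = set_bit(flags, COOL_FLAG, True)
is_cool = check_bit(flags, COOL_FLAG)       # True

flags = set_bit(flags, COOL_FLAG, False)
is_still_cool = check_bit(flags, COOL_FLAG)  # False
```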
Lexeme.similarity
Compute a semantic similarity estimate. Defaults to cosine over vectors.
Example
apple = nlp.vocab["apple"]
orange = nlp.vocab["orange"]
apple_orange = apple.similarity(orange)
orange_apple = orange.similarity(apple)
assert apple_orange == orange_apple
Name | Type | Description |
---|---|---|
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. |
RETURNS | float | A scalar similarity score. Higher is more similar. |
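Cosine similarity is the dot product of two vectors divided by the product of their L2 norms. The following is a minimal plain-Python sketch of that formula, not spaCy's code. The helper name `cosine_similarity`, the zero-vector handling, and the toy vectors are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (made-up values, not real word vectors).
apple_vec = [1.0, 2.0, 0.5]
orange_vec = [0.8, 1.9, 0.7]

apple_orange = cosine_similarity(apple_vec, orange_vec)
orange_apple = cosine_similarity(orange_vec, apple_vec)
assert apple_orange == orange_apple  # the measure is symmetric
```

The symmetry asserted here mirrors the example above: swapping the two operands does not change the score.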
Lexeme.has_vector
A boolean value indicating whether a word vector is associated with the lexeme.
Example
apple = nlp.vocab["apple"]
assert apple.has_vector
Name | Type | Description |
---|---|---|
RETURNS | bool | Whether the lexeme has vector data attached. |
Lexeme.vector
A real-valued meaning representation.
Example
apple = nlp.vocab["apple"]
assert apple.vector.dtype == "float32"
assert apple.vector.shape == (300,)
Name | Type | Description |
---|---|---|
RETURNS | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the lexeme's semantics. |
Lexeme.vector_norm
The L2 norm of the lexeme's vector representation.
Example
apple = nlp.vocab["apple"]
pasta = nlp.vocab["pasta"]
apple.vector_norm  # 7.1346845626831055
pasta.vector_norm  # 7.759851932525635
assert apple.vector_norm != pasta.vector_norm
Name | Type | Description |
---|---|---|
RETURNS | float | The L2 norm of the vector representation. |
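For reference, the L2 norm of a vector is the square root of the sum of its squared components. This is a minimal plain-Python sketch of that definition; the helper name `l2_norm` and the toy vector are assumptions for illustration, not part of the spaCy API.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


toy_vector = [3.0, 4.0]    # made-up values, not a real word vector
print(l2_norm(toy_vector))  # 5.0
```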
Attributes
Name | Type | Description |
---|---|---|
vocab | Vocab | The lexeme's vocabulary. |
text | unicode | Verbatim text content. |
orth | int | ID of the verbatim text content. |
orth_ | unicode | Verbatim text content (identical to Lexeme.text). Exists mostly for consistency with the other attributes. |
rank | int | Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. |
flags | int | Container of the lexeme's binary flags. |
norm | int | The lexeme's norm, i.e. a normalized form of the lexeme text. |
norm_ | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
lower | int | Lowercase form of the word. |
lower_ | unicode | Lowercase form of the word. |
shape | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". |
shape_ | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". |
prefix | int | Length-N substring from the start of the word. Defaults to N=1. |
prefix_ | unicode | Length-N substring from the start of the word. Defaults to N=1. |
suffix | int | Length-N substring from the end of the word. Defaults to N=3. |
suffix_ | unicode | Length-N substring from the end of the word. Defaults to N=3. |
is_alpha | bool | Does the lexeme consist of alphabetic characters? Equivalent to lexeme.text.isalpha(). |
is_ascii | bool | Does the lexeme consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in lexeme.text). |
is_digit | bool | Does the lexeme consist of digits? Equivalent to lexeme.text.isdigit(). |
is_lower | bool | Is the lexeme in lowercase? Equivalent to lexeme.text.islower(). |
is_upper | bool | Is the lexeme in uppercase? Equivalent to lexeme.text.isupper(). |
is_title | bool | Is the lexeme in titlecase? Equivalent to lexeme.text.istitle(). |
is_punct | bool | Is the lexeme punctuation? |
is_left_punct | bool | Is the lexeme a left punctuation mark, e.g. "("? |
is_right_punct | bool | Is the lexeme a right punctuation mark, e.g. ")"? |
is_space | bool | Does the lexeme consist of whitespace characters? Equivalent to lexeme.text.isspace(). |
is_bracket | bool | Is the lexeme a bracket? |
is_quote | bool | Is the lexeme a quotation mark? |
is_currency (new in v2.0.8) | bool | Is the lexeme a currency symbol? |
like_url | bool | Does the lexeme resemble a URL? |
like_num | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
like_email | bool | Does the lexeme resemble an email address? |
is_oov | bool | Is the lexeme out-of-vocabulary, i.e. does it lack a word vector? |
is_stop | bool | Is the lexeme part of a "stop list"? |
lang | int | Language of the parent vocabulary. |
lang_ | unicode | Language of the parent vocabulary. |
prob | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
cluster | int | Brown cluster ID. |
sentiment | float | A scalar value indicating the positivity or negativity of the lexeme. |
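Several of the string-valued attributes above are simple functions of the lexeme text. The sketch below re-creates the described behaviour of the shape, prefix, suffix and ASCII check in plain Python, following only the descriptions in the table. It is not spaCy's implementation, and the `sketch_*` helper names are hypothetical. The ASCII check assumes the intended meaning is that every character has a code point below 128.

```python
def sketch_shape(text, max_run=4):
    """Map letters to x/X and digits to d; cap runs of one symbol at max_run."""
    out = []
    run_char = ""
    run_len = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other characters are kept as-is
        if mapped == run_char:
            run_len += 1
        else:
            run_char = mapped
            run_len = 1
        if run_len <= max_run:
            out.append(mapped)
    return "".join(out)


def sketch_prefix(text, n=1):
    """Length-n substring from the start of the word."""
    return text[:n]


def sketch_suffix(text, n=3):
    """Length-n substring from the end of the word."""
    return text[-n:]


def sketch_is_ascii(text):
    """True if every character has a code point below 128."""
    return all(ord(c) < 128 for c in text)


print(sketch_shape("spaCy"))                 # xxxXx
print(sketch_shape("internationalization"))  # xxxx
print(sketch_shape("12"))                    # dd
print(sketch_prefix("apple"))                # a
print(sketch_suffix("apple"))                # ple
print(sketch_is_ascii("café"))               # False
```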