mirror of
https://github.com/explosion/spaCy.git
synced 2025-10-24 12:41:23 +03:00
title | teaser | tag | source |
---|---|---|---|
Lexeme | An entry in the vocabulary | class | spacy/lexeme.pyx |
A Lexeme has no string context: it's a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
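The distinction between a word type and a word token can be sketched in plain Python. This is an illustration only: `TinyVocab` and `TinyLexeme` are hypothetical names, not spaCy classes, and the sequential `orth` IDs are a simplification chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyLexeme:
    """A context-free vocabulary entry (a word type). Hypothetical class."""
    orth: int   # integer ID of the verbatim text (sequential here, by assumption)
    text: str   # verbatim text content


class TinyVocab:
    """Maps each distinct string to exactly one shared TinyLexeme."""

    def __init__(self):
        self._by_text = {}

    def __getitem__(self, text):
        # Create the entry on first lookup, then always return the same object.
        if text not in self._by_text:
            self._by_text[text] = TinyLexeme(orth=len(self._by_text), text=text)
        return self._by_text[text]


vocab = TinyVocab()
words = "the cat saw the dog".split()

# A token is modelled as a (position, lexeme) pair; the position is its context.
tokens = [(i, vocab[w]) for i, w in enumerate(words)]

first_the, second_the = tokens[0], tokens[3]
assert first_the[0] != second_the[0]   # two different tokens...
assert first_the[1] is second_the[1]   # ...backed by a single word type
```

Because the entry carries no position or sentence, anything that depends on context (such as a part-of-speech tag) has nowhere to live on it.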
Lexeme.__init__
Create a Lexeme object.
Name | Type | Description |
---|---|---|
vocab | Vocab | The parent vocabulary. |
orth | int | The orth id of the lexeme. |
RETURNS | Lexeme | The newly constructed object. |
Lexeme.set_flag
Change the value of a boolean flag.
Example
    COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
    nlp.vocab[u'spaCy'].set_flag(COOL_FLAG, True)
Name | Type | Description |
---|---|---|
flag_id | int | The attribute ID of the flag to set. |
value | bool | The new value of the flag. |
Lexeme.check_flag
Check the value of a boolean flag.
Example
    is_my_library = lambda text: text in [u"spaCy", u"Thinc"]
    MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
    assert nlp.vocab[u"spaCy"].check_flag(MY_LIBRARY) == True
Name | Type | Description |
---|---|---|
flag_id | int | The attribute ID of the flag to query. |
RETURNS | bool | The value of the flag. |
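One common way to keep many boolean flags in a single integer is a bit field, where the flag ID selects a bit position. The sketch below shows that technique in plain Python. Whether spaCy stores flags exactly this way is an assumption; the free functions `set_flag` and `check_flag` here are hypothetical stand-ins, not the library's methods.

```python
def set_flag(flags: int, flag_id: int, value: bool) -> int:
    """Return `flags` with the bit at position `flag_id` set to `value`."""
    mask = 1 << flag_id
    if value:
        return flags | mask    # switch the bit on
    return flags & ~mask       # switch the bit off


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if the bit at position `flag_id` is set."""
    return bool(flags & (1 << flag_id))


COOL_FLAG = 5   # hypothetical flag ID, chosen for illustration
flags = 0       # no flags set

flags = set_flag(flags, COOL_FLAG, True)
assert check_flag(flags, COOL_FLAG)
assert not check_flag(flags, COOL_FLAG + 1)   # neighbouring flags are untouched

flags = set_flag(flags, COOL_FLAG, False)
assert flags == 0
```

Setting and checking are both single bitwise operations, which is why this layout suits attributes that are queried very often.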
Lexeme.similarity
Compute a semantic similarity estimate. Defaults to cosine over vectors.
Example
    apple = nlp.vocab[u"apple"]
    orange = nlp.vocab[u"orange"]
    apple_orange = apple.similarity(orange)
    orange_apple = orange.similarity(apple)
    assert apple_orange == orange_apple
Name | Type | Description |
---|---|---|
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. |
RETURNS | float | A scalar similarity score. Higher is more similar. |
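As a rough illustration of "cosine over vectors", here is a minimal pure-Python cosine similarity. The vectors are made-up three-dimensional values rather than real word vectors, and returning 0.0 for a zero vector is an assumption of this sketch, not documented behaviour.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


apple = [1.0, 2.0, 0.5]    # made-up values for illustration
orange = [0.9, 1.8, 0.7]   # made-up values for illustration

# The measure is symmetric, which is what the example above asserts.
assert cosine_similarity(apple, orange) == cosine_similarity(orange, apple)

# A vector compared with itself scores (approximately) 1.0.
assert math.isclose(cosine_similarity(apple, apple), 1.0)
```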
Lexeme.has_vector
A boolean value indicating whether a word vector is associated with the lexeme.
Example
    apple = nlp.vocab[u"apple"]
    assert apple.has_vector
Name | Type | Description |
---|---|---|
RETURNS | bool | Whether the lexeme has vector data attached. |
Lexeme.vector
A real-valued meaning representation.
Example
    apple = nlp.vocab[u"apple"]
    assert apple.vector.dtype == "float32"
    assert apple.vector.shape == (300,)
Name | Type | Description |
---|---|---|
RETURNS | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the lexeme's semantics. |
Lexeme.vector_norm
The L2 norm of the lexeme's vector representation.
Example
    apple = nlp.vocab[u"apple"]
    pasta = nlp.vocab[u"pasta"]
    apple.vector_norm  # 7.1346845626831055
    pasta.vector_norm  # 7.759851932525635
    assert apple.vector_norm != pasta.vector_norm
Name | Type | Description |
---|---|---|
RETURNS | float | The L2 norm of the vector representation. |
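For reference, the L2 norm is the square root of the sum of squared components. The sketch below computes it in plain Python on a made-up two-dimensional vector; `l2_norm` is a hypothetical helper name, not a spaCy function.

```python
import math


def l2_norm(vector):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


v = [3.0, 4.0]             # made-up vector for illustration
assert l2_norm(v) == 5.0   # sqrt(3^2 + 4^2)

# Dividing by the norm gives a unit-length vector.
unit = [x / l2_norm(v) for x in v]
assert math.isclose(l2_norm(unit), 1.0)
```

Having the norm available ahead of time is convenient for cosine similarity, which divides the dot product by the two norms.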
Attributes
Name | Type | Description |
---|---|---|
vocab | Vocab | The lexeme's vocabulary. |
text | unicode | Verbatim text content. |
orth | int | ID of the verbatim text content. |
orth_ | unicode | Verbatim text content (identical to Lexeme.text). Exists mostly for consistency with the other attributes. |
lex_id | int | ID of the lexeme's lexical type. |
rank | int | Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. |
flags | int | Container of the lexeme's binary flags. |
norm | int | The lexeme's norm, i.e. a normalized form of the lexeme text. |
norm_ | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
lower | int | Lowercase form of the word. |
lower_ | unicode | Lowercase form of the word. |
shape | int | Transform of the word's string, to show orthographic features. |
shape_ | unicode | Transform of the word's string, to show orthographic features. |
prefix | int | Length-N substring from the start of the word. Defaults to N=1. |
prefix_ | unicode | Length-N substring from the start of the word. Defaults to N=1. |
suffix | int | Length-N substring from the end of the word. Defaults to N=3. |
suffix_ | unicode | Length-N substring from the end of the word. Defaults to N=3. |
is_alpha | bool | Does the lexeme consist of alphabetic characters? Equivalent to lexeme.text.isalpha(). |
is_ascii | bool | Does the lexeme consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in lexeme.text). |
is_digit | bool | Does the lexeme consist of digits? Equivalent to lexeme.text.isdigit(). |
is_lower | bool | Is the lexeme in lowercase? Equivalent to lexeme.text.islower(). |
is_upper | bool | Is the lexeme in uppercase? Equivalent to lexeme.text.isupper(). |
is_title | bool | Is the lexeme in titlecase? Equivalent to lexeme.text.istitle(). |
is_punct | bool | Is the lexeme punctuation? |
is_left_punct | bool | Is the lexeme a left punctuation mark, e.g. "("? |
is_right_punct | bool | Is the lexeme a right punctuation mark, e.g. ")"? |
is_space | bool | Does the lexeme consist of whitespace characters? Equivalent to lexeme.text.isspace(). |
is_bracket | bool | Is the lexeme a bracket? |
is_quote | bool | Is the lexeme a quotation mark? |
is_currency (new in v2.0.8) | bool | Is the lexeme a currency symbol? |
like_url | bool | Does the lexeme resemble a URL? |
like_num | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
like_email | bool | Does the lexeme resemble an email address? |
is_oov | bool | Is the lexeme out-of-vocabulary? |
is_stop | bool | Is the lexeme part of a "stop list"? |
lang | int | Language of the parent vocabulary. |
lang_ | unicode | Language of the parent vocabulary. |
prob | float | Smoothed log probability estimate of the lexeme's type. |
cluster | int | Brown cluster ID. |
sentiment | float | A scalar value indicating the positivity or negativity of the lexeme. |
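Several of the string-based attributes in the table are described in terms of plain Python expressions. The sketch below evaluates such expressions on ordinary strings, with no spaCy objects involved. The helper names `is_ascii`, `prefix` and `suffix` are hypothetical, and the ASCII test is written as an all-characters-below-128 check.

```python
def is_ascii(text: str) -> bool:
    """True if every character has a code point below 128."""
    return all(ord(c) < 128 for c in text)


def prefix(text: str, n: int = 1) -> str:
    """Length-n substring from the start of the word."""
    return text[:n]


def suffix(text: str, n: int = 3) -> str:
    """Length-n substring from the end of the word."""
    return text[-n:]


assert is_ascii("spaCy")
assert not is_ascii("café")            # "é" is outside the ASCII range

assert prefix("apple") == "a"          # default N=1
assert suffix("apple") == "ple"        # default N=3
assert suffix("hi") == "hi"            # shorter words are returned whole

# The built-in str predicates named in the table:
assert "Apple".istitle() and not "apple".istitle()
assert "APPLE".isupper() and "apple".islower()
assert "10".isdigit() and not "10.9".isdigit()
assert " \t".isspace()
```

Note that `"10.9".isdigit()` is false, which illustrates why the table lists the broader like_num separately from is_digit.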