spaCy/website/docs/api/lexeme.md
Ines Montani e597110d31
💫 Update website (#3285)
2019-02-17 19:31:19 +01:00


| title | teaser | tag | source |
| --- | --- | --- | --- |
| Lexeme | An entry in the vocabulary | class | spacy/lexeme.pyx |

A Lexeme has no string context: it's a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
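The type/token distinction can be illustrated without spaCy. The sketch below is a toy model, assuming a plain dict as a stand-in for the vocabulary; `build_toy_vocab` and the entry fields are hypothetical names chosen for illustration and do not reflect spaCy's implementation.

```python
# Toy illustration only: a plain dict stands in for the vocabulary.
# `build_toy_vocab` is a hypothetical helper, not part of spaCy.

def build_toy_vocab(tokens):
    """Map each distinct string (word type) to one context-free entry."""
    vocab = {}
    for text in tokens:
        # Only context-independent features are stored per type.
        vocab.setdefault(
            text,
            {"text": text, "lower": text.lower(), "is_alpha": text.isalpha()},
        )
    return vocab


tokens = "the cat saw the other cat".split()
toy_vocab = build_toy_vocab(tokens)

n_tokens = len(tokens)    # six word tokens
n_types = len(toy_vocab)  # four word types: the, cat, saw, other
```

Both occurrences of "cat" share a single entry, which is why nothing that depends on the surrounding sentence can be stored on it.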

Lexeme.__init__

Create a Lexeme object.

| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | The parent vocabulary. |
| `orth` | int | The orth id of the lexeme. |
| **RETURNS** | `Lexeme` | The newly constructed object. |

Lexeme.set_flag

Change the value of a boolean flag.

Example

COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
nlp.vocab[u"spaCy"].set_flag(COOL_FLAG, True)

| Name | Type | Description |
| --- | --- | --- |
| `flag_id` | int | The attribute ID of the flag to set. |
| `value` | bool | The new value of the flag. |

Lexeme.check_flag

Check the value of a boolean flag.

Example

is_my_library = lambda text: text in [u"spaCy", u"Thinc"]
MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
assert nlp.vocab[u"spaCy"].check_flag(MY_LIBRARY)

| Name | Type | Description |
| --- | --- | --- |
| `flag_id` | int | The attribute ID of the flag to query. |
| **RETURNS** | bool | The value of the flag. |
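One way to picture boolean flags is as bits of a single integer, which is consistent with the `flags` attribute being typed as int. The following is a minimal sketch under that assumption; `set_bit` and `check_bit` are hypothetical helpers rather than spaCy functions, and the flag IDs are made up.

```python
# Sketch only: models boolean flags as bits of one integer.
# `set_bit`, `check_bit` and the IDs below are hypothetical.

def set_bit(flags, flag_id, value):
    """Return `flags` with bit `flag_id` switched on or off."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


def check_bit(flags, flag_id):
    """Return True if bit `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


COOL_FLAG = 3   # made-up flag ID
MY_LIBRARY = 5  # made-up flag ID

flags = 0
flags = set_bit(flags, COOL_FLAG, True)
flags = set_bit(flags, MY_LIBRARY, True)
flags = set_bit(flags, COOL_FLAG, False)  # switch it off again
```

After these calls only the `MY_LIBRARY` bit remains set, so checking `COOL_FLAG` yields `False` and checking `MY_LIBRARY` yields `True`.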

Lexeme.similarity

Compute a semantic similarity estimate. Defaults to cosine over vectors.

Example

apple = nlp.vocab[u"apple"]
orange = nlp.vocab[u"orange"]
apple_orange = apple.similarity(orange)
orange_apple = orange.similarity(apple)
assert apple_orange == orange_apple

| Name | Type | Description |
| --- | --- | --- |
| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
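For reference, cosine similarity over two vectors can be sketched in plain Python. This is an illustration of the formula only, assuming small hand-written vectors; `cosine_similarity` is a hypothetical helper, and returning `0.0` for an all-zero vector is an assumption of this sketch, not documented spaCy behaviour.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: zero vectors count as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real word vectors.
apple_vec = [1.0, 2.0, 0.0]
orange_vec = [2.0, 1.0, 0.0]

apple_orange = cosine_similarity(apple_vec, orange_vec)
orange_apple = cosine_similarity(orange_vec, apple_vec)
```

The measure is symmetric, so the argument order does not matter, and orthogonal vectors score zero.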

Lexeme.has_vector

A boolean value indicating whether a word vector is associated with the lexeme.

Example

apple = nlp.vocab[u"apple"]
assert apple.has_vector

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | bool | Whether the lexeme has vector data attached. |

Lexeme.vector

A real-valued meaning representation.

Example

apple = nlp.vocab[u"apple"]
assert apple.vector.dtype == "float32"
assert apple.vector.shape == (300,)

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the lexeme's semantics. |

Lexeme.vector_norm

The L2 norm of the lexeme's vector representation.

Example

apple = nlp.vocab[u"apple"]
pasta = nlp.vocab[u"pasta"]
apple.vector_norm  # 7.1346845626831055
pasta.vector_norm  # 7.759851932525635
assert apple.vector_norm != pasta.vector_norm

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | float | The L2 norm of the vector representation. |
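The L2 norm is the square root of the sum of squared components. The sketch below shows the arithmetic on a made-up two-dimensional vector; `l2_norm` is a hypothetical helper for illustration and is not how spaCy computes the value internally.

```python
import math


def l2_norm(vector):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


toy_vector = [3.0, 4.0]        # made-up values, not a real word vector
toy_norm = l2_norm(toy_vector)  # sqrt(9 + 16)

# Dividing by the norm rescales the vector to (approximately) unit length.
unit_vector = [x / toy_norm for x in toy_vector]
```

Having the norm available is what makes it cheap to normalize vectors before comparing them.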

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | The lexeme's vocabulary. |
| `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. |
| `lex_id` | int | ID of the lexeme's lexical type. |
| `rank` | int | Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. |
| `flags` | int | Container of the lexeme's binary flags. |
| `norm` | int | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `lower` | int | Lowercase form of the word. |
| `lower_` | unicode | Lowercase form of the word. |
| `shape` | int | Transform of the word's string, to show orthographic features. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. |
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the word. Defaults to `N=3`. |
| `is_alpha` | bool | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. |
| `is_ascii` | bool | Does the lexeme consist of ASCII characters? Equivalent to `not any(ord(c) >= 128 for c in lexeme.text)`. |
| `is_digit` | bool | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. |
| `is_lower` | bool | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. |
| `is_upper` | bool | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. |
| `is_title` | bool | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. |
| `is_punct` | bool | Is the lexeme punctuation? |
| `is_left_punct` | bool | Is the lexeme a left punctuation mark, e.g. `(`? |
| `is_right_punct` | bool | Is the lexeme a right punctuation mark, e.g. `)`? |
| `is_space` | bool | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. |
| `is_bracket` | bool | Is the lexeme a bracket? |
| `is_quote` | bool | Is the lexeme a quotation mark? |
| `is_currency` (v2.0.8) | bool | Is the lexeme a currency symbol? |
| `like_url` | bool | Does the lexeme resemble a URL? |
| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the lexeme resemble an email address? |
| `is_oov` | bool | Is the lexeme out-of-vocabulary? |
| `is_stop` | bool | Is the lexeme part of a "stop list"? |
| `lang` | int | Language of the parent vocabulary. |
| `lang_` | unicode | Language of the parent vocabulary. |
| `prob` | float | Smoothed log probability estimate of the lexeme's type. |
| `cluster` | int | Brown cluster ID. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. |
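Several of the string-based attributes can be approximated with built-in `str` methods. The sketch below covers only attributes for which the table gives a default N or a Python equivalent; `string_features` is a hypothetical helper, the ASCII check is written as `all(ord(c) < 128 ...)`, and spaCy's own implementation may differ, for example through language-specific rules.

```python
# Sketch only: approximates a few context-free string attributes.
# `string_features` is a hypothetical helper, not part of spaCy.

def string_features(text, prefix_len=1, suffix_len=3):
    """Compute string-level features for one word type."""
    return {
        "lower_": text.lower(),
        "prefix_": text[:prefix_len],    # first N characters
        "suffix_": text[-suffix_len:],   # last N characters
        "is_alpha": text.isalpha(),
        "is_ascii": all(ord(c) < 128 for c in text),
        "is_digit": text.isdigit(),
        "is_lower": text.islower(),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
        "is_space": text.isspace(),
    }


apple = string_features("Apple")
cafe = string_features("Café")
```

For "Apple" the prefix is "A" and the suffix is "ple"; "Café" is alphabetic but not ASCII because of the accented character.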