spaCy/lexeme.md at 9e7deeaf48d2decce8c318dcdb922979e5ba1300

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 05:37:03 +03:00


## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-02-17 19:31:19 +01:00


title	teaser	tag	source
Lexeme	An entry in the vocabulary	class	spacy/lexeme.pyx

A Lexeme has no string context – it's a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).
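The type/token distinction can be illustrated without spaCy at all. The following is a minimal sketch in plain Python; the sample sentence and the whitespace split are assumptions made for illustration and are not how spaCy tokenizes text.

```python
from collections import Counter

# Hypothetical sample text; str.split() stands in for a real tokenizer.
tokens = "the cat sat on the mat".split()

# A token is one occurrence in context, so its position is part of its identity.
token_positions = list(enumerate(tokens))

# A type is the context-free entry shared by every occurrence of the same string.
type_counts = Counter(tokens)

print(len(tokens))         # 6 tokens
print(len(type_counts))    # 5 types, because "the" occurs twice
print(type_counts["the"])  # 2
```

Both occurrences of "the" map to the same type, which is why context-dependent annotations such as a part-of-speech tag belong to the token rather than to the vocabulary entry.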

Lexeme.__init__

Create a Lexeme object.

Name	Type	Description
`vocab`	`Vocab`	The parent vocabulary.
`orth`	int	The orth id of the lexeme.
RETURNS	`Lexeme`	The newly constructed object.

Lexeme.set_flag

Change the value of a boolean flag.

Example

COOL_FLAG = nlp.vocab.add_flag(lambda text: False)
nlp.vocab[u'spaCy'].set_flag(COOL_FLAG, True)

Name	Type	Description
`flag_id`	int	The attribute ID of the flag to set.
`value`	bool	The new value of the flag.

Lexeme.check_flag

Check the value of a boolean flag.

Example

is_my_library = lambda text: text in [u"spaCy", u"Thinc"]
MY_LIBRARY = nlp.vocab.add_flag(is_my_library)
assert nlp.vocab[u"spaCy"].check_flag(MY_LIBRARY)

Name	Type	Description
`flag_id`	int	The attribute ID of the flag to query.
RETURNS	bool	The value of the flag.
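Since the `flags` attribute is described as an integer container of binary flags, one way to picture `set_flag` and `check_flag` is as bit operations on a single integer. The sketch below is plain Python written under that assumption; the helper functions and the flag ID `5` are hypothetical and are not spaCy's implementation.

```python
def set_flag(flags, flag_id, value):
    """Return a copy of `flags` with bit number `flag_id` switched on or off."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


def check_flag(flags, flag_id):
    """Return True if bit number `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


COOL_FLAG = 5  # hypothetical flag ID, chosen only for this example

flags = 0
flags = set_flag(flags, COOL_FLAG, True)
print(check_flag(flags, COOL_FLAG))  # True

flags = set_flag(flags, COOL_FLAG, False)
print(check_flag(flags, COOL_FLAG))  # False
```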

Lexeme.similarity

Compute a semantic similarity estimate. Defaults to cosine similarity over the word vectors.

Example

apple = nlp.vocab[u"apple"]
orange = nlp.vocab[u"orange"]
apple_orange = apple.similarity(orange)
orange_apple = orange.similarity(apple)
assert apple_orange == orange_apple

Name	Type	Description
other	-	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects.
RETURNS	float	A scalar similarity score. Higher is more similar.
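To make the default metric concrete, here is a minimal sketch of cosine similarity in plain Python. The toy three-dimensional vectors are made up for illustration, and returning `0.0` for a zero-length vector is an assumption of this sketch rather than documented spaCy behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real word vectors typically have hundreds of dimensions.
apple = [1.0, 2.0, 0.5]
orange = [0.8, 1.9, 0.7]

# The measure is symmetric, matching the assertion in the example above.
assert cosine_similarity(apple, orange) == cosine_similarity(orange, apple)
```

Because the dot product and the product of the norms do not depend on argument order, swapping the two vectors gives the same score.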

Lexeme.has_vector

A boolean value indicating whether a word vector is associated with the lexeme.

Example

apple = nlp.vocab[u"apple"]
assert apple.has_vector

Name	Type	Description
RETURNS	bool	Whether the lexeme has vector data attached.

Lexeme.vector

A real-valued meaning representation.

Example

apple = nlp.vocab[u"apple"]
assert apple.vector.dtype == "float32"
assert apple.vector.shape == (300,)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=1, dtype='float32']`	A 1D NumPy array representing the lexeme's semantics.

Lexeme.vector_norm

The L2 norm of the lexeme's vector representation.

Example

apple = nlp.vocab[u"apple"]
pasta = nlp.vocab[u"pasta"]
apple.vector_norm  # 7.1346845626831055
pasta.vector_norm  # 7.759851932525635
assert apple.vector_norm != pasta.vector_norm

Name	Type	Description
RETURNS	float	The L2 norm of the vector representation.
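For reference, the L2 norm is the square root of the sum of squared components. The sketch below computes it in plain Python on a made-up two-dimensional vector; it illustrates the formula only and does not use spaCy.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


toy_vector = [3.0, 4.0]  # hypothetical vector chosen so the result is exact
norm = l2_norm(toy_vector)
print(norm)  # 5.0

# Dividing by the norm gives a vector of length (approximately) 1.
unit_vector = [x / norm for x in toy_vector]
print(math.isclose(l2_norm(unit_vector), 1.0))  # True
```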

Attributes

Name	Type	Description
`vocab`	`Vocab`	The lexeme's vocabulary.
`text`	unicode	Verbatim text content.
`orth`	int	ID of the verbatim text content.
`orth_`	unicode	Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes.
`lex_id`	int	ID of the lexeme's lexical type.
`rank`	int	Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors.
`flags`	int	Container of the lexeme's binary flags.
`norm`	int	The lexeme's norm, i.e. a normalized form of the lexeme text.
`norm_`	unicode	The lexeme's norm, i.e. a normalized form of the lexeme text.
`lower`	int	Lowercase form of the word.
`lower_`	unicode	Lowercase form of the word.
`shape`	int	Transform of the word's string, to show orthographic features.
`shape_`	unicode	Transform of the word's string, to show orthographic features.
`prefix`	int	Length-N substring from the start of the word. Defaults to `N=1`.
`prefix_`	unicode	Length-N substring from the start of the word. Defaults to `N=1`.
`suffix`	int	Length-N substring from the end of the word. Defaults to `N=3`.
`suffix_`	unicode	Length-N substring from the end of the word. Defaults to `N=3`.
`is_alpha`	bool	Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`.
`is_ascii`	bool	Does the lexeme consist of ASCII characters? Equivalent to `not any(ord(c) >= 128 for c in lexeme.text)`.
`is_digit`	bool	Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`.
`is_lower`	bool	Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`.
`is_upper`	bool	Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`.
`is_title`	bool	Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`.
`is_punct`	bool	Is the lexeme punctuation?
`is_left_punct`	bool	Is the lexeme a left punctuation mark, e.g. `(`?
`is_right_punct`	bool	Is the lexeme a right punctuation mark, e.g. `)`?
`is_space`	bool	Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`.
`is_bracket`	bool	Is the lexeme a bracket?
`is_quote`	bool	Is the lexeme a quotation mark?
`is_currency` (new in v2.0.8)	bool	Is the lexeme a currency symbol?
`like_url`	bool	Does the lexeme resemble a URL?
`like_num`	bool	Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc.
`like_email`	bool	Does the lexeme resemble an email address?
`is_oov`	bool	Is the lexeme out-of-vocabulary?
`is_stop`	bool	Is the lexeme part of a "stop list"?
`lang`	int	Language of the parent vocabulary.
`lang_`	unicode	Language of the parent vocabulary.
`prob`	float	Smoothed log probability estimate of the lexeme's type.
`cluster`	int	Brown cluster ID.
`sentiment`	float	A scalar value indicating the positivity or negativity of the lexeme.
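Several of the string-based attributes in the table can be approximated with ordinary Python string operations. The sketch below is plain Python, not spaCy code: `is_ascii`, `prefix`, `suffix` and `simple_shape` are hypothetical helpers, and `simple_shape` is a simplified stand-in for an orthographic transform that may differ from the one spaCy applies (for example in how long runs of the same character are handled).

```python
def is_ascii(text):
    """True if every character has a code point below 128."""
    return not any(ord(c) >= 128 for c in text)


def prefix(text, n=1):
    """Length-n substring from the start of the word."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the word."""
    return text[-n:]


def simple_shape(text):
    """Map uppercase letters to X, other letters to x, digits to d; keep the rest."""
    out = []
    for char in text:
        if char.isalpha():
            out.append("X" if char.isupper() else "x")
        elif char.isdigit():
            out.append("d")
        else:
            out.append(char)
    return "".join(out)


word = "Apple"
print(word.isalpha(), is_ascii(word), word.istitle())  # True True True
print(word.lower(), prefix(word), suffix(word))        # apple A ple
print(simple_shape("C3PO"))                            # XdXX
print(is_ascii("naïve"))                               # False
```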
