| title | teaser | tag | source |
| --- | --- | --- | --- |
| Token | An individual token, i.e. a word, punctuation symbol, whitespace, etc. | class | spacy/tokens/token.pyx |
Token.__init__
Construct a `Token` object.
Example
doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert token.text == u"Give"
| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A storage container for lexical types. |
| doc | Doc | The parent document. |
| offset | int | The index of the token within the document. |
| RETURNS | Token | The newly constructed object. |
Token.__len__
The number of unicode characters in the token, i.e. `len(token.text)`.
Example
doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert len(token) == 4
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of unicode characters in the token. |
Token.set_extension
Define a custom attribute on the `Token` which becomes available via `Token._`.
For details, see the documentation on custom attributes.
Example
from spacy.tokens import Token
fruit_getter = lambda token: token.text in (u"apple", u"pear", u"banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp(u"I have an apple")
assert doc[3]._.is_fruit
| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. |
| default | - | Optional default value of the attribute if no getter or method is defined. |
| method | callable | Set a custom method on the object, for example `token._.compare(other_token)`. |
| getter | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
| setter | callable | Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute. |
Token.get_extension
Look up a previously registered extension by name. Returns a 4-tuple
`(default, method, getter, setter)` if the extension is registered. Raises a
`KeyError` otherwise.
Example
from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
extension = Token.get_extension("is_fruit")
assert extension == (False, None, None, None)
| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension. |
| RETURNS | tuple | A `(default, method, getter, setter)` tuple of the extension. |
Token.has_extension
Check whether an extension has been registered on the `Token` class.
Example
from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
assert Token.has_extension("is_fruit")
| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension to check. |
| RETURNS | bool | Whether the extension has been registered. |
Token.remove_extension
Remove a previously registered extension. New in v2.0.11.
Example
from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
removed = Token.remove_extension("is_fruit")
assert not Token.has_extension("is_fruit")
| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension. |
| RETURNS | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
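The four extension methods above all work on the same `(default, method, getter, setter)` tuple. The following is a minimal plain-Python sketch of that lifecycle. The `ExtensionRegistry` class and its `resolve` helper are hypothetical names used only for illustration and are not spaCy's implementation; the rule that a getter takes precedence over the default is likewise an assumption of this sketch.

```python
# Hypothetical stand-in for an extension registry; NOT spaCy's implementation.
class ExtensionRegistry:
    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None, getter=None, setter=None):
        # Store the four slots in the order the documentation describes.
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        # A plain dict lookup raises KeyError for unregistered names.
        return self._extensions[name]

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # Return the removed (default, method, getter, setter) tuple.
        return self._extensions.pop(name)

    def resolve(self, name, obj):
        # Assumption of this sketch: a getter, if present, wins over the default.
        default, method, getter, setter = self._extensions[name]
        if getter is not None:
            return getter(obj)
        return default


registry = ExtensionRegistry()
registry.set_extension("is_fruit", getter=lambda text: text in ("apple", "pear", "banana"))
fruit_flag = registry.resolve("is_fruit", "apple")   # computed by the getter
removed = registry.remove_extension("is_fruit")      # 4-tuple of the removed extension
```

Keeping all four slots in one tuple means a lookup, a removal and an attribute access can share a single dictionary entry per extension name.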
Token.check_flag
Check the value of a boolean flag.
Example
from spacy.attrs import IS_TITLE
doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert token.check_flag(IS_TITLE) == True
| Name | Type | Description |
| --- | --- | --- |
| flag_id | int | The attribute ID of the flag to check. |
| RETURNS | bool | Whether the flag is set. |
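One common way to store several boolean flags is as individual bits of a single integer, with the flag ID selecting the bit. The text above does not say how the flags are stored, so the sketch below is an assumption for illustration only, and the flag IDs in it are made up (the real IDs are imported from `spacy.attrs`).

```python
# Hypothetical flag IDs for illustration; real IDs come from spacy.attrs.
IS_ALPHA = 0
IS_TITLE = 1
IS_DIGIT = 2


def compute_flags(text):
    """Pack a few boolean string features into one integer, one bit per flag."""
    flags = 0
    if text.isalpha():
        flags |= 1 << IS_ALPHA
    if text.istitle():
        flags |= 1 << IS_TITLE
    if text.isdigit():
        flags |= 1 << IS_DIGIT
    return flags


def check_flag(flags, flag_id):
    """Return True if the bit selected by flag_id is set."""
    return bool(flags & (1 << flag_id))


give_flags = compute_flags("Give")
is_title = check_flag(give_flags, IS_TITLE)
is_digit = check_flag(give_flags, IS_DIGIT)
```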
Token.similarity
Compute a semantic similarity estimate. Defaults to cosine over vectors.
Example
apples, _, oranges = nlp(u"apples and oranges")
apples_oranges = apples.similarity(oranges)
oranges_apples = oranges.similarity(apples)
assert apples_oranges == oranges_apples
| Name | Type | Description |
| --- | --- | --- |
| other | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
| RETURNS | float | A scalar similarity score. Higher is more similar. |
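Cosine similarity over two vectors can be sketched in plain Python as follows. The vectors are made-up values rather than real word vectors, and returning `0.0` for a zero vector is an assumption of this sketch, not documented behavior.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the vectors of two tokens.
apples_vec = [1.0, 2.0, 0.5]
oranges_vec = [0.9, 1.8, 0.7]
apples_oranges = cosine_similarity(apples_vec, oranges_vec)
oranges_apples = cosine_similarity(oranges_vec, apples_vec)
```

The measure is symmetric, which is why the example above can assert that both directions give the same score.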
Token.nbor
Get a neighboring token.
Example
doc = nlp(u"Give it back! He pleaded.")
give_nbor = doc[0].nbor()
assert give_nbor.text == u"it"
| Name | Type | Description |
| --- | --- | --- |
| i | int | The relative position of the token to get. Defaults to `1`. |
| RETURNS | Token | The token at position `self.doc[self.i+i]`. |
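Relative lookup of this kind is index arithmetic on the parent sequence. The sketch below uses a plain list of strings in place of a `Doc`; the explicit range check and the `IndexError` are assumptions of the sketch, added so that a negative target does not wrap around through Python's negative indexing.

```python
def nbor(tokens, index, offset=1):
    """Return the item `offset` positions away from `index`."""
    target = index + offset
    # Reject out-of-range targets explicitly so that a negative target
    # does not silently wrap around to the end of the list.
    if target < 0 or target >= len(tokens):
        raise IndexError("no neighbor at relative position %d" % offset)
    return tokens[target]


tokens = ["Give", "it", "back", "!", "He", "pleaded", "."]
give_nbor = nbor(tokens, 0)        # next token after "Give"
back_prev = nbor(tokens, 2, -1)    # token before "back"
```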
Token.is_ancestor
Check whether this token is a parent, grandparent, etc. of another in the
dependency tree.
Example
doc = nlp(u"Give it back! He pleaded.")
give = doc[0]
it = doc[1]
assert give.is_ancestor(it)
| Name | Type | Description |
| --- | --- | --- |
| descendant | Token | Another token. |
| RETURNS | bool | Whether this token is the ancestor of the descendant. |
Token.ancestors
A sequence of the token's syntactic ancestors (parent, grandparent, etc.).
Example
doc = nlp(u"Give it back! He pleaded.")
it_ancestors = doc[1].ancestors
assert [t.text for t in it_ancestors] == [u"Give"]
he_ancestors = doc[4].ancestors
assert [t.text for t in he_ancestors] == [u"pleaded"]
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. |
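Both `is_ancestor` and `ancestors` amount to walking up the chain of heads. The sketch below represents a parse as a list of head indices, where the root points at itself. The `heads` values are an assumed parse of the example sentence, chosen to agree with the assertions above, and are not the output of a real model.

```python
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
# Assumed parse: heads[i] is the index of token i's head; a root is its own head.
heads = [0, 0, 0, 0, 5, 5, 5]


def ancestors(index):
    """Yield the indices of the head, the head's head, and so on up to the root."""
    while heads[index] != index:
        index = heads[index]
        yield index


def is_ancestor(index, descendant):
    """True if `index` lies on the head chain above `descendant`."""
    return index in ancestors(descendant)


it_ancestors = [words[i] for i in ancestors(1)]
he_ancestors = [words[i] for i in ancestors(4)]
```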
Token.conjuncts
A sequence of the tokens coordinated with this token. As the example shows, the result does not include the token itself.
Example
doc = nlp(u"I like apples and oranges")
apples_conjuncts = doc[2].conjuncts
assert [t.text for t in apples_conjuncts] == [u"oranges"]
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A coordinated token. |
Token.children
A sequence of the token's immediate syntactic children.
Example
doc = nlp(u"Give it back! He pleaded.")
give_children = doc[0].children
assert [t.text for t in give_children] == [u"it", u"back", u"!"]
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A child token such that `child.head == self`. |
Token.lefts
The leftward immediate children of the word, in the syntactic dependency parse.
Example
doc = nlp(u"I like New York in Autumn.")
lefts = [t.text for t in doc[3].lefts]
assert lefts == [u'New']
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A left-child of the token. |
Token.rights
The rightward immediate children of the word, in the syntactic dependency parse.
Example
doc = nlp(u"I like New York in Autumn.")
rights = [t.text for t in doc[3].rights]
assert rights == [u"in"]
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A right-child of the token. |
Token.n_lefts
The number of leftward immediate children of the word, in the syntactic
dependency parse.
Example
doc = nlp(u"I like New York in Autumn.")
assert doc[3].n_lefts == 1
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of left-child tokens. |
Token.n_rights
The number of rightward immediate children of the word, in the syntactic
dependency parse.
Example
doc = nlp(u"I like New York in Autumn.")
assert doc[3].n_rights == 1
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of right-child tokens. |
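The child-related properties above (`children`, `lefts`, `rights`, `n_lefts`, `n_rights`) can all be derived from the head of each token. The following plain-Python sketch uses an assumed list of head indices for the example sentence; the parse is illustrative and not produced by a real model.

```python
words = ["I", "like", "New", "York", "in", "Autumn", "."]
# Assumed parse: heads[i] is the index of token i's head; a root is its own head.
heads = [1, 1, 3, 1, 3, 4, 1]


def children(index):
    """Indices of the tokens whose head is `index`, in document order."""
    return [i for i, head in enumerate(heads) if head == index and i != index]


def lefts(index):
    """Immediate children that appear before the token."""
    return [i for i in children(index) if i < index]


def rights(index):
    """Immediate children that appear after the token."""
    return [i for i in children(index) if i > index]


york_lefts = [words[i] for i in lefts(3)]
york_rights = [words[i] for i in rights(3)]
n_lefts, n_rights = len(lefts(3)), len(rights(3))
```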
Token.subtree
A sequence containing the token and all the token's syntactic descendants.
Example
doc = nlp(u"Give it back! He pleaded.")
give_subtree = doc[0].subtree
assert [t.text for t in give_subtree] == [u"Give", u"it", u"back", u"!"]
| Name | Type | Description |
| --- | --- | --- |
| YIELDS | Token | A descendant token such that `self.is_ancestor(token)` or `token == self`. |
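The membership rule in the table, `self.is_ancestor(token)` or `token == self`, translates directly into a filter over the document. The sketch below again uses an assumed list of head indices rather than a real parse. The first and last positions of the result correspond to what the `left_edge` and `right_edge` attributes describe.

```python
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
# Assumed parse: heads[i] is the index of token i's head; a root is its own head.
heads = [0, 0, 0, 0, 5, 5, 5]


def is_ancestor(index, descendant):
    """True if `index` is reached by following heads upward from `descendant`."""
    current = descendant
    while heads[current] != current:
        current = heads[current]
        if current == index:
            return True
    return False


def subtree(index):
    """The token itself plus all of its descendants, in document order."""
    return [i for i in range(len(heads)) if i == index or is_ancestor(index, i)]


give_subtree = [words[i] for i in subtree(0)]
left_edge, right_edge = min(subtree(5)), max(subtree(5))
```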
Token.is_sent_start
A boolean value indicating whether the token starts a sentence. `None` if
unknown. Defaults to `True` for the first token in the `doc`.
Example
doc = nlp(u"Give it back! He pleaded.")
assert doc[4].is_sent_start
assert not doc[5].is_sent_start
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | bool | Whether the token starts a sentence. |
As of spaCy v2.0, the `Token.sent_start` property is deprecated and has been
replaced with `Token.is_sent_start`, which returns a boolean value instead of a
misleading `0` for `False` and `1` for `True`. It also now returns `None` if the
answer is unknown, and fixes a quirk in the old logic that would always set the
property to `0` for the first word of the document.
- assert doc[4].sent_start == 1
+ assert doc[4].is_sent_start == True
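Because the value can be `True`, `False` or `None`, code that consumes it should test for `None` explicitly instead of relying on truthiness, since `None` and `False` are both falsy. The sketch below groups words into sentences from a list of such values. The `split_sentences` helper and the flag values are illustrative assumptions, not part of spaCy.

```python
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
# Assumed sentence-start values, one per word.
sent_starts = [True, False, False, False, True, False, False]


def split_sentences(words, sent_starts):
    """Group words into sentences; refuse to guess when a boundary is unknown."""
    sentences = []
    for word, starts in zip(words, sent_starts):
        if starts is None:
            # None means "unknown", which is different from False.
            raise ValueError("sentence boundaries are unknown")
        if starts or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


sentences = split_sentences(words, sent_starts)
```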
Token.has_vector
A boolean value indicating whether a word vector is associated with the token.
Example
doc = nlp(u"I like apples")
apples = doc[2]
assert apples.has_vector
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | bool | Whether the token has vector data attached. |
Token.vector
A real-valued meaning representation.
Example
doc = nlp(u"I like apples")
apples = doc[2]
assert apples.vector.dtype == "float32"
assert apples.vector.shape == (300,)
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the token's semantics. |
Token.vector_norm
The L2 norm of the token's vector representation.
Example
doc = nlp(u"I like apples and pasta")
apples = doc[2]
pasta = doc[4]
apples.vector_norm # 6.89589786529541
pasta.vector_norm # 7.759851932525635
assert apples.vector_norm != pasta.vector_norm
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | float | The L2 norm of the vector representation. |
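The L2 norm is the square root of the sum of squared components. A minimal plain-Python sketch, using a made-up vector instead of a real word vector, is shown below together with the usual use of the norm: scaling a vector to unit length. Leaving a zero vector unchanged is an assumption of the sketch.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = l2_norm(vector)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


example_norm = l2_norm([3.0, 4.0])
unit = normalize([3.0, 4.0])
```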
Attributes
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The parent document. |
| sent (v2.0.12) | Span | The sentence span that this token is a part of. |
| text | unicode | Verbatim text content. |
| text_with_ws | unicode | Text content, with trailing space character if present. |
| whitespace_ | unicode | Trailing space character if present. |
| orth | int | ID of the verbatim text content. |
| orth_ | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
| vocab | Vocab | The vocab object of the parent `Doc`. |
| head | Token | The syntactic parent, or "governor", of this token. |
| left_edge | Token | The leftmost token of this token's syntactic descendants. |
| right_edge | Token | The rightmost token of this token's syntactic descendants. |
| i | int | The index of the token within the parent document. |
| ent_type | int | Named entity type. |
| ent_type_ | unicode | Named entity type. |
| ent_iob | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
| ent_iob_ | unicode | IOB code of named entity tag. `"B"` means the token begins an entity, `"I"` means it is inside an entity, `"O"` means it is outside an entity, and `""` means no entity tag is set. |
| ent_id | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially useful for coreference resolution. |
| ent_id_ | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially useful for coreference resolution. |
| lemma | int | Base form of the token, with no inflectional suffixes. |
| lemma_ | unicode | Base form of the token, with no inflectional suffixes. |
| norm | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's tokenizer exceptions or norm exceptions. |
| norm_ | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's tokenizer exceptions or norm exceptions. |
| lower | int | Lowercase form of the token. |
| lower_ | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| shape | int | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| shape_ | unicode | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| prefix | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| prefix_ | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| suffix | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
| suffix_ | unicode | Length-N substring from the end of the token. Defaults to `N=3`. |
| is_alpha | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
| is_ascii | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
| is_digit | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
| is_lower | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
| is_upper | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
| is_title | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
| is_punct | bool | Is the token punctuation? |
| is_left_punct | bool | Is the token a left punctuation mark, e.g. `(`? |
| is_right_punct | bool | Is the token a right punctuation mark, e.g. `)`? |
| is_space | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
| is_bracket | bool | Is the token a bracket? |
| is_quote | bool | Is the token a quotation mark? |
| is_currency (v2.0.8) | bool | Is the token a currency symbol? |
| like_url | bool | Does the token resemble a URL? |
| like_num | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
| like_email | bool | Does the token resemble an email address? |
| is_oov | bool | Is the token out-of-vocabulary? |
| is_stop | bool | Is the token part of a "stop list"? |
| pos | int | Coarse-grained part-of-speech. |
| pos_ | unicode | Coarse-grained part-of-speech. |
| tag | int | Fine-grained part-of-speech. |
| tag_ | unicode | Fine-grained part-of-speech. |
| dep | int | Syntactic dependency relation. |
| dep_ | unicode | Syntactic dependency relation. |
| lang | int | Language of the parent document's vocabulary. |
| lang_ | unicode | Language of the parent document's vocabulary. |
| prob | float | Smoothed log probability estimate of the token's type. |
| idx | int | The character offset of the token within the parent document. |
| sentiment | float | A scalar value indicating the positivity or negativity of the token. |
| lex_id | int | Sequential ID of the token's lexical type. |
| rank | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| cluster | int | Brown cluster ID. |
| `_` | Underscore | User space for adding custom attribute extensions. |
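Several of the string attributes in the table are described as equivalent to plain Python string operations. The sketch below collects those equivalences in one hypothetical helper, `string_features`, and adds a simplified `simple_shape` function. The shape function is an assumption for illustration: it maps each character independently and may differ from the library's actual transform, for example on long tokens.

```python
def string_features(text):
    """Plain-Python equivalents of the string attributes listed in the table."""
    return {
        "lower_": text.lower(),
        "prefix_": text[:1],     # length-N substring from the start, N=1
        "suffix_": text[-3:],    # length-N substring from the end, N=3
        "is_alpha": text.isalpha(),
        "is_ascii": all(ord(c) < 128 for c in text),
        "is_digit": text.isdigit(),
        "is_lower": text.islower(),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
        "is_space": text.isspace(),
    }


def simple_shape(text):
    """Simplified shape: uppercase -> X, lowercase -> x, digit -> d, rest kept."""
    shape = []
    for char in text:
        if char.isalpha():
            shape.append("X" if char.isupper() else "x")
        elif char.isdigit():
            shape.append("d")
        else:
            shape.append(char)
    return "".join(shape)


features = string_features("Apple")
give_shape = simple_shape("Give")
ten_shape = simple_shape("10")
```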