spaCy/token.md at 88059664609c332849f674ba4eb4c8554c3c115c

mirror of https://github.com/explosion/spaCy.git synced 2025-11-04 09:57:26 +03:00

Matthew Honnibal 4a3371acd5

Make doc[0].is_sent_start == True (closes #2869 ) (#3340 )

* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults True.

2019-02-27 11:17:17 +01:00

32 KiB

title	teaser	tag	source
Token	An individual token — i.e. a word, punctuation symbol, whitespace, etc.	class	spacy/tokens/token.pyx

Token.__init__

Construct a Token object.

Example

doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert token.text == u"Give"

Name	Type	Description
`vocab`	`Vocab`	A storage container for lexical types.
`doc`	`Doc`	The parent document.
`offset`	int	The index of the token within the document.
RETURNS	`Token`	The newly constructed object.

Token.__len__

The number of Unicode characters in the token, i.e. len(token.text).

Example

doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert len(token) == 4

Name	Type	Description
RETURNS	int	The number of unicode characters in the token.

Token.set_extension

Define a custom attribute on the Token which becomes available via Token._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Token
fruit_getter = lambda token: token.text in (u"apple", u"pear", u"banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp(u"I have an apple")
assert doc[3]._.is_fruit

Name	Type	Description
`name`	unicode	Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`.
`default`	-	Optional default value of the attribute if no getter or method is defined.
`method`	callable	Set a custom method on the object, for example `token._.compare(other_token)`.
`getter`	callable	Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute.
`setter`	callable	Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute.

Token.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
extension = Token.get_extension("is_fruit")
assert extension == (False, None, None, None)

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the extension.

Token.has_extension

Check whether an extension has been registered on the Token class.

Example

from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
assert Token.has_extension("is_fruit")

Name	Type	Description
`name`	unicode	Name of the extension to check.
RETURNS	bool	Whether the extension has been registered.

Token.remove_extension {#remove_extension tag="classmethod" new="2.0.11"}

Remove a previously registered extension.

Example

from spacy.tokens import Token
Token.set_extension("is_fruit", default=False)
removed = Token.remove_extension("is_fruit")
assert not Token.has_extension("is_fruit")

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the removed extension.

Token.check_flag

Check the value of a boolean flag.

Example

from spacy.attrs import IS_TITLE
doc = nlp(u"Give it back! He pleaded.")
token = doc[0]
assert token.check_flag(IS_TITLE) == True

Name	Type	Description
`flag_id`	int	The attribute ID of the flag to check.
RETURNS	bool	Whether the flag is set.

Token.similarity

Compute a semantic similarity estimate. Defaults to cosine similarity over the word vectors.

Example

apples, _, oranges = nlp(u"apples and oranges")
apples_oranges = apples.similarity(oranges)
oranges_apples = oranges.similarity(apples)
assert apples_oranges == oranges_apples

Name	Type	Description
`other`	-	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects.
RETURNS	float	A scalar similarity score. Higher is more similar.
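
The default score described above is the cosine of the angle between two vectors. The following is a minimal pure-Python sketch of that computation, not spaCy's implementation: the `cosine_similarity` helper, the zero-vector handling, and the three-dimensional sample vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for real word vectors.
apples_vec = [1.0, 2.0, 0.5]
oranges_vec = [0.8, 1.9, 0.7]

score = cosine_similarity(apples_vec, oranges_vec)
reverse_score = cosine_similarity(oranges_vec, apples_vec)
```

The measure is symmetric, which is why the documented example can assert that both directions give the same score.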

Token.nbor

Get a neighboring token.

Example

doc = nlp(u"Give it back! He pleaded.")
give_nbor = doc[0].nbor()
assert give_nbor.text == u"it"

Name	Type	Description
`i`	int	The relative position of the token to get. Defaults to `1`.
RETURNS	`Token`	The token at position `self.doc[self.i+i]`.

Token.is_ancestor

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Example

doc = nlp(u"Give it back! He pleaded.")
give = doc[0]
it = doc[1]
assert give.is_ancestor(it)

Name	Type	Description
`descendant`	`Token`	Another token.
RETURNS	bool	Whether this token is the ancestor of the descendant.

Token.ancestors

A sequence of this token's syntactic ancestors, i.e. its parent, grandparent, etc. in the dependency tree.

Example

doc = nlp(u"Give it back! He pleaded.")
it_ancestors = doc[1].ancestors
assert [t.text for t in it_ancestors] == [u"Give"]
he_ancestors = doc[4].ancestors
assert [t.text for t in he_ancestors] == [u"pleaded"]

Name	Type	Description
YIELDS	`Token`	A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`.
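
To make the ancestor relation concrete, here is a toy model that stores a parse as a list of head indices and walks upwards from a token. It is only a sketch: the `heads` list is hand-written to agree with the asserts in the examples above (it is not parser output), and the `ancestors` and `is_ancestor` functions are illustrative helpers, not spaCy's code.

```python
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
# heads[i] is the index of token i's syntactic parent.
# Assumption: a root token points at itself.
heads = [0, 0, 0, 0, 5, 5, 5]


def ancestors(i):
    """Yield the indices of the parent, grandparent, etc. of token i."""
    while heads[i] != i:
        i = heads[i]
        yield i


def is_ancestor(i, j):
    """Return True if token i is an ancestor of token j."""
    return i in ancestors(j)


it_ancestors = [words[k] for k in ancestors(1)]
he_ancestors = [words[k] for k in ancestors(4)]
```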

Token.conjuncts

A sequence of tokens coordinated with this token. As the example below shows, the token itself is not included.

Example

doc = nlp(u"I like apples and oranges")
apples_conjuncts = doc[2].conjuncts
assert [t.text for t in apples_conjuncts] == [u"oranges"]

Name	Type	Description
YIELDS	`Token`	A coordinated token.

Token.children

A sequence of the token's immediate syntactic children.

Example

doc = nlp(u"Give it back! He pleaded.")
give_children = doc[0].children
assert [t.text for t in give_children] == [u"it", u"back", u"!"]

Name	Type	Description
YIELDS	`Token`	A child token such that `child.head==self`.

Token.lefts

The leftward immediate children of the word, in the syntactic dependency parse.

Example

doc = nlp(u"I like New York in Autumn.")
lefts = [t.text for t in doc[3].lefts]
assert lefts == [u'New']

Name	Type	Description
YIELDS	`Token`	A left-child of the token.

Token.rights

The rightward immediate children of the word, in the syntactic dependency parse.

Example

doc = nlp(u"I like New York in Autumn.")
rights = [t.text for t in doc[3].rights]
assert rights == [u"in"]

Name	Type	Description
YIELDS	`Token`	A right-child of the token.

Token.n_lefts

The number of leftward immediate children of the word, in the syntactic dependency parse.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[3].n_lefts == 1

Name	Type	Description
RETURNS	int	The number of left-child tokens.

Token.n_rights

The number of rightward immediate children of the word, in the syntactic dependency parse.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[3].n_rights == 1

Name	Type	Description
RETURNS	int	The number of right-child tokens.

Token.subtree

A sequence containing the token and all the token's syntactic descendants.

Example

doc = nlp(u"Give it back! He pleaded.")
give_subtree = doc[0].subtree
assert [t.text for t in give_subtree] == [u"Give", u"it", u"back", u"!"]

Name	Type	Description
YIELDS	`Token`	A descendant token such that `self.is_ancestor(token)` or `token == self`.
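
The `children`, `lefts`, `rights` and `subtree` relations documented above can all be derived from each token's head. The sketch below does this on a toy list of head indices. The `heads` values are an assumed parse chosen to match the asserts in the `lefts` and `rights` examples, and the helper functions are hypothetical illustrations rather than spaCy's implementation.

```python
words = ["I", "like", "New", "York", "in", "Autumn", "."]
# heads[i] is the index of token i's syntactic parent.
# Assumption: the root ("like") points at itself.
heads = [1, 1, 3, 1, 3, 4, 1]


def children(i):
    """Indices of the immediate children of token i, in document order."""
    return [j for j, head in enumerate(heads) if head == i and j != i]


def lefts(i):
    """Immediate children that precede token i."""
    return [j for j in children(i) if j < i]


def rights(i):
    """Immediate children that follow token i."""
    return [j for j in children(i) if j > i]


def subtree(i):
    """Token i plus all of its descendants, in document order."""
    result = [i]
    for child in children(i):
        result.extend(subtree(child))
    return sorted(result)


york_lefts = [words[j] for j in lefts(3)]
york_rights = [words[j] for j in rights(3)]
york_subtree = [words[j] for j in subtree(3)]
```

In this model `n_lefts` and `n_rights` are simply `len(lefts(i))` and `len(rights(i))`.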

Token.is_sent_start

A boolean value indicating whether the token starts a sentence. None if unknown. Defaults to True for the first token in the doc.

Example

doc = nlp(u"Give it back! He pleaded.")
assert doc[4].is_sent_start
assert not doc[5].is_sent_start

Name	Type	Description
RETURNS	bool	Whether the token starts a sentence.

As of spaCy v2.0, the Token.sent_start property is deprecated and has been replaced with Token.is_sent_start, which returns a boolean value instead of a misleading 0 for False and 1 for True. It also now returns None if the answer is unknown, and fixes a quirk in the old logic that would always set the property to 0 for the first word of the document.

- assert doc[4].sent_start == 1
+ assert doc[4].is_sent_start == True
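
Because the value can be `True`, `False` or `None`, a plain truthiness check cannot tell "not a sentence start" apart from "unknown". The sketch below shows one way calling code might handle the three states. It works on plain `(text, is_sent_start)` pairs rather than real `Token` objects; the `split_sentences` helper, the choice to treat `None` like `False`, and the `None` on the final token are assumptions for illustration.

```python
def split_sentences(tokens):
    """Group (text, is_sent_start) pairs into lists of token texts.

    Only an explicit True opens a new sentence. False and None (unknown)
    both keep the token in the current sentence.
    """
    sentences = []
    for text, is_sent_start in tokens:
        if is_sent_start is True or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences


tokens = [
    ("Give", True), ("it", False), ("back", False), ("!", False),
    ("He", True), ("pleaded", False), (".", None),  # None: boundary unknown
]
sentences = split_sentences(tokens)
unknown = [text for text, flag in tokens if flag is None]
```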

Token.has_vector

A boolean value indicating whether a word vector is associated with the token.

Example

doc = nlp(u"I like apples")
apples = doc[2]
assert apples.has_vector

Name	Type	Description
RETURNS	bool	Whether the token has vector data attached.

Token.vector

A real-valued meaning representation.

Example

doc = nlp(u"I like apples")
apples = doc[2]
assert apples.vector.dtype == "float32"
assert apples.vector.shape == (300,)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=1, dtype='float32']`	A 1D numpy array representing the token's semantics.

Token.vector_norm

The L2 norm of the token's vector representation.

Example

doc = nlp(u"I like apples and pasta")
apples = doc[2]
pasta = doc[4]
apples.vector_norm  # 6.89589786529541
pasta.vector_norm  # 7.759851932525635
assert apples.vector_norm != pasta.vector_norm

Name	Type	Description
RETURNS	float	The L2 norm of the vector representation.
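
The L2 norm is the square root of the sum of the squared vector components. As a small worked sketch in plain Python (the `l2_norm` helper and the two-dimensional vector are hypothetical and stand in for a real token vector):

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


vector = [3.0, 4.0]
norm = l2_norm(vector)  # sqrt(9 + 16) = 5.0
```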

Attributes

Name	Type	Description
`doc`	`Doc`	The parent document.
`sent` 2.0.12	`Span`	The sentence span that this token is a part of.
`text`	unicode	Verbatim text content.
`text_with_ws`	unicode	Text content, with trailing space character if present.
`whitespace_`	unicode	Trailing space character if present.
`orth`	int	ID of the verbatim text content.
`orth_`	unicode	Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes.
`vocab`	`Vocab`	The vocab object of the parent `Doc`.
`head`	`Token`	The syntactic parent, or "governor", of this token.
`left_edge`	`Token`	The leftmost token of this token's syntactic descendants.
`right_edge`	`Token`	The rightmost token of this token's syntactic descendants.
`i`	int	The index of the token within the parent document.
`ent_type`	int	Named entity type.
`ent_type_`	unicode	Named entity type.
`ent_iob`	int	IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set.
`ent_iob_`	unicode	IOB code of named entity tag. `"B"` means the token begins an entity, `"I"` means it is inside an entity, `"O"` means it is outside an entity, and `""` means no entity tag is set.
`ent_id`	int	ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
`ent_id_`	unicode	ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
`lemma`	int	Base form of the token, with no inflectional suffixes.
`lemma_`	unicode	Base form of the token, with no inflectional suffixes.
`norm`	int	The token's norm, i.e. a normalized form of the token text. Usually set in the language's tokenizer exceptions or norm exceptions.
`norm_`	unicode	The token's norm, i.e. a normalized form of the token text. Usually set in the language's tokenizer exceptions or norm exceptions.
`lower`	int	Lowercase form of the token.
`lower_`	unicode	Lowercase form of the token text. Equivalent to `Token.text.lower()`.
`shape`	int	Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".
`shape_`	unicode	Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".
`prefix`	int	Hash value of a length-N substring from the start of the token. Defaults to `N=1`.
`prefix_`	unicode	A length-N substring from the start of the token. Defaults to `N=1`.
`suffix`	int	Hash value of a length-N substring from the end of the token. Defaults to `N=3`.
`suffix_`	unicode	Length-N substring from the end of the token. Defaults to `N=3`.
`is_alpha`	bool	Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`.
`is_ascii`	bool	Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`.
`is_digit`	bool	Does the token consist of digits? Equivalent to `token.text.isdigit()`.
`is_lower`	bool	Is the token in lowercase? Equivalent to `token.text.islower()`.
`is_upper`	bool	Is the token in uppercase? Equivalent to `token.text.isupper()`.
`is_title`	bool	Is the token in titlecase? Equivalent to `token.text.istitle()`.
`is_punct`	bool	Is the token punctuation?
`is_left_punct`	bool	Is the token a left punctuation mark, e.g. `(`?
`is_right_punct`	bool	Is the token a right punctuation mark, e.g. `)`?
`is_space`	bool	Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`.
`is_bracket`	bool	Is the token a bracket?
`is_quote`	bool	Is the token a quotation mark?
`is_currency` 2.0.8	bool	Is the token a currency symbol?
`like_url`	bool	Does the token resemble a URL?
`like_num`	bool	Does the token represent a number? e.g. "10.9", "10", "ten", etc.
`like_email`	bool	Does the token resemble an email address?
`is_oov`	bool	Is the token out-of-vocabulary?
`is_stop`	bool	Is the token part of a "stop list"?
`pos`	int	Coarse-grained part-of-speech.
`pos_`	unicode	Coarse-grained part-of-speech.
`tag`	int	Fine-grained part-of-speech.
`tag_`	unicode	Fine-grained part-of-speech.
`dep`	int	Syntactic dependency relation.
`dep_`	unicode	Syntactic dependency relation.
`lang`	int	Language of the parent document's vocabulary.
`lang_`	unicode	Language of the parent document's vocabulary.
`prob`	float	Smoothed log probability estimate of token's type.
`idx`	int	The character offset of the token within the parent document.
`sentiment`	float	A scalar value indicating the positivity or negativity of the token.
`lex_id`	int	Sequential ID of the token's lexical type.
`rank`	int	Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors.
`cluster`	int	Brown cluster ID.
`_`	`Underscore`	User space for adding custom attribute extensions.
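
Several of the string-valued attributes in the table can be approximated with plain string operations. The sketch below illustrates `shape_`, `prefix_`, `suffix_` and `is_ascii` on ordinary Python strings. The helper names are hypothetical, and capping each run of identical shape characters at four is an assumption about how long runs are abbreviated; treat the output as an approximation of what the library returns.

```python
def word_shape(text):
    """Simplified orthographic shape: X = upper, x = lower, d = digit."""
    shape = []
    run_char, run_len = "", 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        if run_len <= 4:  # assumption: runs are capped at four characters
            shape.append(mapped)
    return "".join(shape)


def prefix(text, n=1):
    """Length-n substring from the start of the text."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the text."""
    return text[-n:]


def is_ascii(text):
    """True if every character is in the ASCII range."""
    return all(ord(c) < 128 for c in text)


give_shape = word_shape("Give")
number_shape = word_shape("10")
```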
