mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-10 09:16:31 +03:00
296446a1c8
<!--- Provide a general summary of your changes in the title. --> ## Description * tidy up and adjust Cython code to code style * improve docstrings and make calling `help()` nicer * add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects * fix various typos and inconsistencies in docs ### Types of change enhancement, docs ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
476 lines
32 KiB
Markdown
476 lines
32 KiB
Markdown
---
|
|
title: Token
|
|
teaser: An individual token — i.e. a word, punctuation symbol, whitespace, etc.
|
|
tag: class
|
|
source: spacy/tokens/token.pyx
|
|
---
|
|
|
|
## Token.\_\_init\_\_ {#init tag="method"}
|
|
|
|
Construct a `Token` object.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> token = doc[0]
|
|
> assert token.text == u"Give"
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | ------------------------------------------- |
|
|
| `vocab` | `Vocab` | A storage container for lexical types. |
|
|
| `doc` | `Doc` | The parent document. |
|
|
| `offset` | int | The index of the token within the document. |
|
|
| **RETURNS** | `Token` | The newly constructed object. |
|
|
|
|
## Token.\_\_len\_\_ {#len tag="method"}
|
|
|
|
The number of unicode characters in the token, i.e. `token.text`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> token = doc[0]
|
|
> assert len(token) == 4
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | ---------------------------------------------- |
|
|
| **RETURNS** | int | The number of unicode characters in the token. |
|
|
|
|
## Token.set_extension {#set_extension tag="classmethod" new="2"}
|
|
|
|
Define a custom attribute on the `Token` which becomes available via `Token._`.
|
|
For details, see the documentation on
|
|
[custom attributes](/usage/processing-pipelines#custom-components-attributes).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.tokens import Token
|
|
> fruit_getter = lambda token: token.text in (u"apple", u"pear", u"banana")
|
|
> Token.set_extension("is_fruit", getter=fruit_getter)
|
|
> doc = nlp(u"I have an apple")
|
|
> assert doc[3]._.is_fruit
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. |
|
|
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
|
| `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. |
|
|
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
|
| `setter` | callable | Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute. |
|
|
|
|
## Token.get_extension {#get_extension tag="classmethod" new="2"}
|
|
|
|
Look up a previously registered extension by name. Returns a 4-tuple
|
|
`(default, method, getter, setter)` if the extension is registered. Raises a
|
|
`KeyError` otherwise.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.tokens import Token
|
|
> Token.set_extension("is_fruit", default=False)
|
|
> extension = Token.get_extension("is_fruit")
|
|
> assert extension == (False, None, None, None)
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | ------------------------------------------------------------- |
|
|
| `name` | unicode | Name of the extension. |
|
|
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
|
|
|
## Token.has_extension {#has_extension tag="classmethod" new="2"}
|
|
|
|
Check whether an extension has been registered on the `Token` class.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.tokens import Token
|
|
> Token.set_extension("is_fruit", default=False)
|
|
> assert Token.has_extension("is_fruit")
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | ------------------------------------------ |
|
|
| `name` | unicode | Name of the extension to check. |
|
|
| **RETURNS** | bool | Whether the extension has been registered. |
|
|
|
|
## Token.remove_extension {#remove_extension tag="classmethod" new=""2.0.11""}
|
|
|
|
Remove a previously registered extension.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.tokens import Token
|
|
> Token.set_extension("is_fruit", default=False)
|
|
> removed = Token.remove_extension("is_fruit")
|
|
> assert not Token.has_extension("is_fruit")
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | --------------------------------------------------------------------- |
|
|
| `name` | unicode | Name of the extension. |
|
|
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
|
|
|
## Token.check_flag {#check_flag tag="method"}
|
|
|
|
Check the value of a boolean flag.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.attrs import IS_TITLE
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> token = doc[0]
|
|
> assert token.check_flag(IS_TITLE) == True
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | -------------------------------------- |
|
|
| `flag_id` | int | The attribute ID of the flag to check. |
|
|
| **RETURNS** | bool | Whether the flag is set. |
|
|
|
|
## Token.similarity {#similarity tag="method" model="vectors"}
|
|
|
|
Compute a semantic similarity estimate. Defaults to cosine over vectors.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> apples, _, oranges = nlp(u"apples and oranges")
|
|
> apples_oranges = apples.similarity(oranges)
|
|
> oranges_apples = oranges.similarity(apples)
|
|
> assert apples_oranges == oranges_apples
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ----- | -------------------------------------------------------------------------------------------- |
|
|
| other | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
|
|
| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
|
|
|
|
## Token.nbor {#nbor tag="method"}
|
|
|
|
Get a neighboring token.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> give_nbor = doc[0].nbor()
|
|
> assert give_nbor.text == u"it"
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | ----------------------------------------------------------- |
|
|
| `i` | int | The relative position of the token to get. Defaults to `1`. |
|
|
| **RETURNS** | `Token` | The token at position `self.doc[self.i+i]`. |
|
|
|
|
## Token.is_ancestor {#is_ancestor tag="method" model="parser"}
|
|
|
|
Check whether this token is a parent, grandparent, etc. of another in the
|
|
dependency tree.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> give = doc[0]
|
|
> it = doc[1]
|
|
> assert give.is_ancestor(it)
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------- | ----------------------------------------------------- |
|
|
| descendant | `Token` | Another token. |
|
|
| **RETURNS** | bool | Whether this token is the ancestor of the descendant. |
|
|
|
|
## Token.ancestors {#ancestors tag="property" model="parser"}
|
|
|
|
The rightmost token of this token's syntactic descendants.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> it_ancestors = doc[1].ancestors
|
|
> assert [t.text for t in it_ancestors] == [u"Give"]
|
|
> he_ancestors = doc[4].ancestors
|
|
> assert [t.text for t in he_ancestors] == [u"pleaded"]
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | --------------------------------------------------------------------- |
|
|
| **YIELDS** | `Token` | A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. |
|
|
|
|
## Token.conjuncts {#conjuncts tag="property" model="parser"}
|
|
|
|
A sequence of coordinated tokens, including the token itself.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like apples and oranges")
|
|
> apples_conjuncts = doc[2].conjuncts
|
|
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | -------------------- |
|
|
| **YIELDS** | `Token` | A coordinated token. |
|
|
|
|
## Token.children {#children tag="property" model="parser"}
|
|
|
|
A sequence of the token's immediate syntactic children.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> give_children = doc[0].children
|
|
> assert [t.text for t in give_children] == [u"it", u"back", u"!"]
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | ------------------------------------------- |
|
|
| **YIELDS** | `Token` | A child token such that `child.head==self`. |
|
|
|
|
## Token.lefts {#lefts tag="property" model="parser"}
|
|
|
|
The leftward immediate children of the word, in the syntactic dependency parse.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like New York in Autumn.")
|
|
> lefts = [t.text for t in doc[3].lefts]
|
|
> assert lefts == [u'New']
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | -------------------------- |
|
|
| **YIELDS** | `Token` | A left-child of the token. |
|
|
|
|
## Token.rights {#rights tag="property" model="parser"}
|
|
|
|
The rightward immediate children of the word, in the syntactic dependency parse.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like New York in Autumn.")
|
|
> rights = [t.text for t in doc[3].rights]
|
|
> assert rights == [u"in"]
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | --------------------------- |
|
|
| **YIELDS** | `Token` | A right-child of the token. |
|
|
|
|
## Token.n_lefts {#n_lefts tag="property" model="parser"}
|
|
|
|
The number of leftward immediate children of the word, in the syntactic
|
|
dependency parse.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like New York in Autumn.")
|
|
> assert doc[3].n_lefts == 1
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | -------------------------------- |
|
|
| **RETURNS** | int | The number of left-child tokens. |
|
|
|
|
## Token.n_rights {#n_rights tag="property" model="parser"}
|
|
|
|
The number of rightward immediate children of the word, in the syntactic
|
|
dependency parse.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like New York in Autumn.")
|
|
> assert doc[3].n_rights == 1
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | --------------------------------- |
|
|
| **RETURNS** | int | The number of right-child tokens. |
|
|
|
|
## Token.subtree {#subtree tag="property" model="parser"}
|
|
|
|
A sequence containing the token and all the token's syntactic descendants.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> give_subtree = doc[0].subtree
|
|
> assert [t.text for t in give_subtree] == [u"Give", u"it", u"back", u"!"]
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ---------- | ------- | -------------------------------------------------------------------------- |
|
|
| **YIELDS** | `Token` | A descendant token such that `self.is_ancestor(token)` or `token == self`. |
|
|
|
|
## Token.is_sent_start {#is_sent_start tag="property" new="2"}
|
|
|
|
A boolean value indicating whether the token starts a sentence. `None` if
|
|
unknown. Defaults to `True` for the first token in the `Doc`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"Give it back! He pleaded.")
|
|
> assert doc[4].is_sent_start
|
|
> assert not doc[5].is_sent_start
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | ------------------------------------ |
|
|
| **RETURNS** | bool | Whether the token starts a sentence. |
|
|
|
|
<Infobox title="Changed in v2.0" variant="warning">
|
|
|
|
As of spaCy v2.0, the `Token.sent_start` property is deprecated and has been
|
|
replaced with `Token.is_sent_start`, which returns a boolean value instead of a
|
|
misleading `0` for `False` and `1` for `True`. It also now returns `None` if the
|
|
answer is unknown, and fixes a quirk in the old logic that would always set the
|
|
property to `0` for the first word of the document.
|
|
|
|
```diff
|
|
- assert doc[4].sent_start == 1
|
|
+ assert doc[4].is_sent_start == True
|
|
```
|
|
|
|
</Infobox>
|
|
|
|
## Token.has_vector {#has_vector tag="property" model="vectors"}
|
|
|
|
A boolean value indicating whether a word vector is associated with the token.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like apples")
|
|
> apples = doc[2]
|
|
> assert apples.has_vector
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---- | --------------------------------------------- |
|
|
| **RETURNS** | bool | Whether the token has a vector data attached. |
|
|
|
|
## Token.vector {#vector tag="property" model="vectors"}
|
|
|
|
A real-valued meaning representation.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like apples")
|
|
> apples = doc[2]
|
|
> assert apples.vector.dtype == "float32"
|
|
> assert apples.vector.shape == (300,)
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ---------------------------------------- | ---------------------------------------------------- |
|
|
| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the token's semantics. |
|
|
|
|
## Token.vector_norm {#vector_norm tag="property" model="vectors"}
|
|
|
|
The L2 norm of the token's vector representation.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc = nlp(u"I like apples and pasta")
|
|
> apples = doc[2]
|
|
> pasta = doc[4]
|
|
> apples.vector_norm # 6.89589786529541
|
|
> pasta.vector_norm # 7.759851932525635
|
|
> assert apples.vector_norm != pasta.vector_norm
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ----- | ----------------------------------------- |
|
|
| **RETURNS** | float | The L2 norm of the vector representation. |
|
|
|
|
## Attributes {#attributes}
|
|
|
|
| Name | Type | Description |
|
|
| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `doc` | `Doc` | The parent document. |
|
|
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
|
|
| `text` | unicode | Verbatim text content. |
|
|
| `text_with_ws` | unicode | Text content, with trailing space character if present. |
|
|
| `whitespace_` | unicode | Trailing space character if present. |
|
|
| `orth` | int | ID of the verbatim text content. |
|
|
| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
|
|
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
|
| `doc` | `Doc` | The parent document. |
|
|
| `head` | `Token` | The syntactic parent, or "governor", of this token. |
|
|
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
|
|
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
|
|
| `i` | int | The index of the token within the parent document. |
|
|
| `ent_type` | int | Named entity type. |
|
|
| `ent_type_` | unicode | Named entity type. |
|
|
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | |
|
|
| `ent_iob_` | unicode | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
|
|
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
|
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
|
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
|
|
| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. |
|
|
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
|
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
|
| `lower` | int | Lowercase form of the token. |
|
|
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
|
| `shape` | int | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". |
|
|
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". |
|
|
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
|
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
|
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
|
| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. |
|
|
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
|
|
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
|
|
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
|
|
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
|
|
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
|
|
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
|
|
| `is_punct` | bool | Is the token punctuation? |
|
|
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `(`? |
|
|
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `)`? |
|
|
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
|
|
| `is_bracket` | bool | Is the token a bracket? |
|
|
| `is_quote` | bool | Is the token a quotation mark? |
|
|
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
|
|
| `like_url` | bool | Does the token resemble a URL? |
|
|
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
|
|
| `like_email` | bool | Does the token resemble an email address? |
|
|
| `is_oov` | bool | Is the token out-of-vocabulary? |
|
|
| `is_stop` | bool | Is the token part of a "stop list"? |
|
|
| `pos` | int | Coarse-grained part-of-speech. |
|
|
| `pos_` | unicode | Coarse-grained part-of-speech. |
|
|
| `tag` | int | Fine-grained part-of-speech. |
|
|
| `tag_` | unicode | Fine-grained part-of-speech. |
|
|
| `dep` | int | Syntactic dependency relation. |
|
|
| `dep_` | unicode | Syntactic dependency relation. |
|
|
| `lang` | int | Language of the parent document's vocabulary. |
|
|
| `lang_` | unicode | Language of the parent document's vocabulary. |
|
|
| `prob` | float | Smoothed log probability estimate of token's type. |
|
|
| `idx` | int | The character offset of the token within the parent document. |
|
|
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
|
|
| `lex_id` | int | Sequential ID of the token's lexical type. |
|
|
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
|
| `cluster` | int | Brown cluster ID. |
|
|
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|