| title | teaser | source |
|---|---|---|
| Attributes | Token attributes | spacy/attrs.pyx |
Token attributes are specified using internal IDs in many places, including:

- `Matcher` patterns
- `Doc.to_array` and `Doc.from_array`
- `Doc.has_annotation`
- `MultiHashEmbed` Tok2Vec architecture `attrs`
```python
import spacy
from spacy.attrs import DEP

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# DEP always has the same internal value
assert DEP == 76

# "DEP" is automatically converted to DEP
assert DEP == nlp.vocab.strings["DEP"]
assert doc.has_annotation(DEP) == doc.has_annotation("DEP")

# look up IDs in spacy.attrs.IDS
from spacy.attrs import IDS
assert IDS["DEP"] == DEP
```
All methods automatically convert between the string version of an ID (`"DEP"`)
and the internal integer symbols (`DEP`). The internal IDs can be imported from
`spacy.attrs` or retrieved from the `StringStore`. A map
from string attribute names to internal attribute IDs is stored in
`spacy.attrs.IDS`.
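
As a minimal sketch of this conversion, the snippet below passes the same attributes to `Doc.to_array` once as strings and once as integer IDs and checks that the results agree. The `id_to_name` dictionary is a local helper built here for illustration only; it is not part of spaCy.

```python
import spacy
from spacy.attrs import IDS, LENGTH, ORTH

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# String names and integer IDs select the same columns
by_name = doc.to_array(["ORTH", "LENGTH"])
by_id = doc.to_array([ORTH, LENGTH])
assert (by_name == by_id).all()

# One row per token, one column per requested attribute
assert by_id.shape == (len(doc), 2)

# IDS maps names to IDs; a reverse map can be built when needed.
# id_to_name is a hypothetical local helper, not a spaCy API.
id_to_name = {attr_id: name for name, attr_id in IDS.items()}
assert IDS[id_to_name[ORTH]] == ORTH
```

Using the string names keeps calling code readable, while the imported symbols avoid a dictionary lookup and catch misspelled names at import time.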
The corresponding `Token` object attributes can be
accessed using the same names in lowercase, e.g. `token.orth` or `token.length`.
For attributes that represent string values, the internal integer ID is
accessed as `Token.attr`, e.g. `token.dep`, while the string value can be
retrieved by appending `_` as in `token.dep_`.
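
The relationship between the integer and string forms can be checked directly. The sketch below assumes a blank English pipeline, so annotation-dependent attributes such as `DEP` are unset; with a trained pipeline they would hold real labels.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")
token = doc[0]

# token.orth is the integer ID, token.orth_ is the string it stands for
assert isinstance(token.orth, int)
assert token.orth_ == "There"
assert nlp.vocab.strings[token.orth] == token.orth_

# The same pattern holds for LOWER
assert token.lower_ == "there"
assert nlp.vocab.strings[token.lower] == "there"

# A blank pipeline has no parser, so the dependency label is unset
assert token.dep == 0
assert token.dep_ == ""
```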
| Attribute | Description |
|---|---|
| `DEP` | The token's dependency label. |
| `ENT_ID` | The token's entity ID (`ent_id`). |
| `ENT_IOB` | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is `0`, `I` is `1`, `O` is `2`, and `B` is `3`. |
| `ENT_KB_ID` | The token's entity knowledge base ID. |
| `ENT_TYPE` | The token's entity label. |
| `IS_ALPHA` | Token text consists of alphabetic characters. |
| `IS_ASCII` | Token text consists of ASCII characters. |
| `IS_DIGIT` | Token text consists of digits. |
| `IS_LOWER` | Token text is in lowercase. |
| `IS_PUNCT` | Token is punctuation. |
| `IS_SPACE` | Token is whitespace. |
| `IS_STOP` | Token is a stop word. |
| `IS_TITLE` | Token text is in titlecase. |
| `IS_UPPER` | Token text is in uppercase. |
| `LEMMA` | The token's lemma. |
| `LENGTH` | The length of the token text. |
| `LIKE_EMAIL` | Token text resembles an email address. |
| `LIKE_NUM` | Token text resembles a number. |
| `LIKE_URL` | Token text resembles a URL. |
| `LOWER` | The lowercase form of the token text. |
| `MORPH` | The token's morphological analysis. |
| `NORM` | The normalized form of the token text. |
| `ORTH` | The exact verbatim text of a token. |
| `POS` | The token's universal part of speech (UPOS). |
| `SENT_START` | Token is start of sentence. |
| `SHAPE` | The token's shape. |
| `SPACY` | Token has a trailing space. |
| `TAG` | The token's fine-grained part of speech. |
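
To illustrate how a few of these attributes look in practice, the sketch below exports boolean flags and `LENGTH` with `Doc.to_array`, then sets one entity by hand to show the `ENT_IOB` integer codes. The example sentence and the `GPE` label are chosen for illustration, and the entity is assigned manually because a blank pipeline has no entity recognizer.

```python
import spacy
from spacy.attrs import IS_ALPHA, IS_PUNCT, LENGTH
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("New York is big.")

# Boolean flags are exported as 0/1, LENGTH as a character count
arr = doc.to_array([IS_ALPHA, IS_PUNCT, LENGTH])
assert arr[:, 0].tolist() == [1, 1, 1, 1, 0]
assert arr[:, 1].tolist() == [0, 0, 0, 0, 1]
assert arr[:, 2].tolist() == [3, 4, 2, 3, 1]

# Mark "New York" as an entity; all other tokens become "outside"
doc.set_ents([Span(doc, 0, 2, label="GPE")], default="outside")

# ENT_IOB uses fixed integer codes: I is 1, O is 2, B is 3
assert [t.ent_iob for t in doc] == [3, 1, 2, 2, 2]
assert [t.ent_iob_ for t in doc] == ["B", "I", "O", "O", "O"]
```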