Mirror of https://github.com/explosion/spaCy.git
title | teaser | source |
---|---|---|
Attributes | Token attributes | spacy/attrs.pyx |
Token attributes are specified using internal IDs in many places, including:

- `Matcher` patterns
- `Doc.to_array` and `Doc.from_array`
- `Doc.has_annotation`
- `MultiHashEmbed` Tok2Vec architecture `attrs`
```python
import spacy
from spacy.attrs import DEP

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# DEP always has the same internal value
assert DEP == 76

# "DEP" is automatically converted to DEP
assert DEP == nlp.vocab.strings["DEP"]
assert doc.has_annotation(DEP) == doc.has_annotation("DEP")

# look up IDs in spacy.attrs.IDS
from spacy.attrs import IDS
assert IDS["DEP"] == DEP
```
All methods automatically convert between the string version of an ID (`"DEP"`) and the internal integer symbol (`DEP`). The internal IDs can be imported from `spacy.attrs` or retrieved from the `StringStore`. A map from string attribute names to internal attribute IDs is stored in `spacy.attrs.IDS`.
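As an illustration of this automatic conversion, the sketch below exports the same columns from a `Doc` twice with `Doc.to_array`, once by integer ID and once by string name. It assumes spaCy v3 and a blank English pipeline; the choice of `ORTH`, `LENGTH` and `IS_ALPHA` is only for demonstration.

```python
import spacy
from spacy.attrs import ORTH, LENGTH, IS_ALPHA

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# Export the same columns twice: by integer ID and by string name.
by_id = doc.to_array([ORTH, LENGTH, IS_ALPHA])
by_name = doc.to_array(["ORTH", "LENGTH", "IS_ALPHA"])

# Both calls should yield identical arrays.
same_values = bool((by_id == by_name).all())

# One row per token, one column per requested attribute.
array_shape = by_id.shape
```

Passing names keeps calling code readable, while the imported symbols avoid a string lookup and fail at import time if a name is misspelled.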
The corresponding `Token` object attributes can be accessed using the same names in lowercase, e.g. `token.orth` or `token.length`. For attributes that represent string values, the internal integer ID is accessed as `Token.attr`, e.g. `token.dep`, while the string value can be retrieved by appending `_`, as in `token.dep_`.
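A minimal sketch of the two access forms, assuming spaCy v3 and a blank English pipeline. Since a blank pipeline sets no dependency labels, the example uses `orth` and `lower` rather than `dep`.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")
token = doc[0]

# Integer ID form and string form of the same attributes.
orth_id, orth_text = token.orth, token.orth_
lower_id, lower_text = token.lower, token.lower_

# The StringStore maps between the two forms in both directions.
assert nlp.vocab.strings[orth_id] == orth_text
assert nlp.vocab.strings[lower_text] == lower_id

# Boolean attributes have no string form and no trailing underscore.
is_alpha = token.is_alpha
```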
Attribute | Description |
---|---|
`DEP` | The token's dependency label. |
`ENT_ID` | The token's entity ID (`ent_id`). |
`ENT_IOB` | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is `0`, `I` is `1`, `O` is `2`, and `B` is `3`. |
`ENT_KB_ID` | The token's entity knowledge base ID. |
`ENT_TYPE` | The token's entity label. |
`IS_ALPHA` | Token text consists of alphabetic characters. |
`IS_ASCII` | Token text consists of ASCII characters. |
`IS_DIGIT` | Token text consists of digits. |
`IS_LOWER` | Token text is in lowercase. |
`IS_PUNCT` | Token is punctuation. |
`IS_SPACE` | Token is whitespace. |
`IS_STOP` | Token is a stop word. |
`IS_TITLE` | Token text is in titlecase. |
`IS_UPPER` | Token text is in uppercase. |
`LEMMA` | The token's lemma. |
`LENGTH` | The length of the token text. |
`LIKE_EMAIL` | Token text resembles an email address. |
`LIKE_NUM` | Token text resembles a number. |
`LIKE_URL` | Token text resembles a URL. |
`LOWER` | The lowercase form of the token text. |
`MORPH` | The token's morphological analysis. |
`NORM` | The normalized form of the token text. |
`ORTH` | The exact verbatim text of a token. |
`POS` | The token's universal part of speech (UPOS). |
`SENT_START` | Token is start of sentence. |
`SHAPE` | The token's shape. |
`SPACY` | Token has a trailing space. |
`TAG` | The token's fine-grained part of speech. |
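The integer encoding of `ENT_IOB` can be checked with a short sketch. It assumes spaCy v3, where `Doc.set_ents` is available, and uses a made-up sentence and a hypothetical `PERSON` span purely for illustration.

```python
import spacy
from spacy.attrs import ENT_IOB
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Ada Lovelace wrote programs")

# Before any entity annotation is set, ENT_IOB is unset (0).
unset_ints = [t.ent_iob for t in doc]

# Mark the first two tokens as one entity and everything else as outside.
doc.set_ents([Span(doc, 0, 2, label="PERSON")], default="outside")

iob_ints = [t.ent_iob for t in doc]   # custom integer codes
iob_strs = [t.ent_iob_ for t in doc]  # corresponding IOB letters

# The same codes appear when exporting the attribute as an array column.
iob_column = doc.to_array([ENT_IOB]).tolist()
```

Because these codes are plain integers rather than string store hashes, they can be compared directly against `0` to `3` without a lookup.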