Add API docs for token attribute symbols (#10836)

* Add API docs for token attribute symbols
* Remove NBSP's
* Fix typo
* Rephrase

Co-authored-by: svlandeg <svlandeg@github.com>
2025-11-17 16:26:09 +03:00 · 2022-06-23 08:16:38 +02:00 · 2022-06-23 08:16:38 +02:00 · f1197d9175
commit f1197d9175
parent 3335bb9d0c
2 changed files with 79 additions and 0 deletions
--- a/website/docs/api/attributes.md
+++ b/website/docs/api/attributes.md
@@ -0,0 +1,78 @@
+---
+title: Attributes
+teaser: Token attributes
+source: spacy/attrs.pyx
+---
+
+[Token](/api/token) attributes are specified using internal IDs in many places
+including:
+
+- [`Matcher` patterns](/api/matcher#patterns)
+- [`Doc.to_array`](/api/doc#to_array) and
+  [`Doc.from_array`](/api/doc#from_array)
+- [`Doc.has_annotation`](/api/doc#has_annotation)
+- the `attrs` argument of the
+  [`MultiHashEmbed`](/api/architectures#MultiHashEmbed) Tok2Vec architecture
+
+> ```python
+> import spacy
+> from spacy.attrs import DEP
+>
+> nlp = spacy.blank("en")
+> doc = nlp("There are many attributes.")
+>
+> # DEP always has the same internal value
+> assert DEP == 76
+>
+> # "DEP" is automatically converted to DEP
+> assert DEP == nlp.vocab.strings["DEP"]
+> assert doc.has_annotation(DEP) == doc.has_annotation("DEP")
+>
+> # look up IDs in spacy.attrs.IDS
+> from spacy.attrs import IDS
+> assert IDS["DEP"] == DEP
+> ```
+
+All of these methods automatically convert between the string version of an ID
+(`"DEP"`) and the internal integer symbol (`DEP`). The internal IDs can be
+imported from `spacy.attrs` or retrieved from the
+[`StringStore`](/api/stringstore). A map from string attribute names to internal
+attribute IDs is stored in `spacy.attrs.IDS`.
+
+The corresponding [`Token` object attributes](/api/token#attributes) can be
+accessed using the same names in lowercase, e.g. `token.orth` or `token.length`.
+For attributes that represent string values, the internal integer ID is
+accessed as `Token.attr`, e.g. `token.dep`, while the string value can be
+retrieved by appending `_` as in `token.dep_`.
+
+
+| Attribute    | Description                                                                                                                                                   |
+| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `DEP`        | The token's dependency label. ~~str~~                                                                                                                         |
+| `ENT_ID`     | The token's entity ID (`ent_id`). ~~str~~                                                                                                                     |
+| `ENT_IOB`    | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is `0`, `I` is `1`, `O` is `2`, and `B` is `3`. ~~str~~ |
+| `ENT_KB_ID`  | The token's entity knowledge base ID. ~~str~~                                                                                                                 |
+| `ENT_TYPE`   | The token's entity label. ~~str~~                                                                                                                             |
+| `IS_ALPHA`   | Token text consists of alphabetic characters. ~~bool~~                                                                                                        |
+| `IS_ASCII`   | Token text consists of ASCII characters. ~~bool~~                                                                                                             |
+| `IS_DIGIT`   | Token text consists of digits. ~~bool~~                                                                                                                       |
+| `IS_LOWER`   | Token text is in lowercase. ~~bool~~                                                                                                                          |
+| `IS_PUNCT`   | Token is punctuation. ~~bool~~                                                                                                                                |
+| `IS_SPACE`   | Token is whitespace. ~~bool~~                                                                                                                                 |
+| `IS_STOP`    | Token is a stop word. ~~bool~~                                                                                                                                |
+| `IS_TITLE`   | Token text is in titlecase. ~~bool~~                                                                                                                          |
+| `IS_UPPER`   | Token text is in uppercase. ~~bool~~                                                                                                                          |
+| `LEMMA`      | The token's lemma. ~~str~~                                                                                                                                    |
+| `LENGTH`     | The length of the token text. ~~int~~                                                                                                                         |
+| `LIKE_EMAIL` | Token text resembles an email address. ~~bool~~                                                                                                               |
+| `LIKE_NUM`   | Token text resembles a number. ~~bool~~                                                                                                                       |
+| `LIKE_URL`   | Token text resembles a URL. ~~bool~~                                                                                                                          |
+| `LOWER`      | The lowercase form of the token text. ~~str~~                                                                                                                 |
+| `MORPH`      | The token's morphological analysis. ~~MorphAnalysis~~                                                                                                         |
+| `NORM`       | The normalized form of the token text. ~~str~~                                                                                                                |
+| `ORTH`       | The exact verbatim text of a token. ~~str~~                                                                                                                   |
+| `POS`        | The token's universal part of speech (UPOS). ~~str~~                                                                                                          |
+| `SENT_START` | Token is start of sentence. ~~bool~~                                                                                                                          |
+| `SHAPE`      | The token's shape. ~~str~~                                                                                                                                    |
+| `SPACY`      | Token has a trailing space. ~~bool~~                                                                                                                          |
+| `TAG`        | The token's fine-grained part of speech. ~~str~~                                                                                                              |
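
Note on the `ENT_IOB` row in the table above: unlike the other string-valued attributes, its integers are a fixed custom encoding and cannot be resolved through the string store. A minimal sketch of that encoding in plain Python follows. It does not import spaCy, and the names `ENT_IOB_CODES` and `decode_ent_iob` are hypothetical helpers for illustration, not part of the library. Rendering the unset value as an empty string is an assumption of this sketch.

```python
# Hypothetical helper, not part of spaCy: maps the ENT_IOB integer codes
# listed in the attribute table to their IOB letters.
# Assumption: the unset value (0) is shown as an empty string.
ENT_IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def decode_ent_iob(values):
    """Translate a sequence of ENT_IOB integer codes into IOB strings."""
    return [ENT_IOB_CODES[int(v)] for v in values]


# Example codes, such as one column of an integer array of ENT_IOB values.
codes = [3, 1, 2, 0]
labels = decode_ent_iob(codes)
print(labels)
```

A lookup table of this kind would only be needed when working with raw integer values, for example a column exported with `Doc.to_array`. On a `Token`, the string form is already available through the `_`-suffixed attribute described in the docs.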
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -124,6 +124,7 @@
            {
                "label": "Other",
                "items": [
+                    { "text": "Attributes", "url": "/api/attributes" },
                    { "text": "Corpus", "url": "/api/corpus" },
                    { "text": "KnowledgeBase", "url": "/api/kb" },
                    { "text": "Lookups", "url": "/api/lookups" },