---
title: Attributes
teaser: Token attributes
source: spacy/attrs.pyx
---

[Token](/api/token) attributes are specified using internal IDs in many places,
including:

- [`Matcher` patterns](/api/matcher#patterns)
- [`Doc.to_array`](/api/doc#to_array) and
  [`Doc.from_array`](/api/doc#from_array)
- [`Doc.has_annotation`](/api/doc#has_annotation)
- [`MultiHashEmbed`](/api/architectures#MultiHashEmbed) Tok2Vec architecture
  `attrs`

> ```python
> import spacy
> from spacy.attrs import DEP
>
> nlp = spacy.blank("en")
> doc = nlp("There are many attributes.")
>
> # DEP always has the same internal value
> assert DEP == 76
>
> # "DEP" is automatically converted to DEP
> assert DEP == nlp.vocab.strings["DEP"]
> assert doc.has_annotation(DEP) == doc.has_annotation("DEP")
>
> # look up IDs in spacy.attrs.IDS
> from spacy.attrs import IDS
> assert IDS["DEP"] == DEP
> ```

						|
All methods automatically convert between the string version of an ID (`"DEP"`)
 | 
						|
and the internal integer symbols (`DEP`). The internal IDs can be imported from
 | 
						|
`spacy.attrs` or retrieved from the [`StringStore`](/api/stringstore). A map
 | 
						|
from string attribute names to internal attribute IDs is stored in
 | 
						|
`spacy.attrs.IDS`.
 | 
						|
 | 
						|
The corresponding [`Token` object attributes](/api/token#attributes) can be
accessed using the same names in lowercase, e.g. `token.orth` or `token.length`.
For attributes that represent string values, the internal integer ID is accessed
as `Token.attr`, e.g. `token.dep`, while the string value can be retrieved by
appending `_` as in `token.dep_`.

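As a minimal sketch of the naming conventions described above, the following
example compares the integer and string forms of a token attribute and passes
attributes to `Doc.to_array` both as imported IDs and as string names. It
assumes spaCy v3 is installed and uses a blank English pipeline, so only lexical
attributes (not `DEP`, `TAG` and similar) carry meaningful values here.

```python
import spacy
from spacy.attrs import IS_ALPHA, LOWER, ORTH

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")
token = doc[0]

# String-valued attribute: `lower` is the integer ID, `lower_` is the string
assert token.lower_ == "there"
assert token.lower == nlp.vocab.strings["there"]
assert nlp.vocab.strings[token.lower] == "there"

# Boolean attribute: accessed under the lowercase name of IS_ALPHA
assert token.is_alpha

# Doc.to_array accepts imported IDs or their string names interchangeably
by_id = doc.to_array([ORTH, LOWER, IS_ALPHA])
by_name = doc.to_array(["ORTH", "LOWER", "IS_ALPHA"])
assert by_id.shape == (len(doc), 3)
assert (by_id == by_name).all()

# The first column holds the same integer ID as token.orth
assert by_id[0, 0] == token.orth
```

Each row of the resulting array corresponds to one token and each column to one
of the requested attributes, in the order they were passed in.
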
| Attribute    | Description                                                                                                                                                    |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `DEP`        | The token's dependency label. ~~str~~                                                                                                                          |
| `ENT_ID`     | The token's entity ID (`ent_id`). ~~str~~                                                                                                                      |
| `ENT_IOB`    | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is `0`, `I` is `1`, `O` is `2`, and `B` is `3`. ~~str~~ |
| `ENT_KB_ID`  | The token's entity knowledge base ID. ~~str~~                                                                                                                  |
| `ENT_TYPE`   | The token's entity label. ~~str~~                                                                                                                              |
| `IS_ALPHA`   | Token text consists of alphabetic characters. ~~bool~~                                                                                                         |
| `IS_ASCII`   | Token text consists of ASCII characters. ~~bool~~                                                                                                              |
| `IS_DIGIT`   | Token text consists of digits. ~~bool~~                                                                                                                        |
| `IS_LOWER`   | Token text is in lowercase. ~~bool~~                                                                                                                           |
| `IS_PUNCT`   | Token is punctuation. ~~bool~~                                                                                                                                 |
| `IS_SPACE`   | Token is whitespace. ~~bool~~                                                                                                                                  |
| `IS_STOP`    | Token is a stop word. ~~bool~~                                                                                                                                 |
| `IS_TITLE`   | Token text is in titlecase. ~~bool~~                                                                                                                           |
| `IS_UPPER`   | Token text is in uppercase. ~~bool~~                                                                                                                           |
| `LEMMA`      | The token's lemma. ~~str~~                                                                                                                                     |
| `LENGTH`     | The length of the token text. ~~int~~                                                                                                                          |
| `LIKE_EMAIL` | Token text resembles an email address. ~~bool~~                                                                                                                |
| `LIKE_NUM`   | Token text resembles a number. ~~bool~~                                                                                                                        |
| `LIKE_URL`   | Token text resembles a URL. ~~bool~~                                                                                                                           |
| `LOWER`      | The lowercase form of the token text. ~~str~~                                                                                                                  |
| `MORPH`      | The token's morphological analysis. ~~MorphAnalysis~~                                                                                                          |
| `NORM`       | The normalized form of the token text. ~~str~~                                                                                                                 |
| `ORTH`       | The exact verbatim text of a token. ~~str~~                                                                                                                    |
| `POS`        | The token's universal part of speech (UPOS). ~~str~~                                                                                                           |
| `SENT_START` | Token is start of sentence. ~~bool~~                                                                                                                           |
| `SHAPE`      | The token's shape. ~~str~~                                                                                                                                     |
| `SPACY`      | Token has a trailing space. ~~bool~~                                                                                                                           |
| `TAG`        | The token's fine-grained part of speech. ~~str~~                                                                                                               |
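
Because `ENT_IOB` uses its own integer codes rather than string store hashes,
the values cannot be resolved through the `StringStore`. The sketch below shows
one way to map them back to IOB letters using the codes listed in the table. The
names `ENT_IOB_CODES` and `decode_ent_iob` are hypothetical helpers introduced
for illustration and are not part of spaCy.

```python
# Integer codes for ENT_IOB as listed in the table above ("" means unset).
# Hypothetical lookup table for illustration; not part of the spaCy API.
ENT_IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def decode_ent_iob(values):
    """Map integer ENT_IOB codes to their IOB letters."""
    return [ENT_IOB_CODES[int(value)] for value in values]


print(decode_ent_iob([3, 1, 2, 0]))  # prints ['B', 'I', 'O', '']
```

A helper like this could presumably be applied to an `ENT_IOB` column exported
with `Doc.to_array`, since that column would contain these integer codes.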