C-language objects that let you group variables together
TokenC
Cython data container for the Token object.
Example
token = &doc.c[3]
token_ptr = &doc.c[3]
| Name | Type | Description |
| --- | --- | --- |
| `lex` | `const LexemeC*` | A pointer to the lexeme for the token. |
| `morph` | `uint64_t` | An ID allowing lookup of morphological attributes. |
| `pos` | `univ_pos_t` | Coarse-grained part-of-speech tag. |
| `spacy` | `bint` | A binary value indicating whether the token has trailing whitespace. |
| `tag` | `attr_t` | Fine-grained part-of-speech tag. |
| `idx` | `int` | The character offset of the token within the parent document. |
| `lemma` | `attr_t` | Base form of the token, with no inflectional suffixes. |
| `sense` | `attr_t` | Space for storing a word sense ID, currently unused. |
| `head` | `int` | Offset of the syntactic parent relative to the token. |
| `dep` | `attr_t` | Syntactic dependency relation. |
| `l_kids` | `uint32_t` | Number of left children. |
| `r_kids` | `uint32_t` | Number of right children. |
| `l_edge` | `uint32_t` | Offset of the leftmost token of this token's syntactic descendants. |
| `r_edge` | `uint32_t` | Offset of the rightmost token of this token's syntactic descendants. |
| `sent_start` | `int` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, `0`, is interpreted as no sentence break. Sentence boundary detectors will usually set `0` for all tokens except tokens that follow a sentence boundary. |
| `ent_iob` | `int` | IOB code of the named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`. |
| `ent_type` | `attr_t` | Named entity type. |
| `ent_id` | `attr_t` | ID of the entity the token is an instance of, if any. Currently unused, but potentially useful for coreference resolution. |
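Several of these fields are stored as small integers that need a little decoding when read. The following is a minimal pure-Python sketch of that decoding, not spaCy code: `TokenRecord`, `head_index`, `is_sent_start` and `iob_label` are hypothetical names used only for illustration, the list of records stands in for a `TokenC*` array, and a root token is assumed to have a `head` offset of 0.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a few TokenC fields (not a spaCy type)."""
    head: int = 0        # offset of the syntactic parent, relative to this token
    sent_start: int = 0  # 0 = missing, -1 = False, 1 = True
    ent_iob: int = 0     # 0 = missing, 1 = I, 2 = O, 3 = B


def head_index(i: int, token: TokenRecord) -> int:
    """Turn the relative head offset into an absolute array index."""
    return i + token.head


def is_sent_start(token: TokenRecord) -> bool:
    """Treat the default value 0 (missing) as 'no sentence break'."""
    return token.sent_start == 1


def iob_label(token: TokenRecord) -> str:
    """Map the integer IOB code to its letter; 0 (missing) maps to ''."""
    return ("", "I", "O", "B")[token.ent_iob]


# "New York rocks": "New" attaches to "York", "York" attaches to "rocks".
# Assumption: the root ("rocks") points at itself, i.e. has offset 0.
tokens = [
    TokenRecord(head=1, sent_start=1, ent_iob=3),
    TokenRecord(head=1, ent_iob=1),
    TokenRecord(head=0, ent_iob=2),
]
heads = [head_index(i, t) for i, t in enumerate(tokens)]
labels = [iob_label(t) for t in tokens]
```

Storing `head` as a relative offset means a slice of the array keeps its internal attachments valid without rewriting every entry, which is one plausible reason for the design.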
Token.get_struct_attr
Get the value of an attribute from the TokenC struct by attribute ID.
The index of the token in the array or -1 if not found.
set_children_from_heads
Set attributes that allow lookup of syntactic children on a TokenC* array.
This function must be called after making changes to the TokenC.head
attribute, in order to make the parse tree navigation consistent.
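To make the effect of this function concrete, here is a minimal pure-Python sketch that recomputes `l_kids`, `r_kids`, `l_edge` and `r_edge` from relative `head` offsets. It is an illustration, not spaCy's implementation: `TokenRecord` and `recompute_children` are hypothetical names, a root is assumed to have a `head` offset of 0, and the edges are assumed to be stored as indices into the array.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRecord:
    """Hypothetical stand-in for the tree-related TokenC fields."""
    head: int = 0    # offset of the parent relative to this token (0 = root)
    l_kids: int = 0  # number of left children
    r_kids: int = 0  # number of right children
    l_edge: int = 0  # index of the leftmost descendant (assumed absolute)
    r_edge: int = 0  # index of the rightmost descendant (assumed absolute)


def recompute_children(tokens: List[TokenRecord]) -> None:
    """Rebuild child counts and subtree edges after `head` values changed."""
    # Reset: every token starts as a subtree that spans only itself.
    for i, tok in enumerate(tokens):
        tok.l_kids = tok.r_kids = 0
        tok.l_edge = tok.r_edge = i
    # Count direct children on each side of their parent.
    for i, tok in enumerate(tokens):
        parent = i + tok.head
        if parent > i:
            tokens[parent].l_kids += 1
        elif parent < i:
            tokens[parent].r_kids += 1
    # Push subtree edges up to the parents until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if tok.head == 0:
                continue
            parent = tokens[i + tok.head]
            new_l = min(parent.l_edge, tok.l_edge)
            new_r = max(parent.r_edge, tok.r_edge)
            if (new_l, new_r) != (parent.l_edge, parent.r_edge):
                parent.l_edge, parent.r_edge = new_l, new_r
                changed = True


# "The quick fox jumps high": fox <- The, quick; jumps <- fox, high.
tokens = [TokenRecord(head=h) for h in (2, 1, 1, 0, -1)]
recompute_children(tokens)
fox, jumps = tokens[2], tokens[3]
```

In this model, editing a `head` value without calling `recompute_children` again would leave the child counts and edges describing the old tree, which is the inconsistency the function exists to prevent.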
LexemeC
Struct holding information about a lexical type. LexemeC structs are usually owned by the Vocab, and accessed through a read-only pointer on the TokenC struct.
Example
lex = doc.c[3].lex
| Name | Type | Description |
| --- | --- | --- |
| `flags` | `flags_t` | Bit-field for binary lexical flag values. |
| `id` | `attr_t` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
| `length` | `attr_t` | Number of Unicode characters in the lexeme. |
| `orth` | `attr_t` | ID of the verbatim text content. |
| `lower` | `attr_t` | ID of the lowercase form of the lexeme. |
| `norm` | `attr_t` | ID of the lexeme's norm, i.e. a normalized form of the text. |
| `shape` | `attr_t` | Transform of the lexeme's string, to show orthographic features. |
| `prefix` | `attr_t` | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
| `suffix` | `attr_t` | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
| `cluster` | `attr_t` | Brown cluster ID. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `sentiment` | `float` | A scalar value indicating positivity or negativity. |
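As a rough illustration of how a bit-field such as `flags` packs many boolean values into one integer, and of what the `prefix` and `suffix` defaults mean, here is a pure-Python sketch. The bit positions `IS_ALPHA_BIT` and `IS_DIGIT_BIT` and all helper names are hypothetical, chosen for this example only. Since `prefix` and `suffix` have type `attr_t`, the sketch assumes the struct stores IDs of these substrings rather than the strings themselves.

```python
# Hypothetical bit positions, for illustration only.
IS_ALPHA_BIT = 0
IS_DIGIT_BIT = 1


def set_flag(flags: int, bit: int, value: bool) -> int:
    """Return `flags` with the given bit switched on or off."""
    mask = 1 << bit
    return flags | mask if value else flags & ~mask


def check_flag(flags: int, bit: int) -> bool:
    """Read a single boolean value back out of the bit-field."""
    return bool(flags & (1 << bit))


def prefix(text: str, n: int = 1) -> str:
    """Length-N substring from the start of the string (N=1 by default)."""
    return text[:n]


def suffix(text: str, n: int = 3) -> str:
    """Length-N substring from the end of the string (N=3 by default)."""
    return text[-n:]


word = "running"
flags = 0
flags = set_flag(flags, IS_ALPHA_BIT, word.isalpha())
flags = set_flag(flags, IS_DIGIT_BIT, word.isdigit())
```

Packing the flags into a single integer keeps the struct compact, and each check is a single mask-and-test operation.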
Lexeme.get_struct_attr
Get the value of an attribute from the LexemeC struct by attribute ID.
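Looking up a struct field by attribute ID amounts to a dispatch from an integer ID to a field. The sketch below models that idea in pure Python. It is not spaCy's implementation: `AttrId` and its numeric values, `LexemeRecord` and `get_struct_attr_model` are hypothetical, and returning 0 for an unknown ID is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class AttrId(IntEnum):
    """Hypothetical attribute IDs; the values are made up for this example."""
    ORTH = 1
    LOWER = 2
    LENGTH = 3


@dataclass
class LexemeRecord:
    """Hypothetical stand-in for a few LexemeC fields."""
    orth: int
    lower: int
    length: int


def get_struct_attr_model(lex: LexemeRecord, feat_name: int) -> int:
    """Return the field selected by `feat_name`, or 0 if the ID is unknown."""
    if feat_name == AttrId.ORTH:
        return lex.orth
    if feat_name == AttrId.LOWER:
        return lex.lower
    if feat_name == AttrId.LENGTH:
        return lex.length
    return 0  # assumption: unknown IDs yield 0


lex = LexemeRecord(orth=101, lower=102, length=5)
lower_id = get_struct_attr_model(lex, AttrId.LOWER)
```

A lookup of this kind lets calling code select a field at runtime with one integer, for example when the same routine has to extract different attributes for different callers.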