The character offset of the token within the parent document. int
lemma
Base form of the token, with no inflectional suffixes. attr_t (uint64_t)
sense
Space for storing a word sense ID, currently unused. attr_t (uint64_t)
head
Offset of the syntactic parent relative to the token. int
dep
Syntactic dependency relation. attr_t (uint64_t)
l_kids
Number of left children. uint32_t
r_kids
Number of right children. uint32_t
l_edge
Offset of the leftmost token of this token's syntactic descendants. uint32_t
r_edge
Offset of the rightmost token of this token's syntactic descendants. uint32_t
sent_start
Ternary value indicating whether the token is the first word of a sentence: 0 indicates a missing value, -1 indicates False, and 1 indicates True. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. int
ent_iob
IOB code of the named entity tag. 0 indicates a missing value, 1 indicates I, 2 indicates O and 3 indicates B. int
ent_type
Named entity type. attr_t (uint64_t)
ent_id
ID of the entity the token is an instance of, if any. Currently unused, but potentially useful for coreference resolution. attr_t (uint64_t)
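The integer encodings of sent_start and ent_iob described above can be read back with small lookup tables. The following is a minimal pure-Python sketch, not part of the library: the table and function names are hypothetical and only mirror the values listed in the table.

```python
# Hypothetical lookup tables mirroring the encodings in the table above.
SENT_START_MEANING = {0: None, -1: False, 1: True}   # None = missing value
ENT_IOB_LETTERS = {0: "", 1: "I", 2: "O", 3: "B"}    # "" = missing value


def decode_sent_start(value):
    """Return True, False, or None (missing) for a ternary sent_start value."""
    return SENT_START_MEANING[value]


def starts_sentence(value):
    """Treat the default value 0 as 'no sentence break', as described above."""
    return value == 1


def decode_ent_iob(code):
    """Return the IOB letter for an integer ent_iob code ('' if missing)."""
    return ENT_IOB_LETTERS[code]
```

Keeping the "missing" case distinct from False matters when a downstream component needs to know whether boundaries were ever set, rather than merely whether a given token starts a sentence.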
Token.get_struct_attr
Get the value of an attribute from the TokenC struct by attribute ID.
The index of the token in the array or -1 if not found. int
set_children_from_heads
Set attributes that allow lookup of syntactic children on a TokenC* array.
This function must be called after making changes to the TokenC.head
attribute, in order to make the parse tree navigation consistent.
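To illustrate why the child and edge attributes must be recomputed after the heads change, here is a pure-Python sketch that derives l_kids, r_kids, l_edge and r_edge from a list of relative head offsets. It is not the library's implementation: the function name children_from_heads is hypothetical, a head offset of 0 is assumed to mark a root, and the edges are assumed to be stored as absolute token indices.

```python
def children_from_heads(heads):
    """Derive child counts and subtree edges from relative head offsets.

    heads[i] is the offset from token i to its syntactic parent;
    0 is assumed to mean the token is a root.
    """
    n = len(heads)
    for i, offset in enumerate(heads):
        if not 0 <= i + offset < n:
            raise ValueError(f"head of token {i} is out of range")

    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token initially spans only itself
    r_edge = list(range(n))

    for i, offset in enumerate(heads):
        if offset == 0:
            continue  # root: no parent to update
        parent = i + offset
        if i < parent:
            l_kids[parent] += 1
        else:
            r_kids[parent] += 1
        # Walk up the ancestors and widen their spans to include token i.
        ancestor = i
        steps = 0
        while heads[ancestor] != 0 and steps < n:  # steps guards against cycles
            ancestor += heads[ancestor]
            l_edge[ancestor] = min(l_edge[ancestor], i)
            r_edge[ancestor] = max(r_edge[ancestor], i)
            steps += 1
    return l_kids, r_kids, l_edge, r_edge


# "She ate the pizza": She -> ate, ate is the root, the -> pizza, pizza -> ate
l_kids, r_kids, l_edge, r_edge = children_from_heads([1, 0, 1, -2])
```

In this example, changing any single head offset would change several counts and edges at once, which is why the derived attributes have to be refreshed together.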
Struct holding information about a lexical type. LexemeC structs are usually
owned by the Vocab, and accessed through a read-only pointer on the TokenC
struct.
Example
lex = doc.c[3].lex
Name
Description
flags
Bit-field for binary lexical flag values. flags_t (uint64_t)
id
Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. attr_t (uint64_t)
length
Number of Unicode characters in the lexeme. attr_t (uint64_t)
orth
ID of the verbatim text content. attr_t (uint64_t)
lower
ID of the lowercase form of the lexeme. attr_t (uint64_t)
norm
ID of the lexeme's norm, i.e. a normalized form of the text. attr_t (uint64_t)
shape
Transform of the lexeme's string, to show orthographic features. attr_t (uint64_t)
prefix
Length-N substring from the start of the lexeme. Defaults to N=1. attr_t (uint64_t)
suffix
Length-N substring from the end of the lexeme. Defaults to N=3. attr_t (uint64_t)
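As a rough illustration of two of the fields above, the sketch below shows how binary values can be packed into a 64-bit flags field and how the prefix and suffix substrings are cut with their default lengths. It works on plain Python integers and strings, whereas the struct stores IDs. The helper names and the flag IDs IS_ALPHA_FLAG and IS_DIGIT_FLAG are hypothetical and chosen only for this example.

```python
FLAG_BITS = 64  # flags_t is a uint64_t, so 64 binary flags fit in one field

# Hypothetical flag positions, for illustration only.
IS_ALPHA_FLAG = 0
IS_DIGIT_FLAG = 1


def set_flag(flags, flag_id, value):
    """Return a copy of the bit-field with bit flag_id set or cleared."""
    if not 0 <= flag_id < FLAG_BITS:
        raise ValueError("flag_id must be in the range 0..63")
    mask = 1 << flag_id
    return (flags | mask) if value else (flags & ~mask)


def check_flag(flags, flag_id):
    """Return True if bit flag_id is set in the bit-field."""
    return bool(flags & (1 << flag_id))


def prefix(text, n=1):
    """Length-n substring from the start of the text (default n=1)."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the text (default n=3)."""
    return text[max(len(text) - n, 0):]


flags = set_flag(0, IS_ALPHA_FLAG, True)
```

Packing the flags into a single integer keeps each lexeme entry compact and makes a flag check one bitwise operation.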
Lexeme.get_struct_attr
Get the value of an attribute from the LexemeC struct by attribute ID.