spaCy/cython-structs.md at 009ba14aafff1769bff408b2069e69245c441d2b

explosion/spaCy

Fork 0

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Ines Montani 3ae5e02f4f Update docs, types and API consistency

2020-08-17 16:45:24 +02:00

17 KiB

Raw Blame History

title

teaser

Cython Structs

C-language objects that let you group variables together

/api/cython-classes

TokenC

tokenc

LexemeC

lexemec

TokenC

Cython data container for the Token object.

Example

token = &doc.c[3]
token_ptr = &doc.c[3]

Name	Description
`lex`	A pointer to the lexeme for the token. const LexemeC*
`morph`	An ID allowing lookup of morphological attributes. ~~uint64_t~~
`pos`	Coarse-grained part-of-speech tag. ~~univ_pos_t~~
`spacy`	A binary value indicating whether the token has trailing whitespace. ~~bint~~
`tag`	Fine-grained part-of-speech tag. ~~attr_t (uint64_t)~~
`idx`	The character offset of the token within the parent document. ~~int~~
`lemma`	Base form of the token, with no inflectional suffixes. ~~attr_t (uint64_t)~~
`sense`	Space for storing a word sense ID, currently unused. ~~attr_t (uint64_t)~~
`head`	Offset of the syntactic parent relative to the token. ~~int~~
`dep`	Syntactic dependency relation. ~~attr_t (uint64_t)~~
`l_kids`	Number of left children. ~~uint32_t~~
`r_kids`	Number of right children. ~~uint32_t~~
`l_edge`	Offset of the leftmost token of this token's syntactic descendants. ~~uint32_t~~
`r_edge`	Offset of the rightmost token of this token's syntactic descendants. ~~uint32_t~~
`sent_start`	Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. ~~int~~
`ent_iob`	IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `0` and `3` indicates `B`. ~~int~~
`ent_type`	Named entity type. ~~attr_t (uint64_t)~~
`ent_id`	ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~attr_t (uint64_t)~~

Token.get_struct_attr

Get the value of an attribute from the TokenC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.tokens cimport Token

is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)

Name	Description
`token`	A pointer to a `TokenC` struct. const TokenC*
`feat_name`	The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~
RETURNS	The value of the attribute. ~~attr_t (uint64_t)~~

Token.set_struct_attr

Set the value of an attribute of the TokenC struct by attribute ID.

Example

from spacy.attrs cimport TAG
from spacy.tokens cimport Token

token = &doc.c[3]
Token.set_struct_attr(token, TAG, 0)

Name	Description
`token`	A pointer to a `TokenC` struct. const TokenC*
`feat_name`	The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~
`value`	The value to set. ~~attr_t (uint64_t)~~

token_by_start

Find a token in a TokenC* array by the offset of its first character.

Example

from spacy.tokens.doc cimport Doc, token_by_start
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=["hello", "world"])
assert token_by_start(doc.c, doc.length, 6) == 1
assert token_by_start(doc.c, doc.length, 4) == -1

Name	Description
`tokens`	A `TokenC` array. const TokenC
`length`	The number of tokens in the array. ~~int~~
`start_char`	The start index to search for. ~~int~~
RETURNS	The index of the token in the array or `-1` if not found. ~~int~~

token_by_end

Find a token in a TokenC* array by the offset of its final character.

Example

from spacy.tokens.doc cimport Doc, token_by_end
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=["hello", "world"])
assert token_by_end(doc.c, doc.length, 5) == 0
assert token_by_end(doc.c, doc.length, 1) == -1

Name	Description
`tokens`	A `TokenC` array. const TokenC
`length`	The number of tokens in the array. ~~int~~
`end_char`	The end index to search for. ~~int~~
RETURNS	The index of the token in the array or `-1` if not found. ~~int~~

set_children_from_heads

Set attributes that allow lookup of syntactic children on a TokenC* array. This function must be called after making changes to the TokenC.head attribute, in order to make the parse tree navigation consistent.

Example

from spacy.tokens.doc cimport Doc, set_children_from_heads
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=["Baileys", "from", "a", "shoe"])
doc.c[0].head = 0
doc.c[1].head = 0
doc.c[2].head = 3
doc.c[3].head = 1
set_children_from_heads(doc.c, doc.length)
assert doc.c[3].l_kids == 1

Name	Description
`tokens`	A `TokenC` array. const TokenC
`length`	The number of tokens in the array. ~~int~~

LexemeC

Struct holding information about a lexical type. LexemeC structs are usually owned by the Vocab, and accessed through a read-only pointer on the TokenC struct.

Example

lex = doc.c[3].lex

Name	Description
`flags`	Bit-field for binary lexical flag values. ~~flags_t (uint64_t)~~
`id`	Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. ~~attr_t (uint64_t)~~
`length`	Number of unicode characters in the lexeme. ~~attr_t (uint64_t)~~
`orth`	ID of the verbatim text content. ~~attr_t (uint64_t)~~
`lower`	ID of the lowercase form of the lexeme. ~~attr_t (uint64_t)~~
`norm`	ID of the lexeme's norm, i.e. a normalized form of the text. ~~attr_t (uint64_t)~~
`shape`	Transform of the lexeme's string, to show orthographic features. ~~attr_t (uint64_t)~~
`prefix`	Length-N substring from the start of the lexeme. Defaults to `N=1`. ~~attr_t (uint64_t)~~
`suffix`	Length-N substring from the end of the lexeme. Defaults to `N=3`. ~~attr_t (uint64_t)~~

Lexeme.get_struct_attr

Get the value of an attribute from the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)

Name	Description
`lex`	A pointer to a `LexemeC` struct. const LexemeC*
`feat_name`	The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~
RETURNS	The value of the attribute. ~~attr_t (uint64_t)~~

Lexeme.set_struct_attr

Set the value of an attribute of the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport NORM
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)

Name	Description
`lex`	A pointer to a `LexemeC` struct. const LexemeC*
`feat_name`	The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~
`value`	The value to set. ~~attr_t (uint64_t)~~

Lexeme.c_check_flag

Check the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)

Name	Description
`lexeme`	A pointer to a `LexemeC` struct. const LexemeC*
`flag_id`	The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~
RETURNS	The boolean value of the flag. ~~bint~~

Lexeme.c_set_flag

Set the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.c_set_flag(lexeme, IS_STOP, 0)

Name	Description
`lexeme`	A pointer to a `LexemeC` struct. const LexemeC*
`flag_id`	The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~
`value`	The value to set. ~~bint~~

17 KiB Raw Blame History

TokenC

Example

Token.get_struct_attr

Example

Token.set_struct_attr

Example

token_by_start

Example

token_by_end

Example

set_children_from_heads

Example

LexemeC

Example

Lexeme.get_struct_attr

Example

Lexeme.set_struct_attr

Example

Lexeme.c_check_flag

Example

Lexeme.c_set_flag

Example

17 KiB

Raw Blame History