spaCy/website/docs/api/cython-classes.md
2020-08-17 16:45:24 +02:00

---
title: Cython Classes
menu:
  - ['Doc', 'doc']
  - ['Token', 'token']
  - ['Span', 'span']
  - ['Lexeme', 'lexeme']
  - ['Vocab', 'vocab']
  - ['StringStore', 'stringstore']
---

Doc

The Doc object holds an array of TokenC structs.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Doc.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| mem | A memory pool. Allocated memory will be freed once the Doc object is garbage collected. | cymem.Pool |
| vocab | A reference to the shared Vocab object. | Vocab |
| c | A pointer to a TokenC struct. | TokenC* |
| length | The number of tokens in the document. | int |
| max_length | The underlying size of the Doc.c array. | int |

Doc.push_back

Append a token to the Doc. The token can be provided as a LexemeC or TokenC pointer, using Cython's fused types.

Example

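from spacy.tokens.doc cimport Doc
from spacy.vocab cimport Vocab

doc = Doc(Vocab())
# Vocab.get takes a memory pool as its first argument
lexeme = doc.vocab.get(doc.vocab.mem, "hello")
# Append the lexeme and mark it as followed by whitespace
doc.push_back(lexeme, True)
assert doc.text == "hello "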

| Name | Description | Type |
| --- | --- | --- |
| lex_or_tok | The word to append to the Doc. | LexemeOrToken |
| has_space | Whether the word has trailing whitespace. | bint |
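
How length, max_length and has_space interact can be pictured with a small pure-Python model. This is only an illustrative sketch and not spaCy code: the TinyDoc class, its initial capacity and the doubling growth policy are assumptions made for the example, and the real Doc stores TokenC structs in memory owned by its pool.

```python
class TinyDoc:
    """Hypothetical pure-Python stand-in for the C-level token array."""

    def __init__(self, initial_capacity=2):
        # Preallocated slots play the role of the Doc.c array.
        self.c = [None] * initial_capacity
        self.length = 0                     # number of tokens stored
        self.max_length = initial_capacity  # size of the underlying array

    def push_back(self, word, has_space):
        # Grow the array when it is full (doubling is an assumption).
        if self.length == self.max_length:
            self.c.extend([None] * self.max_length)
            self.max_length *= 2
        self.c[self.length] = (word, has_space)
        self.length += 1

    @property
    def text(self):
        # Join the words, adding a space after each token that has one.
        return "".join(
            word + (" " if has_space else "")
            for word, has_space in self.c[: self.length]
        )


doc = TinyDoc()
doc.push_back("hello", True)
doc.push_back("world", False)
doc.push_back("!", False)
```

In this model, length counts the tokens actually stored, while max_length only reports how many slots are available before the array has to grow again.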

Token

A Cython class providing access and methods for a TokenC struct. Note that the Token object does not own the struct. It only receives a pointer to it.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Token.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| vocab | A reference to the shared Vocab object. | Vocab |
| c | A pointer to a TokenC struct. | TokenC* |
| i | The offset of the token within the document. | int |
| doc | The parent document. | Doc |

Token.cinit

Create a Token object from a TokenC* pointer.

Example

token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)

| Name | Description | Type |
| --- | --- | --- |
| vocab | A reference to the shared Vocab. | Vocab |
| c | A pointer to a TokenC struct. | TokenC* |
| offset | The offset of the token within the document. | int |
| doc | The parent document. | Doc |
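
The point that a Token only receives a pointer, and does not own the struct, can be illustrated with a pure-Python analogy. This is a sketch under stated assumptions: TokenView and the dictionary records are hypothetical names invented for the example, and a Python reference only approximates what a C pointer does.

```python
class TokenView:
    """Hypothetical view onto one slot of a parent array (not spaCy's Token)."""

    def __init__(self, array, offset):
        self.array = array  # borrowed reference; the parent owns the data
        self.i = offset     # offset of the token within the parent

    @property
    def orth(self):
        # Always read through to the parent's storage.
        return self.array[self.i]["orth"]


# The parent owns the records, much as a Doc owns its TokenC array.
records = [{"orth": "hello"}, {"orth": "world"}]
first = TokenView(records, 0)
second = TokenView(records, 0)  # a second view of the same slot

records[0]["orth"] = "goodbye"  # change the data through the owner
```

Both views report the new value because neither holds a copy. In the C-level setting the same property means the pointer is only meaningful while the parent's array exists, which is presumably why a Token also keeps a reference to its doc.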

Span

A Cython class providing access and methods for a slice of a Doc object.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Span.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| doc | The parent document. | Doc |
| start | The index of the first token of the span. | int |
| end | The index of the first token after the span. | int |
| start_char | The index of the first character of the span. | int |
| end_char | The index of the last character of the span. | int |
| label | A label to attach to the span, e.g. for named entities. | attr_t (uint64_t) |
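
Because end is the index of the first token after the span, start and end describe a half-open range, the same convention as Python slicing. The sketch below uses a plain list of words and single-space separators, both of which are assumptions made for illustration rather than anything taken from spaCy.

```python
words = ["Apple", "is", "looking", "at", "buying", "a", "startup"]
text = " ".join(words)  # assumes exactly one space between tokens

start = 2  # index of the first token of the span
end = 5    # index of the first token after the span

# Half-open range: the token at index `end` is not included.
span_words = words[start:end]

# Character offset of the span's first token under the one-space assumption.
start_char = len(" ".join(words[:start])) + 1
```

With this convention the number of tokens in the span is simply end - start.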

Lexeme

A Cython class providing access and methods for an entry in the vocabulary.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Lexeme.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| c | A pointer to a LexemeC struct. | LexemeC* |
| vocab | A reference to the shared Vocab object. | Vocab |
| orth | ID of the verbatim text content. | attr_t (uint64_t) |

Vocab

A Cython class providing access and methods for a vocabulary and other data shared across a language.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Vocab.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| mem | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. | cymem.Pool |
| strings | A StringStore that maps strings to hash values and vice versa. | StringStore |
| length | The number of entries in the vocabulary. | int |

Vocab.get

Retrieve a LexemeC* pointer from the vocabulary, looked up by the word's string.

Example

lexeme = vocab.get(vocab.mem, "hello")

| Name | Description | Type |
| --- | --- | --- |
| mem | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. | cymem.Pool |
| string | The string of the word to look up. | str |
| RETURNS | The lexeme in the vocabulary. | const LexemeC* |

Vocab.get_by_orth

Retrieve a LexemeC* pointer from the vocabulary, looked up by the ID of the word's verbatim text.

Example

lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)

| Name | Description | Type |
| --- | --- | --- |
| mem | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. | cymem.Pool |
| orth | ID of the verbatim text content. | attr_t (uint64_t) |
| RETURNS | The lexeme in the vocabulary. | const LexemeC* |
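
The relationship between the two lookups, one keyed by string and one keyed by ID, can be sketched in pure Python. Everything here is an assumption made for illustration: TinyVocab, hash_string and the create-on-first-access behaviour are hypothetical, the hash function is a standard-library stand-in, and the mem argument is left out because Python manages memory itself.

```python
import hashlib


def hash_string(string):
    # Stand-in 64-bit hash; an assumption made for this sketch only.
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class TinyVocab:
    """Hypothetical model of a vocabulary keyed by orth ID (not spaCy's Vocab)."""

    def __init__(self):
        self._by_orth = {}  # orth ID -> lexeme record
        self.length = 0     # number of entries in the vocabulary

    def get(self, string):
        # Look up by string: hash it, then defer to the ID-based lookup.
        return self.get_by_orth(hash_string(string), string)

    def get_by_orth(self, orth, string=None):
        # Assumed behaviour: create the entry on first access.
        if orth not in self._by_orth:
            self._by_orth[orth] = {"orth": orth, "text": string}
            self.length += 1
        return self._by_orth[orth]


vocab = TinyVocab()
lexeme = vocab.get("hello")
same_lexeme = vocab.get_by_orth(lexeme["orth"])
```

In this model both calls hand back the same record, so the string-based lookup is just a convenience on top of the ID-based one.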

StringStore

A lookup table to retrieve strings by 64-bit hashes.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see StringStore.

Attributes

| Name | Description | Type |
| --- | --- | --- |
| mem | A memory pool. Allocated memory will be freed once the StringStore object is garbage collected. | cymem.Pool |
| keys | A list of hash values in the StringStore. | vector[hash_t] (vector[uint64_t]) |
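
The idea of a lookup table that retrieves strings by 64-bit hashes can be modelled in a few lines of pure Python. This is a rough sketch, not the library's implementation: TinyStringStore, its add method and the hash64 helper are hypothetical, and the hash is a standard-library stand-in chosen only because it yields 64-bit values.

```python
import hashlib


def hash64(string):
    # Stand-in 64-bit hash built from the standard library (an assumption).
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class TinyStringStore:
    """Hypothetical hash-to-string table (not spaCy's StringStore)."""

    def __init__(self):
        self.keys = []       # hash values in insertion order
        self._strings = {}   # hash value -> string

    def add(self, string):
        # Store the string once and return its hash.
        key = hash64(string)
        if key not in self._strings:
            self._strings[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, key):
        # Retrieve the original string from its hash.
        return self._strings[key]


store = TinyStringStore()
key = store.add("coffee")
store.add("coffee")  # adding the same string again changes nothing
```

Storing only the integer key elsewhere and resolving it through the table on demand is what lets structs hold fixed-size IDs instead of variable-length strings.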