spaCy/website/docs/api/cython-structs.md
Ines Montani e597110d31
💫 Update website (#3285)
2019-02-17 19:31:19 +01:00

---
title: Cython Structs
teaser: C-language objects that let you group variables together
next: /api/cython-classes
menu:
  - ['TokenC', 'tokenc']
  - ['LexemeC', 'lexemec']
---

TokenC

Cython data container for the Token object.

Example

token = &doc.c[3]
token_ptr = &doc.c[3]

| Name | Type | Description |
| --- | --- | --- |
| `lex` | `const LexemeC*` | A pointer to the lexeme for the token. |
| `morph` | `uint64_t` | An ID allowing lookup of morphological attributes. |
| `pos` | `univ_pos_t` | Coarse-grained part-of-speech tag. |
| `spacy` | `bint` | A binary value indicating whether the token has trailing whitespace. |
| `tag` | `attr_t` | Fine-grained part-of-speech tag. |
| `idx` | `int` | The character offset of the token within the parent document. |
| `lemma` | `attr_t` | Base form of the token, with no inflectional suffixes. |
| `sense` | `attr_t` | Space for storing a word sense ID, currently unused. |
| `head` | `int` | Offset of the syntactic parent relative to the token. |
| `dep` | `attr_t` | Syntactic dependency relation. |
| `l_kids` | `uint32_t` | Number of left children. |
| `r_kids` | `uint32_t` | Number of right children. |
| `l_edge` | `uint32_t` | Offset of the leftmost token of this token's syntactic descendants. |
| `r_edge` | `uint32_t` | Offset of the rightmost token of this token's syntactic descendants. |
| `sent_start` | `int` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, `0`, is interpreted as no sentence break. Sentence boundary detectors will usually set `0` for all tokens except tokens that follow a sentence boundary. |
| `ent_iob` | `int` | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`. |
| `ent_type` | `attr_t` | Named entity type. |
| `ent_id` | `attr_t` | ID of the entity the token is an instance of, if any. Currently unused, but potentially useful for coreference resolution. |
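
The integer codes in `sent_start`, `ent_iob` and `head` are easy to misread, so here is a minimal plain-Python sketch of how they could be decoded. The helper names (`decode_ent_iob`, `decode_sent_start`, `head_index`) are hypothetical and not part of spaCy. The value mappings follow the descriptions above and the usual IOB scheme (I, O, B).

```python
# Hypothetical helpers, not spaCy API: they only illustrate the integer codes.
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}  # 0 means "missing"
SENT_START_CODES = {1: True, -1: False, 0: None}  # 0 means "missing"


def decode_ent_iob(code):
    """Map a TokenC.ent_iob integer to its IOB letter ("" if missing)."""
    return IOB_CODES[code]


def decode_sent_start(value):
    """Map a TokenC.sent_start integer to True, False or None (missing)."""
    return SENT_START_CODES[value]


def head_index(i, head_offset):
    """Resolve the relative TokenC.head offset of token i to an absolute index."""
    return i + head_offset
```

For example, a token at position 3 with `head == -2` has its syntactic parent at position 1, and a token whose `head` is 0 is its own head, i.e. a root.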

Token.get_struct_attr

Get the value of an attribute from the TokenC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.tokens cimport Token

is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)

| Name | Type | Description |
| --- | --- | --- |
| `token` | `const TokenC*` | A pointer to a `TokenC` struct. |
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.attrs`. |
| RETURNS | `attr_t` | The value of the attribute. |

Token.set_struct_attr

Set the value of an attribute of the TokenC struct by attribute ID.

Example

from spacy.attrs cimport TAG
from spacy.tokens cimport Token

token = &doc.c[3]
Token.set_struct_attr(token, TAG, 0)

| Name | Type | Description |
| --- | --- | --- |
| `token` | `const TokenC*` | A pointer to a `TokenC` struct. |
| `feat_name` | `attr_id_t` | The ID of the attribute to set. The attributes are enumerated in `spacy.attrs`. |
| `value` | `attr_t` | The value to set. |
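
The idea behind `Token.get_struct_attr` and `Token.set_struct_attr` is that a struct field is selected by an integer attribute ID instead of by name. The following plain-Python sketch mimics that dispatch. It is an illustration under stated assumptions, not spaCy's implementation: `AttrId`, `TokenRecord` and the numeric ID values are hypothetical stand-ins for `attr_id_t` and `TokenC`, and returning 0 for an unknown ID is a choice made for this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class AttrId(IntEnum):
    """Hypothetical stand-in for attr_id_t; the numeric values are made up."""
    LEMMA = 1
    TAG = 2
    DEP = 3


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a TokenC struct with three attr_t fields."""
    lemma: int = 0
    tag: int = 0
    dep: int = 0


# Maps each attribute ID to the field that stores its value.
_FIELDS = {AttrId.LEMMA: "lemma", AttrId.TAG: "tag", AttrId.DEP: "dep"}


def get_struct_attr(token, feat_name):
    """Return the field selected by feat_name, or 0 if the ID is unknown."""
    field = _FIELDS.get(feat_name)
    return getattr(token, field) if field is not None else 0


def set_struct_attr(token, feat_name, value):
    """Write value to the field selected by feat_name; ignore unknown IDs."""
    field = _FIELDS.get(feat_name)
    if field is not None:
        setattr(token, field, value)
```

Selecting fields by ID lets generic code (for example, a feature extractor) loop over a list of attribute IDs without knowing the field names in advance.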

token_by_start

Find a token in a TokenC* array by the offset of its first character.

Example

from spacy.tokens.doc cimport Doc, token_by_start
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_start(doc.c, doc.length, 6) == 1
assert token_by_start(doc.c, doc.length, 4) == -1

| Name | Type | Description |
| --- | --- | --- |
| `tokens` | `const TokenC*` | A `TokenC*` array. |
| `length` | `int` | The number of tokens in the array. |
| `start_char` | `int` | The start index to search for. |
| RETURNS | `int` | The index of the token in the array or `-1` if not found. |

token_by_end

Find a token in a TokenC* array by its end offset, i.e. the character offset immediately after its final character.

Example

from spacy.tokens.doc cimport Doc, token_by_end
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_end(doc.c, doc.length, 5) == 0
assert token_by_end(doc.c, doc.length, 1) == -1

| Name | Type | Description |
| --- | --- | --- |
| `tokens` | `const TokenC*` | A `TokenC*` array. |
| `length` | `int` | The number of tokens in the array. |
| `end_char` | `int` | The end index to search for. |
| RETURNS | `int` | The index of the token in the array or `-1` if not found. |
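
Both `token_by_start` and `token_by_end` match a character offset against token boundaries. The plain-Python sketch below reproduces the documented behaviour on a list of `(idx, length)` pairs, which is a hypothetical stand-in for a `TokenC*` array. It is not spaCy's implementation, and the linear scan is an assumption made for simplicity.

```python
def token_by_start(tokens, start_char):
    """Return the index of the token that starts at start_char, or -1.

    tokens is a list of (idx, length) pairs standing in for a TokenC* array.
    """
    for i, (idx, _length) in enumerate(tokens):
        if idx == start_char:
            return i
    return -1


def token_by_end(tokens, end_char):
    """Return the index of the token that ends at end_char, or -1.

    A token's end offset is idx + length, one past its final character.
    """
    for i, (idx, length) in enumerate(tokens):
        if idx + length == end_char:
            return i
    return -1


# "hello world": "hello" spans characters 0-5, "world" spans 6-11.
tokens = [(0, 5), (6, 5)]
```

With this layout, offset 6 is the start of the second token and offset 5 is the end of the first one, which mirrors the assertions in the two examples above. Offsets that fall inside a token match nothing and give `-1`.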

set_children_from_heads

Set attributes that allow lookup of syntactic children on a TokenC* array. This function must be called after making changes to the TokenC.head attribute, in order to make the parse tree navigation consistent.

Example

from spacy.tokens.doc cimport Doc, set_children_from_heads
from spacy.vocab cimport Vocab

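doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe'])
# TokenC.head is an offset relative to the token itself (0 marks the root).
doc.c[0].head = 0   # "Baileys" is the root
doc.c[1].head = -1  # "from" attaches to "Baileys"
doc.c[2].head = 1   # "a" attaches to "shoe"
doc.c[3].head = -2  # "shoe" attaches to "from"
set_children_from_heads(doc.c, doc.length)
assert doc.c[3].l_kids == 1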

| Name | Type | Description |
| --- | --- | --- |
| `tokens` | `const TokenC*` | A `TokenC*` array. |
| `length` | `int` | The number of tokens in the array. |
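
To make clear what "attributes that allow lookup of syntactic children" means, here is a plain-Python sketch that derives `l_kids`, `r_kids`, `l_edge` and `r_edge` from a list of heads. It assumes heads are offsets relative to each token (as the `TokenC.head` description states) and that the heads form a tree. Storing the edges as absolute token indices is an assumption of this sketch. The function name `children_from_heads` is hypothetical and this is not spaCy's implementation.

```python
def children_from_heads(heads):
    """Derive child counts and subtree edges from relative head offsets.

    heads[i] is the offset of token i's head relative to i (0 marks a root).
    Returns (l_kids, r_kids, l_edge, r_edge) as lists of ints.
    """
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token starts as its own subtree
    r_edge = list(range(n))

    # Count children to the left and to the right of each head.
    for child, offset in enumerate(heads):
        head = child + offset
        if head > child:
            l_kids[head] += 1
        elif head < child:
            r_kids[head] += 1

    # Walk from each token up to its root, widening every ancestor's span.
    for i in range(n):
        node = i
        steps = 0
        while heads[node] != 0 and steps < n:  # steps guards against cycles
            node += heads[node]
            l_edge[node] = min(l_edge[node], i)
            r_edge[node] = max(r_edge[node], i)
            steps += 1
    return l_kids, r_kids, l_edge, r_edge


# "Baileys from a shoe": Baileys <- from <- shoe <- a
l_kids, r_kids, l_edge, r_edge = children_from_heads([0, -1, 1, -2])
```

Here "shoe" (index 3) has one left child, "a", so `l_kids[3]` is 1, and the subtree rooted at "from" (index 1) spans tokens 1 to 3.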

LexemeC

Struct holding information about a lexical type. LexemeC structs are usually owned by the Vocab, and accessed through a read-only pointer on the TokenC struct.

Example

lex = doc.c[3].lex

| Name | Type | Description |
| --- | --- | --- |
| `flags` | `flags_t` | Bit-field for binary lexical flag values. |
| `id` | `attr_t` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
| `length` | `attr_t` | Number of unicode characters in the lexeme. |
| `orth` | `attr_t` | ID of the verbatim text content. |
| `lower` | `attr_t` | ID of the lowercase form of the lexeme. |
| `norm` | `attr_t` | ID of the lexeme's norm, i.e. a normalized form of the text. |
| `shape` | `attr_t` | Transform of the lexeme's string, to show orthographic features. |
| `prefix` | `attr_t` | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
| `suffix` | `attr_t` | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
| `cluster` | `attr_t` | Brown cluster ID. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's type. |
| `sentiment` | `float` | A scalar value indicating positivity or negativity. |
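
Several `LexemeC` fields are simple string transforms of the lexeme's text. The sketch below computes rough equivalents in plain Python to show what they capture. It works on strings directly, whereas the struct stores `attr_t` IDs. The function name `lexical_features` is hypothetical, and the shape transform here is a simplified assumption (uppercase to `X`, lowercase to `x`, digits to `d`) that may differ from spaCy's own.

```python
def lexical_features(text, prefix_len=1, suffix_len=3):
    """Compute string-level analogues of some LexemeC fields.

    The defaults mirror the documented N=1 prefix and N=3 suffix.
    """
    # Simplified shape: letters become X/x, digits become d, others are kept.
    shape = "".join(
        "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
        for ch in text
    )
    return {
        "length": len(text),
        "lower": text.lower(),
        "prefix": text[:prefix_len],
        "suffix": text[-suffix_len:],
        "shape": shape,
    }


features = lexical_features("Apple2")
```

For `"Apple2"` this gives the prefix `"A"`, the suffix `"le2"` and the shape `"Xxxxxd"`.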

Lexeme.get_struct_attr

Get the value of an attribute from the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)

| Name | Type | Description |
| --- | --- | --- |
| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.attrs`. |
| RETURNS | `attr_t` | The value of the attribute. |

Lexeme.set_struct_attr

Set the value of an attribute of the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport NORM
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)

| Name | Type | Description |
| --- | --- | --- |
| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
| `feat_name` | `attr_id_t` | The ID of the attribute to set. The attributes are enumerated in `spacy.attrs`. |
| `value` | `attr_t` | The value to set. |

Lexeme.c_check_flag

Check the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)

| Name | Type | Description |
| --- | --- | --- |
| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
| `flag_id` | `attr_id_t` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.attrs`. |
| RETURNS | `bint` | The boolean value of the flag. |

Lexeme.c_set_flag

Set the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.c_set_flag(lexeme, IS_STOP, 0)

| Name | Type | Description |
| --- | --- | --- |
| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
| `flag_id` | `attr_id_t` | The ID of the flag to set. The flag IDs are enumerated in `spacy.attrs`. |
| `value` | `bint` | The value to set. |
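
Since `LexemeC.flags` is described as a bit-field, checking and setting a flag comes down to bit masking. The plain-Python sketch below shows the general technique on an ordinary integer. The function names and the bit positions chosen for `IS_ALPHA` and `IS_STOP` are hypothetical and do not reflect spaCy's actual enum values or implementation.

```python
# Hypothetical bit positions, chosen only for this illustration.
IS_ALPHA = 1
IS_STOP = 2


def check_flag(flags, flag_id):
    """Return True if the bit at position flag_id is set in flags."""
    return bool(flags & (1 << flag_id))


def set_flag(flags, flag_id, value):
    """Return a copy of flags with the bit at flag_id set or cleared."""
    if value:
        return flags | (1 << flag_id)
    return flags & ~(1 << flag_id)


flags = set_flag(0, IS_STOP, True)  # turn the stop-word bit on
```

Packing all boolean attributes into one integer keeps the struct small and makes each check a single mask operation.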