spaCy/cython-structs.md at f7b5ff79072cba2f3919a139c62903d807055f3e

mirror of https://github.com/explosion/spaCy.git synced 2025-07-11 00:32:40 +03:00

2019-02-17 19:31:19 +01:00

Cython Structs

C-language objects that let you group variables together


TokenC

Cython data container for the Token object.

Example

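from spacy.structs cimport TokenC

# doc is assumed to be a typed spacy.tokens.Doc; doc.c is its TokenC array
cdef const TokenC* token_ptr = &doc.c[3]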

Name	Type	Description
`lex`	`const LexemeC*`	A pointer to the lexeme for the token.
`morph`	`uint64_t`	An ID allowing lookup of morphological attributes.
`pos`	`univ_pos_t`	Coarse-grained part-of-speech tag.
`spacy`	`bint`	A binary value indicating whether the token has trailing whitespace.
`tag`	`attr_t`	Fine-grained part-of-speech tag.
`idx`	`int`	The character offset of the token within the parent document.
`lemma`	`attr_t`	Base form of the token, with no inflectional suffixes.
`sense`	`attr_t`	Space for storing a word sense ID, currently unused.
`head`	`int`	Offset of the syntactic parent relative to the token.
`dep`	`attr_t`	Syntactic dependency relation.
`l_kids`	`uint32_t`	Number of left children.
`r_kids`	`uint32_t`	Number of right children.
`l_edge`	`uint32_t`	Offset of the leftmost token of this token's syntactic descendants.
`r_edge`	`uint32_t`	Offset of the rightmost token of this token's syntactic descendants.
`sent_start`	`int`	Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary.
`ent_iob`	`int`	IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`.
`ent_type`	`attr_t`	Named entity type.
`ent_id`	`attr_t`	ID of the entity the token is an instance of, if any. Currently unused, but could potentially be used for coreference resolution.
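
The integer codes stored in `sent_start` and `ent_iob` can be decoded as in the following pure-Python sketch. The helper names `decode_sent_start` and `decode_ent_iob` are hypothetical and not part of spaCy. The mapping follows the table above, reading the IOB code `2` as the letter `O` (outside an entity).

```python
# Hypothetical helpers: they only illustrate the integer codes documented
# for TokenC.sent_start and TokenC.ent_iob. They are not spaCy functions.
SENT_START = {0: None, -1: False, 1: True}
ENT_IOB = {0: "", 1: "I", 2: "O", 3: "B"}


def decode_sent_start(value):
    """Map the ternary code to True, False or None (missing value)."""
    return SENT_START[value]


def decode_ent_iob(value):
    """Map the integer IOB code to its letter ("" means missing)."""
    return ENT_IOB[value]


codes = [3, 1, 2, 0]
labels = [decode_ent_iob(code) for code in codes]
print(labels)
```

Keeping the missing value distinct from `False` lets a caller tell "no sentence break" apart from "not yet annotated".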

Token.get_struct_attr

Get the value of an attribute from the TokenC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.tokens cimport Token

is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)

Name	Type	Description
`token`	`const TokenC*`	A pointer to a `TokenC` struct.
`feat_name`	`attr_id_t`	The ID of the attribute to look up. The attributes are enumerated in `spacy.attrs`.
RETURNS	`attr_t`	The value of the attribute.
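
A lookup by attribute ID can be pictured as a dispatch that reads either a field on the token itself or a field reached through its `lex` pointer. The following pure-Python sketch makes that idea concrete under stated assumptions: the classes `TokenSketch` and `LexemeSketch`, the ID values, and the fallback of returning `0` for unknown IDs are all hypothetical and do not describe spaCy's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical attribute IDs; the real values come from spaCy's attribute enum.
LOWER, TAG, LEMMA = 100, 101, 102


@dataclass
class LexemeSketch:
    lower: int  # stands in for LexemeC.lower


@dataclass
class TokenSketch:
    lex: LexemeSketch  # stands in for the TokenC.lex pointer
    tag: int
    lemma: int


def get_struct_attr(token, feat_name):
    """Return a token-level field if the ID names one, else ask the lexeme."""
    if feat_name == TAG:
        return token.tag
    if feat_name == LEMMA:
        return token.lemma
    if feat_name == LOWER:
        return token.lex.lower
    return 0  # assumption of this sketch: unknown IDs yield 0


token = TokenSketch(lex=LexemeSketch(lower=7), tag=42, lemma=9)
print(get_struct_attr(token, TAG), get_struct_attr(token, LOWER))
```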

Token.set_struct_attr

Set the value of an attribute of the TokenC struct by attribute ID.

Example

from spacy.attrs cimport TAG
from spacy.tokens cimport Token

token = &doc.c[3]
Token.set_struct_attr(token, TAG, 0)

Name	Type	Description
`token`	`TokenC*`	A pointer to a `TokenC` struct. The pointer must not be `const`, since the struct is modified.
`feat_name`	`attr_id_t`	The ID of the attribute to set. The attributes are enumerated in `spacy.attrs`.
`value`	`attr_t`	The value to set.

token_by_start

Find a token in a TokenC* array by the offset of its first character.

Example

from spacy.tokens.doc cimport Doc, token_by_start
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_start(doc.c, doc.length, 6) == 1
assert token_by_start(doc.c, doc.length, 4) == -1

Name	Type	Description
`tokens`	`const TokenC*`	A `TokenC*` array.
`length`	`int`	The number of tokens in the array.
`start_char`	`int`	The start index to search for.
RETURNS	`int`	The index of the token in the array or `-1` if not found.
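
The lookup semantics can be sketched in pure Python as a scan over the start offsets of the tokens. This is a minimal illustration that assumes a plain list of offsets in place of the `TokenC*` array. It is not the library's implementation, which may search differently.

```python
def token_by_start(starts, start_char):
    """Return the index of the token starting at start_char, or -1."""
    for i, idx in enumerate(starts):
        if idx == start_char:
            return i
    return -1


# "hello world": the two tokens start at character offsets 0 and 6.
starts = [0, 6]
print(token_by_start(starts, 6), token_by_start(starts, 4))
```

An offset that falls inside a token, such as `4`, is not a match. Only exact token starts are found.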

token_by_end

Find a token in a TokenC* array by the character offset at which it ends, i.e. the offset of its first character plus its length.

Example

from spacy.tokens.doc cimport Doc, token_by_end
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_end(doc.c, doc.length, 5) == 0
assert token_by_end(doc.c, doc.length, 1) == -1

Name	Type	Description
`tokens`	`const TokenC*`	A `TokenC*` array.
`length`	`int`	The number of tokens in the array.
`end_char`	`int`	The end index to search for.
RETURNS	`int`	The index of the token in the array or `-1` if not found.
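
As the example suggests, the end offset that matches a token is its start offset plus its length, so `hello` at offset 0 ends at 5. The pure-Python sketch below assumes parallel lists of start offsets and lengths in place of the `TokenC*` array. It illustrates the semantics only and is not spaCy's implementation.

```python
def token_by_end(starts, lengths, end_char):
    """Return the index of the token ending at end_char, or -1."""
    for i, (idx, length) in enumerate(zip(starts, lengths)):
        if idx + length == end_char:
            return i
    return -1


# "hello world": offsets 0 and 6, both tokens are 5 characters long.
starts = [0, 6]
lengths = [5, 5]
print(token_by_end(starts, lengths, 5), token_by_end(starts, lengths, 1))
```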

set_children_from_heads

Set attributes that allow lookup of syntactic children on a TokenC* array. This function must be called after making changes to the TokenC.head attribute, in order to make the parse tree navigation consistent.

Example

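from spacy.tokens.doc cimport Doc, set_children_from_heads
from spacy.vocab cimport Vocab

doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe'])
# TokenC.head is an offset relative to the token itself
doc.c[0].head = 0   # "Baileys" is the root
doc.c[1].head = -1  # "from" attaches to "Baileys"
doc.c[2].head = 1   # "a" attaches to "shoe"
doc.c[3].head = -2  # "shoe" attaches to "from"
set_children_from_heads(doc.c, doc.length)
assert doc.c[3].l_kids == 1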

Name	Type	Description
`tokens`	`TokenC*`	A `TokenC*` array. The pointer must not be `const`, since the structs are modified.
`length`	`int`	The number of tokens in the array.
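
To illustrate what is derived from the heads, the pure-Python sketch below computes child counts and subtree edges for the four-token parse of "Baileys from a shoe". It assumes that heads are offsets relative to each token, as described for `TokenC.head`, and that the edges are stored as absolute token indices. The function `children_from_heads` is hypothetical and not spaCy's implementation.

```python
def children_from_heads(heads):
    """Derive child counts and subtree edges from relative head offsets."""
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token starts as its own left edge
    r_edge = list(range(n))  # ... and its own right edge
    for i, offset in enumerate(heads):
        head = i + offset
        if i < head:
            l_kids[head] += 1  # child sits to the left of its head
        elif i > head:
            r_kids[head] += 1  # child sits to the right of its head
    # Propagate edges from children to heads until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, offset in enumerate(heads):
            head = i + offset
            if l_edge[i] < l_edge[head]:
                l_edge[head] = l_edge[i]
                changed = True
            if r_edge[i] > r_edge[head]:
                r_edge[head] = r_edge[i]
                changed = True
    return l_kids, r_kids, l_edge, r_edge


# Baileys (root), from -> Baileys, a -> shoe, shoe -> from
heads = [0, -1, 1, -2]
l_kids, r_kids, l_edge, r_edge = children_from_heads(heads)
print(l_kids, r_kids, l_edge, r_edge)
```

Because these values are derived from the heads, they go stale whenever a head changes, which is why the function has to be called again after such a change.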

LexemeC

Struct holding information about a lexical type. LexemeC structs are usually owned by the Vocab, and accessed through a read-only pointer on the TokenC struct.

Example

lex = doc.c[3].lex

Name	Type	Description
`flags`	`flags_t`	Bit-field for binary lexical flag values.
`id`	`attr_t`	Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed.
`length`	`attr_t`	Number of Unicode characters in the lexeme.
`orth`	`attr_t`	ID of the verbatim text content.
`lower`	`attr_t`	ID of the lowercase form of the lexeme.
`norm`	`attr_t`	ID of the lexeme's norm, i.e. a normalized form of the text.
`shape`	`attr_t`	Transform of the lexeme's string, to show orthographic features.
`prefix`	`attr_t`	Length-N substring from the start of the lexeme. Defaults to `N=1`.
`suffix`	`attr_t`	Length-N substring from the end of the lexeme. Defaults to `N=3`.
`cluster`	`attr_t`	Brown cluster ID.
`prob`	`float`	Smoothed log probability estimate of the lexeme's type.
`sentiment`	`float`	A scalar value indicating positivity or negativity.
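
Several of these fields are simple string transforms of the lexeme's text. The pure-Python sketch below shows them for one word, assuming the default lengths from the table (`N=1` for the prefix, `N=3` for the suffix). The function names are hypothetical, and `simple_shape` is a simplified stand-in for spaCy's shape feature. The struct itself stores IDs for these strings rather than the strings themselves.

```python
def simple_shape(text):
    """Simplified shape: letters become X or x, digits become d."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and other characters are kept
    return "".join(out)


def lexical_features(text):
    """Collect string-valued counterparts of some LexemeC fields."""
    return {
        "length": len(text),
        "lower": text.lower(),
        "prefix": text[:1],    # default N=1
        "suffix": text[-3:],   # default N=3
        "shape": simple_shape(text),
    }


features = lexical_features("Apple2")
print(features)
```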

Lexeme.get_struct_attr

Get the value of an attribute from the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport IS_ALPHA
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)

Name	Type	Description
`lex`	`const LexemeC*`	A pointer to a `LexemeC` struct.
`feat_name`	`attr_id_t`	The ID of the attribute to look up. The attributes are enumerated in `spacy.attrs`.
RETURNS	`attr_t`	The value of the attribute.

Lexeme.set_struct_attr

Set the value of an attribute of the LexemeC struct by attribute ID.

Example

from spacy.attrs cimport NORM
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)

Name	Type	Description
`lex`	`LexemeC*`	A pointer to a `LexemeC` struct. The pointer must not be `const`, since the struct is modified.
`feat_name`	`attr_id_t`	The ID of the attribute to set. The attributes are enumerated in `spacy.attrs`.
`value`	`attr_t`	The value to set.

Lexeme.c_check_flag

Check the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)

Name	Type	Description
`lexeme`	`const LexemeC*`	A pointer to a `LexemeC` struct.
`flag_id`	`attr_id_t`	The ID of the flag to look up. The flag IDs are enumerated in `spacy.attrs`.
RETURNS	`bint`	The boolean value of the flag.

Lexeme.c_set_flag

Set the value of a binary flag attribute.

Example

from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme

lexeme = doc.c[3].lex
Lexeme.c_set_flag(lexeme, IS_STOP, 0)

Name	Type	Description
`lexeme`	`LexemeC*`	A pointer to a `LexemeC` struct. The pointer must not be `const`, since the struct is modified.
`flag_id`	`attr_id_t`	The ID of the flag to set. The flag IDs are enumerated in `spacy.attrs`.
`value`	`bint`	The value to set.
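
Since `flags` is a bit-field, checking and setting a flag come down to bit operations on one integer. The pure-Python sketch below assumes that each flag ID names a bit position. The ID values and the functions `check_flag` and `set_flag` are hypothetical illustrations, not spaCy's code.

```python
# Hypothetical flag IDs; the real IDs come from spaCy's attribute enum.
IS_ALPHA, IS_STOP = 1, 2


def check_flag(flags, flag_id):
    """Return True if the bit for flag_id is set in the bit-field."""
    return bool(flags & (1 << flag_id))


def set_flag(flags, flag_id, value):
    """Return a new bit-field with the bit for flag_id set or cleared."""
    if value:
        return flags | (1 << flag_id)
    return flags & ~(1 << flag_id)


flags = 0
flags = set_flag(flags, IS_STOP, True)
print(check_flag(flags, IS_STOP), check_flag(flags, IS_ALPHA))
```

Packing the binary attributes into one integer keeps the struct small, and reading a flag costs a single mask operation.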
