spaCy/website/docs/api/cython-classes.md
Ines Montani e597110d31
💫 Update website (#3285)
2019-02-17 19:31:19 +01:00


| title | menu |
| --- | --- |
| Cython Classes | Doc (doc), Token (token), Span (span), Lexeme (lexeme), Vocab (vocab), StringStore (stringstore) |

Doc

The Doc object holds an array of TokenC structs.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Doc.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `c` | `TokenC*` | A pointer to a `TokenC` struct. |
| `length` | `int` | The number of tokens in the document. |
| `max_length` | `int` | The underlying size of the `Doc.c` array. |

Doc.push_back

Append a token to the Doc. The token can be provided as a LexemeC or TokenC pointer, using Cython's fused types.

Example

from spacy.tokens cimport Doc
from spacy.vocab cimport Vocab

doc = Doc(Vocab())
lexeme = doc.vocab.get(doc.vocab.mem, u'hello')
doc.push_back(lexeme, True)
assert doc.text == u'hello '

| Name | Type | Description |
| --- | --- | --- |
| `lex_or_tok` | `LexemeOrToken` | The word to append to the `Doc`. |
| `has_space` | `bint` | Whether the word has trailing whitespace. |
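
The relationship between `length`, `max_length` and `push_back` can be pictured with a plain-Python toy model. This is only a sketch: `ToyDoc`, its tuple layout and the doubling growth policy are hypothetical and do not reflect how spaCy manages the underlying C array.

```python
class ToyDoc:
    """Toy stand-in for a token array with a length and a capacity."""

    def __init__(self, max_length=2):
        # Pre-allocated slots, analogous to the underlying size of Doc.c.
        self.max_length = max_length
        self.c = [None] * max_length
        # Number of slots that actually hold a token.
        self.length = 0

    def push_back(self, word, has_space):
        # Grow the backing list when it is full (doubling is an assumption).
        if self.length == self.max_length:
            self.c.extend([None] * self.max_length)
            self.max_length *= 2
        self.c[self.length] = (word, has_space)
        self.length += 1

    @property
    def text(self):
        # Join the words, adding a space wherever has_space is true.
        return "".join(
            word + (" " if has_space else "")
            for word, has_space in self.c[: self.length]
        )


doc = ToyDoc()
doc.push_back("hello", True)
doc.push_back("world", False)
doc.push_back("!", False)
print(doc.text, doc.length, doc.max_length)
```

In this model, `length` counts the appended tokens while `max_length` only changes when the backing storage has to grow.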

Token

A Cython class providing access and methods for a TokenC struct. Note that the Token object does not own the struct. It only receives a pointer to it.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Token.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `c` | `TokenC*` | A pointer to a `TokenC` struct. |
| `i` | `int` | The offset of the token within the document. |
| `doc` | `Doc` | The parent document. |

Token.cinit

Create a Token object from a TokenC* pointer.

Example

token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)

| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | A reference to the shared `Vocab`. |
| `c` | `TokenC*` | A pointer to a `TokenC` struct. |
| `offset` | `int` | The offset of the token within the document. |
| `doc` | `Doc` | The parent document. |
| **RETURNS** | `Token` | The newly constructed object. |
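
Because a `Token` only receives a pointer and does not own the struct, changes to the underlying data are visible through every `Token` that refers to it. The following plain-Python sketch illustrates that idea. `ToyDoc` and `ToyToken` are hypothetical names, and dicts stand in for `TokenC` structs.

```python
class ToyDoc:
    def __init__(self, words):
        # The document owns the token records (dicts standing in for structs).
        self.c = [{"orth": word} for word in words]


class ToyToken:
    def __init__(self, record, offset, doc):
        # Keep a reference to the record instead of copying it.
        self.c = record
        self.i = offset
        self.doc = doc

    @classmethod
    def cinit(cls, record, offset, doc):
        # Factory that wraps an existing record, loosely mirroring Token.cinit.
        return cls(record, offset, doc)


doc = ToyDoc(["a", "b", "c", "d"])
token = ToyToken.cinit(doc.c[3], 3, doc)

# Mutating the record owned by the document is visible through the token.
doc.c[3]["orth"] = "changed"
print(token.c["orth"])
```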

Span

A Cython class providing access and methods for a slice of a Doc object.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Span.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The parent document. |
| `start` | `int` | The index of the first token of the span. |
| `end` | `int` | The index of the first token after the span. |
| `start_char` | `int` | The index of the first character of the span. |
| `end_char` | `int` | The index of the last character of the span. |
| `label` | `attr_t` | A label to attach to the span, e.g. for named entities. |
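
Since `end` is the index of the first token after the span, `start` and `end` behave like a half-open range. A minimal plain-Python sketch of that convention follows. The `token_offsets` helper and the sample words are hypothetical, and the character offsets are computed from a toy whitespace model rather than from spaCy's structs.

```python
def token_offsets(words, spaces):
    """Return the character offset at which each word starts."""
    offsets = []
    position = 0
    for word, has_space in zip(words, spaces):
        offsets.append(position)
        position += len(word) + (1 if has_space else 0)
    return offsets


words = ["Give", "it", "back", "!"]
spaces = [True, True, False, False]  # text: "Give it back!"
offsets = token_offsets(words, spaces)

start, end = 1, 3                   # tokens 1 and 2; token 3 is excluded
span_words = words[start:end]
start_char = offsets[start]         # offset of the span's first character
print(span_words, start_char)
```

With this convention, `end - start` is the number of tokens in the span.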

Lexeme

A Cython class providing access and methods for an entry in the vocabulary.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Lexeme.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `c` | `LexemeC*` | A pointer to a `LexemeC` struct. |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `orth` | `attr_t` | ID of the verbatim text content. |

Vocab

A Cython class providing access and methods for a vocabulary and other data shared across a language.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Vocab.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `strings` | `StringStore` | A `StringStore` that maps strings to hash values and vice versa. |
| `length` | `int` | The number of entries in the vocabulary. |

Vocab.get

Retrieve a LexemeC* pointer from the vocabulary by the string of the word.

Example

lexeme = vocab.get(vocab.mem, u'hello')

| Name | Type | Description |
| --- | --- | --- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `string` | `unicode` | The string of the word to look up. |
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |

Vocab.get_by_orth

Retrieve a LexemeC* pointer from the vocabulary by the ID of its verbatim text.

Example

lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)

| Name | Type | Description |
| --- | --- | --- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `orth` | `attr_t` | ID of the verbatim text content. |
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
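
The two lookups differ only in their key: `Vocab.get` starts from a string, `Vocab.get_by_orth` from the ID of that string. The plain-Python sketch below models this with a dict. `ToyVocab` and `toy_hash` are hypothetical, BLAKE2b is only a stand-in for whatever hash the library uses, and creating an entry on first lookup is an assumption of the sketch.

```python
import hashlib


def toy_hash(string):
    """Hypothetical 64-bit string hash (a stand-in, not spaCy's hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyVocab:
    def __init__(self):
        # Maps an orth ID to a lexeme record (a dict standing in for LexemeC).
        self.lexemes = {}

    @property
    def length(self):
        return len(self.lexemes)

    def get_by_orth(self, orth):
        # Assumption: create the entry on first lookup, then reuse it.
        if orth not in self.lexemes:
            self.lexemes[orth] = {"orth": orth}
        return self.lexemes[orth]

    def get(self, string):
        # Looking up by string reduces to looking up by the string's ID.
        return self.get_by_orth(toy_hash(string))


vocab = ToyVocab()
by_string = vocab.get("hello")
by_id = vocab.get_by_orth(toy_hash("hello"))
print(by_string is by_id, vocab.length)
```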

StringStore

A lookup table to retrieve strings by 64-bit hashes.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see StringStore.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `StringStore` object is garbage collected. |
| `keys` | `vector[hash_t]` | A list of hash values in the `StringStore`. |
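
As a rough illustration of a lookup table keyed by 64-bit hashes, here is a plain-Python sketch. `ToyStringStore` is hypothetical, its `add` method and indexing behaviour are assumptions of the sketch, and BLAKE2b truncated to 8 bytes merely stands in for the real hash function.

```python
import hashlib


class ToyStringStore:
    def __init__(self):
        # Hash values in insertion order, analogous to the keys attribute.
        self.keys = []
        self._strings = {}

    @staticmethod
    def hash_string(string):
        # Stand-in 64-bit hash (an assumption, not the library's function).
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, string):
        key = self.hash_string(string)
        # Store each string once, even if it is added repeatedly.
        if key not in self._strings:
            self._strings[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, item):
        # Strings map to hashes, hashes map back to strings.
        if isinstance(item, str):
            return self.hash_string(item)
        return self._strings[item]


store = ToyStringStore()
key = store.add("coffee")
store.add("coffee")
print(store[key], len(store.keys))
```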