spaCy/website/docs/api/cython-classes.md
Ines Montani e597110d31
2019-02-17 19:31:19 +01:00


---
title: Cython Classes
menu:
- ['Doc', 'doc']
- ['Token', 'token']
- ['Span', 'span']
- ['Lexeme', 'lexeme']
- ['Vocab', 'vocab']
- ['StringStore', 'stringstore']
---
## Doc {#doc tag="cdef class" source="spacy/tokens/doc.pxd"}
The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc)
structs.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see [`Doc`](/api/doc).
</Infobox>
### Attributes {#doc_attributes}
| Name | Type | Description |
| ------------ | ------------ | ----------------------------------------------------------------------------------------- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `c` | `TokenC*` | A pointer to the array of [`TokenC`](/api/cython-structs#tokenc) structs. |
| `length` | `int` | The number of tokens in the document. |
| `max_length` | `int` | The underlying size of the `Doc.c` array. |
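
The difference between `length` and `max_length` is the usual split between slots in use and slots allocated. The pure-Python sketch below illustrates that split; `ToyBuffer` is a hypothetical stand-in, and the doubling growth policy is an assumption of this sketch rather than something this page states about `Doc`.

```python
class ToyBuffer:
    """Hypothetical growable array with separate length and capacity."""

    def __init__(self, capacity=2):
        self.max_length = max(1, capacity)   # slots allocated
        self.length = 0                      # slots in use
        self.slots = [None] * self.max_length

    def append(self, item):
        if self.length == self.max_length:
            # Assumed growth policy: double the allocation when full.
            self.max_length *= 2
            self.slots.extend([None] * (self.max_length - len(self.slots)))
        self.slots[self.length] = item
        self.length += 1


buf = ToyBuffer(capacity=2)
for word in ["a", "b", "c"]:
    buf.append(word)
assert buf.length == 3       # three items stored
assert buf.max_length == 4   # but four slots allocated
```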
### Doc.push_back {#doc_push_back tag="method"}
Append a token to the `Doc`. The token can be provided as a
[`LexemeC`](/api/cython-structs#lexemec) or
[`TokenC`](/api/cython-structs#tokenc) pointer, using Cython's
[fused types](http://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html).
> #### Example
>
> ```python
> from spacy.tokens cimport Doc
> from spacy.vocab cimport Vocab
>
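> doc = Doc(Vocab())
> # Vocab.get takes the memory pool as its first argument
> lexeme = doc.vocab.get(doc.vocab.mem, u'hello')
> # True: the token is followed by whitespace
> doc.push_back(lexeme, True)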
> assert doc.text == u'hello '
> ```
| Name | Type | Description |
| ------------ | --------------- | ----------------------------------------- |
| `lex_or_tok` | `LexemeOrToken` | The word to append to the `Doc`. |
| `has_space` | `bint` | Whether the word has trailing whitespace. |
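
To make the effect of `has_space` concrete, here is a minimal pure-Python sketch that models appending words and rebuilding the text from them. `ToyDoc` and its fields are hypothetical and only illustrate the idea; the real `Doc` stores `TokenC` structs in C memory.

```python
class ToyDoc:
    """Hypothetical pure-Python stand-in for a token array (not spaCy's Doc)."""

    def __init__(self):
        # Each entry holds the two pieces of information push_back receives.
        self.tokens = []

    def push_back(self, word, has_space):
        # Append the word and remember whether whitespace follows it.
        self.tokens.append((word, has_space))

    @property
    def length(self):
        return len(self.tokens)

    @property
    def text(self):
        # Rebuild the text: every word, plus one space if has_space is set.
        return "".join(w + (" " if sp else "") for w, sp in self.tokens)


toy = ToyDoc()
toy.push_back("hello", True)
toy.push_back("world", False)
assert toy.text == "hello world"
assert toy.length == 2
```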
## Token {#token tag="cdef class" source="spacy/tokens/token.pxd"}
A Cython class providing access and methods for a
[`TokenC`](/api/cython-structs#tokenc) struct. Note that the `Token` object does
not own the struct. It only receives a pointer to it.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see [`Token`](/api/token).
</Infobox>
### Attributes {#token_attributes}
| Name | Type | Description |
| ------- | --------- | ------------------------------------------------------------- |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. |
| `i` | `int` | The offset of the token within the document. |
| `doc` | `Doc` | The parent document. |
### Token.cinit {#token_cinit tag="method"}
Create a `Token` object from a `TokenC*` pointer.
> #### Example
>
> ```python
> # Argument order: vocab, TokenC* pointer, offset, parent doc
> token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)
> ```
| Name | Type | Description |
| ----------- | --------- | ------------------------------------------------------------ |
| `vocab` | `Vocab` | A reference to the shared `Vocab`. |
| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. |
| `offset` | `int` | The offset of the token within the document. |
| `doc` | `Doc` | The parent document. |
| **RETURNS** | `Token` | The newly constructed object. |
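
Because a `Token` only receives a pointer and does not own the struct, it behaves like a view onto the parent's storage. The pure-Python sketch below mimics that with a shared list reference; `ToyToken` and the dict-based slots are hypothetical and stand in for the `TokenC*` pointer.

```python
class ToyToken:
    """Hypothetical view onto one slot of a parent's token array."""

    def __init__(self, doc_slots, i):
        self.doc_slots = doc_slots   # shared reference, not a copy
        self.i = i                   # offset of the token within the parent

    @property
    def orth(self):
        # Always read through to the parent's storage.
        return self.doc_slots[self.i]["orth"]


slots = [{"orth": "hello"}, {"orth": "world"}]
view = ToyToken(slots, 1)
slots[1]["orth"] = "there"     # change made through the parent
assert view.orth == "there"    # is visible through the view
```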
## Span {#span tag="cdef class" source="spacy/tokens/span.pxd"}
A Cython class providing access and methods for a slice of a `Doc` object.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see [`Span`](/api/span).
</Infobox>
### Attributes {#span_attributes}
| Name | Type | Description |
| ------------ | -------------------------------------- | ------------------------------------------------------- |
| `doc` | `Doc` | The parent document. |
| `start` | `int` | The index of the first token of the span. |
| `end` | `int` | The index of the first token after the span. |
| `start_char` | `int` | The index of the first character of the span. |
| `end_char` | `int` | The index of the first character after the span. |
| `label` | <Abbr title="uint64_t">`attr_t`</Abbr> | A label to attach to the span, e.g. for named entities. |
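
The token and character boundaries can be worked through on a small example. The sketch below uses plain Python lists rather than spaCy objects, and it treats both `end` and `end_char` as exclusive offsets; the exclusive reading of `end_char` is an assumption of this sketch.

```python
words = ["Hello", "big", "world", "!"]
spaces = [True, True, False, False]   # whether each word has trailing whitespace

text = "".join(w + (" " if s else "") for w, s in zip(words, spaces))

# Character offset at which each token starts.
starts = []
pos = 0
for w, s in zip(words, spaces):
    starts.append(pos)
    pos += len(w) + (1 if s else 0)

start, end = 1, 3                      # tokens "big" and "world"; end is exclusive
start_char = starts[start]
end_char = starts[end - 1] + len(words[end - 1])

assert words[start:end] == ["big", "world"]
assert text[start_char:end_char] == "big world"
```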
## Lexeme {#lexeme tag="cdef class" source="spacy/lexeme.pxd"}
A Cython class providing access and methods for an entry in the vocabulary.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see [`Lexeme`](/api/lexeme).
</Infobox>
### Attributes {#lexeme_attributes}
| Name | Type | Description |
| ------- | -------------------------------------- | --------------------------------------------------------------- |
| `c` | `LexemeC*` | A pointer to a [`LexemeC`](/api/cython-structs#lexemec) struct. |
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
| `orth` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the verbatim text content. |
## Vocab {#vocab tag="cdef class" source="spacy/vocab.pxd"}
A Cython class providing access and methods for a vocabulary and other data
shared across a language.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see [`Vocab`](/api/vocab).
</Infobox>
### Attributes {#vocab_attributes}
| Name | Type | Description |
| --------- | ------------- | ------------------------------------------------------------------------------------------- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `strings` | `StringStore` | A `StringStore` that maps strings to hash values and vice versa. |
| `length` | `int` | The number of entries in the vocabulary. |
### Vocab.get {#vocab_get tag="method"}
Retrieve a [`LexemeC*`](/api/cython-structs#lexemec) pointer from the
vocabulary, looking the entry up by its string.
> #### Example
>
> ```python
> lexeme = vocab.get(vocab.mem, u'hello')
> ```
| Name | Type | Description |
| ----------- | ---------------- | ------------------------------------------------------------------------------------------- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `string` | unicode | The string of the word to look up. |
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
### Vocab.get_by_orth {#vocab_get_by_orth tag="method"}
Retrieve a [`LexemeC*`](/api/cython-structs#lexemec) pointer from the
vocabulary, looking the entry up by its `orth` ID.
> #### Example
>
> ```python
> # Pass the memory pool first, then the ID to look up
> lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)
> ```
| Name | Type | Description |
| ----------- | -------------------------------------- | ------------------------------------------------------------------------------------------- |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
| `orth` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the verbatim text content. |
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
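
The two lookups differ only in the key they take: a string for `Vocab.get`, an ID for `Vocab.get_by_orth`. The pure-Python sketch below models that relationship. `ToyVocab` and `toy_hash` are hypothetical, the hash function is a stand-in chosen for this sketch, and creating missing entries on demand is an assumption that the tables above do not state.

```python
import hashlib


def toy_hash(string):
    # Stand-in 64-bit hash, chosen for this sketch only.
    raw = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(raw, "little")


class ToyVocab:
    """Hypothetical vocabulary keyed by 64-bit IDs (not spaCy's Vocab)."""

    def __init__(self):
        self._by_orth = {}   # orth ID -> lexeme record

    def get(self, string):
        # Look up by string: hash it, then reuse the ID-based lookup.
        lex = self.get_by_orth(toy_hash(string))
        lex.setdefault("text", string)
        return lex

    def get_by_orth(self, orth):
        # Assumption: unseen IDs get a fresh record instead of an error.
        return self._by_orth.setdefault(orth, {"orth": orth})

    @property
    def length(self):
        return len(self._by_orth)


vocab = ToyVocab()
lex = vocab.get("hello")
assert vocab.get_by_orth(lex["orth"]) is lex   # same entry via either key
assert vocab.length == 1
```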
## StringStore {#stringstore tag="cdef class" source="spacy/strings.pxd"}
A lookup table to retrieve strings by 64-bit hashes.
<Infobox variant="warning">
This section documents the extra C-level attributes and methods that can't be
accessed from Python. For the Python documentation, see
[`StringStore`](/api/stringstore).
</Infobox>
### Attributes {#stringstore_attributes}
| Name | Type | Description |
| ------ | ------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `StringStore` object is garbage collected. |
| `keys` | <Abbr title="vector[uint64_t]">`vector[hash_t]`</Abbr> | A list of hash values in the `StringStore`. |
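
A lookup table from 64-bit hashes to strings can be sketched in a few lines of pure Python. `ToyStringStore` and `toy_hash` below are hypothetical; the hash is a stand-in from the standard library and is not claimed to match the function spaCy uses.

```python
import hashlib


def toy_hash(string):
    # Stand-in 64-bit hash, chosen for this sketch only.
    raw = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(raw, "little")


class ToyStringStore:
    """Hypothetical hash-to-string table (not spaCy's StringStore)."""

    def __init__(self):
        self.keys = []        # hash values, in insertion order
        self._by_hash = {}    # hash -> string

    def add(self, string):
        key = toy_hash(string)
        if key not in self._by_hash:
            self._by_hash[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, key):
        # Retrieve the string for a 64-bit hash.
        return self._by_hash[key]


store = ToyStringStore()
key = store.add("hello")
assert 0 <= key < 2**64
assert store[key] == "hello"
assert store.add("hello") == key and len(store.keys) == 1
```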