---
title: Cython Structs
teaser: C-language objects that let you group variables together
next: /api/cython-classes
menu:
  - ['TokenC', 'tokenc']
  - ['LexemeC', 'lexemec']
---

## TokenC {#tokenc tag="C struct" source="spacy/structs.pxd"}

Cython data container for the `Token` object.

> #### Example
>
> ```python
> token = &doc.c[3]
> token_ptr = &doc.c[3]
> ```

| Name         | Type                                   | Description |
| ------------ | -------------------------------------- | ----------- |
| `lex`        | `const LexemeC*`                       | A pointer to the lexeme for the token. |
| `morph`      | `uint64_t`                             | An ID allowing lookup of morphological attributes. |
| `pos`        | `univ_pos_t`                           | Coarse-grained part-of-speech tag. |
| `spacy`      | `bint`                                 | A binary value indicating whether the token has trailing whitespace. |
| `tag`        | <Abbr title="uint64_t">`attr_t`</Abbr> | Fine-grained part-of-speech tag. |
| `idx`        | `int`                                  | The character offset of the token within the parent document. |
| `lemma`      | <Abbr title="uint64_t">`attr_t`</Abbr> | Base form of the token, with no inflectional suffixes. |
| `sense`      | <Abbr title="uint64_t">`attr_t`</Abbr> | Space for storing a word sense ID, currently unused. |
| `head`       | `int`                                  | Offset of the syntactic parent relative to the token. |
| `dep`        | <Abbr title="uint64_t">`attr_t`</Abbr> | Syntactic dependency relation. |
| `l_kids`     | `uint32_t`                             | Number of left children. |
| `r_kids`     | `uint32_t`                             | Number of right children. |
| `l_edge`     | `uint32_t`                             | Offset of the leftmost token of this token's syntactic descendants. |
| `r_edge`     | `uint32_t`                             | Offset of the rightmost token of this token's syntactic descendants. |
| `sent_start` | `int`                                  | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, `0`, is interpreted as no sentence break. Sentence boundary detectors will usually set `0` for all tokens except tokens that follow a sentence boundary. |
| `ent_iob`    | `int`                                  | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`. |
| `ent_type`   | <Abbr title="uint64_t">`attr_t`</Abbr> | Named entity type. |
| `ent_id`     | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |

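Several `TokenC` fields store small integer codes rather than readable values. The following is a minimal pure-Python sketch of how those codes could be interpreted, based only on the descriptions in the table above. The helper names `IOB_CODES`, `decode_sent_start` and `head_index` are hypothetical and not part of spaCy.

```python
# Illustrative model only: these helpers are hypothetical, not spaCy API.

# ent_iob codes as listed in the table; 0 means the value is missing.
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def decode_sent_start(value):
    """Map the ternary sent_start code to True, False or None (missing)."""
    return {1: True, -1: False, 0: None}[value]


def head_index(i, head_offset):
    """Resolve a head offset, stored relative to token i, to an absolute index."""
    return i + head_offset


iob = IOB_CODES[3]
starts_sentence = decode_sent_start(-1)
parent = head_index(3, -2)
```

Because `head` is relative, a value of `0` means the token is its own head, which is how the root of a parse is represented in this sketch.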
### Token.get_struct_attr {#token_get_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}

Get the value of an attribute from the `TokenC` struct by attribute ID.

> #### Example
>
> ```python
> from spacy.attrs cimport IS_ALPHA
> from spacy.tokens cimport Token
>
> is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
> ```

| Name        | Type                                   | Description                                                                            |
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
| `token`     | `const TokenC*`                        | A pointer to a `TokenC` struct.                                                        |
| `feat_name` | `attr_id_t`                            | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
| **RETURNS** | <Abbr title="uint64_t">`attr_t`</Abbr> | The value of the attribute.                                                            |

### Token.set_struct_attr {#token_set_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}

Set the value of an attribute of the `TokenC` struct by attribute ID.

> #### Example
>
> ```python
> from spacy.attrs cimport TAG
> from spacy.tokens cimport Token
>
> token = &doc.c[3]
> Token.set_struct_attr(token, TAG, 0)
> ```

| Name        | Type                                   | Description                                                                        |
| ----------- | -------------------------------------- | ---------------------------------------------------------------------------------- |
| `token`     | `const TokenC*`                        | A pointer to a `TokenC` struct.                                                    |
| `feat_name` | `attr_id_t`                            | The ID of the attribute to set. The attributes are enumerated in `spacy.typedefs`. |
| `value`     | <Abbr title="uint64_t">`attr_t`</Abbr> | The value to set.                                                                  |

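Reading and writing by attribute ID amounts to dispatching from an integer ID to a struct field. The sketch below models that idea in plain Python. The class `TokenModel`, the ID constants and the fallback of returning `0` for an unknown ID are all assumptions made for illustration; the real IDs come from `spacy.attrs` and the real functions operate on a `TokenC*`.

```python
from dataclasses import dataclass

# Hypothetical attribute IDs for this sketch; real IDs come from spacy.attrs.
TAG, LEMMA, DEP = range(3)
FIELD_BY_ATTR = {TAG: "tag", LEMMA: "lemma", DEP: "dep"}


@dataclass
class TokenModel:
    """Stand-in for a few integer-valued TokenC fields."""
    tag: int = 0
    lemma: int = 0
    dep: int = 0


def get_struct_attr(token, feat_name):
    """Return the field selected by feat_name, or 0 for an unknown ID (assumed)."""
    field = FIELD_BY_ATTR.get(feat_name)
    return getattr(token, field) if field is not None else 0


def set_struct_attr(token, feat_name, value):
    """Write value to the field selected by feat_name; ignore unknown IDs (assumed)."""
    field = FIELD_BY_ATTR.get(feat_name)
    if field is not None:
        setattr(token, field, value)


token = TokenModel()
set_struct_attr(token, TAG, 42)
tag_value = get_struct_attr(token, TAG)
```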
### token_by_start {#token_by_start tag="function" source="spacy/tokens/doc.pxd"}

Find a token in a `TokenC*` array by the offset of its first character.

> #### Example
>
> ```python
> from spacy.tokens.doc cimport Doc, token_by_start
> from spacy.vocab cimport Vocab
>
> doc = Doc(Vocab(), words=["hello", "world"])
> assert token_by_start(doc.c, doc.length, 6) == 1
> assert token_by_start(doc.c, doc.length, 4) == -1
> ```

| Name         | Type            | Description                                               |
| ------------ | --------------- | --------------------------------------------------------- |
| `tokens`     | `const TokenC*` | A `TokenC*` array.                                        |
| `length`     | `int`           | The number of tokens in the array.                        |
| `start_char` | `int`           | The start index to search for.                            |
| **RETURNS**  | `int`           | The index of the token in the array or `-1` if not found. |

### token_by_end {#token_by_end tag="function" source="spacy/tokens/doc.pxd"}

Find a token in a `TokenC*` array by the offset of its final character.

> #### Example
>
> ```python
> from spacy.tokens.doc cimport Doc, token_by_end
> from spacy.vocab cimport Vocab
>
> doc = Doc(Vocab(), words=["hello", "world"])
> assert token_by_end(doc.c, doc.length, 5) == 0
> assert token_by_end(doc.c, doc.length, 1) == -1
> ```

| Name        | Type            | Description                                               |
| ----------- | --------------- | --------------------------------------------------------- |
| `tokens`    | `const TokenC*` | A `TokenC*` array.                                        |
| `length`    | `int`           | The number of tokens in the array.                        |
| `end_char`  | `int`           | The end index to search for.                              |
| **RETURNS** | `int`           | The index of the token in the array or `-1` if not found. |

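The two lookups can be modelled in plain Python as a scan over character spans. This is only a sketch of the documented behaviour, not spaCy's implementation: each token is represented as a hypothetical `(idx, length)` pair, where `idx` is the character offset and `length` the number of characters, and the end offset is assumed to be `idx + length`.

```python
def token_by_start(tokens, start_char):
    """Return the index of the token starting at start_char, or -1."""
    for i, (idx, _length) in enumerate(tokens):
        if idx == start_char:
            return i
    return -1


def token_by_end(tokens, end_char):
    """Return the index of the token ending at end_char, or -1."""
    for i, (idx, length) in enumerate(tokens):
        if idx + length == end_char:
            return i
    return -1


# "hello world": "hello" spans characters 0-5, "world" spans 6-11.
tokens = [(0, 5), (6, 5)]
found_start = token_by_start(tokens, 6)
found_end = token_by_end(tokens, 5)
```

Offsets that fall inside a token rather than on its boundary, such as `4` or `1` here, yield `-1`, matching the assertions in the examples above.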
### set_children_from_heads {#set_children_from_heads tag="function" source="spacy/tokens/doc.pxd"}

Set attributes that allow lookup of syntactic children on a `TokenC*` array.
This function must be called after making changes to the `TokenC.head`
attribute, in order to make the parse tree navigation consistent.

> #### Example
>
> ```python
> from spacy.tokens.doc cimport Doc, set_children_from_heads
> from spacy.vocab cimport Vocab
>
> doc = Doc(Vocab(), words=["Baileys", "from", "a", "shoe"])
> # Heads are offsets relative to each token; 0 marks the root
> doc.c[0].head = 0
> doc.c[1].head = -1
> doc.c[2].head = 1
> doc.c[3].head = -2
> set_children_from_heads(doc.c, doc.length)
> assert doc.c[3].l_kids == 1
> ```

| Name     | Type            | Description                        |
| -------- | --------------- | ---------------------------------- |
| `tokens` | `const TokenC*` | A `TokenC*` array.                 |
| `length` | `int`           | The number of tokens in the array. |

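To make the effect of this function concrete, here is a pure-Python sketch that derives child counts and subtree edges from a list of relative head offsets. It is an illustrative model, not spaCy's implementation: the function name `children_from_heads` is hypothetical, and the edges are stored as absolute token indices, which is an assumption of this sketch.

```python
def children_from_heads(heads):
    """Derive l_kids, r_kids, l_edge and r_edge from relative head offsets."""
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token starts as its own subtree
    r_edge = list(range(n))
    for i, offset in enumerate(heads):
        parent = i + offset
        if parent > i:
            l_kids[parent] += 1  # token i sits to the left of its parent
        elif parent < i:
            r_kids[parent] += 1  # token i sits to the right of its parent
    for i in range(n):
        j = i
        # Walk up to the root, widening each ancestor's span to include i.
        # The step limit guards against cycles in malformed input.
        for _ in range(n):
            if heads[j] == 0:
                break
            j += heads[j]
            l_edge[j] = min(l_edge[j], i)
            r_edge[j] = max(r_edge[j], i)
    return l_kids, r_kids, l_edge, r_edge


# "Baileys from a shoe": Baileys is the root, from -> Baileys,
# a -> shoe, shoe -> from.
l_kids, r_kids, l_edge, r_edge = children_from_heads([0, -1, 1, -2])
```

In this model "shoe" has one left child ("a"), and the root's subtree spans the whole sentence.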
## LexemeC {#lexemec tag="C struct" source="spacy/structs.pxd"}

Struct holding information about a lexical type. `LexemeC` structs are usually
owned by the `Vocab`, and accessed through a read-only pointer on the `TokenC`
struct.

> #### Example
>
> ```python
> lex = doc.c[3].lex
> ```

| Name        | Type                                    | Description                                                                                                                |
| ----------- | --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `flags`     | <Abbr title="uint64_t">`flags_t`</Abbr> | Bit-field for binary lexical flag values.                                                                                  |
| `id`        | <Abbr title="uint64_t">`attr_t`</Abbr>  | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
| `length`    | <Abbr title="uint64_t">`attr_t`</Abbr>  | Number of unicode characters in the lexeme.                                                                                |
| `orth`      | <Abbr title="uint64_t">`attr_t`</Abbr>  | ID of the verbatim text content.                                                                                           |
| `lower`     | <Abbr title="uint64_t">`attr_t`</Abbr>  | ID of the lowercase form of the lexeme.                                                                                    |
| `norm`      | <Abbr title="uint64_t">`attr_t`</Abbr>  | ID of the lexeme's norm, i.e. a normalized form of the text.                                                               |
| `shape`     | <Abbr title="uint64_t">`attr_t`</Abbr>  | Transform of the lexeme's string, to show orthographic features.                                                           |
| `prefix`    | <Abbr title="uint64_t">`attr_t`</Abbr>  | Length-N substring from the start of the lexeme. Defaults to `N=1`.                                                        |
| `suffix`    | <Abbr title="uint64_t">`attr_t`</Abbr>  | Length-N substring from the end of the lexeme. Defaults to `N=3`.                                                          |

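The string-derived fields in this table can be illustrated with a short pure-Python sketch. Note that the struct stores integer IDs for these strings, while the sketch works on the strings themselves. The function names are hypothetical, and the shape rule used here (letters become `x` or `X`, digits become `d`, runs are capped at four characters) is an assumption for illustration rather than a statement of the exact transform spaCy applies.

```python
def word_shape(text):
    """Map letters to x/X and digits to d, keeping at most 4 repeats in a row."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


def lexical_features(text, prefix_len=1, suffix_len=3):
    """Compute the string values behind several LexemeC fields."""
    return {
        "length": len(text),
        "lower": text.lower(),
        "shape": word_shape(text),
        "prefix": text[:prefix_len],   # N=1 by default, as in the table
        "suffix": text[-suffix_len:],  # N=3 by default, as in the table
    }


features = lexical_features("Apple")
```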
### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

Get the value of an attribute from the `LexemeC` struct by attribute ID.

> #### Example
>
> ```python
> from spacy.attrs cimport IS_ALPHA
> from spacy.lexeme cimport Lexeme
>
> lexeme = doc.c[3].lex
> is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
> ```

| Name        | Type                                   | Description                                                                            |
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
| `lex`       | `const LexemeC*`                       | A pointer to a `LexemeC` struct.                                                       |
| `feat_name` | `attr_id_t`                            | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
| **RETURNS** | <Abbr title="uint64_t">`attr_t`</Abbr> | The value of the attribute.                                                            |

### Lexeme.set_struct_attr {#lexeme_set_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

Set the value of an attribute of the `LexemeC` struct by attribute ID.

> #### Example
>
> ```python
> from spacy.attrs cimport NORM
> from spacy.lexeme cimport Lexeme
>
> lexeme = doc.c[3].lex
> Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
> ```

| Name        | Type                                   | Description                                                                        |
| ----------- | -------------------------------------- | ---------------------------------------------------------------------------------- |
| `lex`       | `const LexemeC*`                       | A pointer to a `LexemeC` struct.                                                   |
| `feat_name` | `attr_id_t`                            | The ID of the attribute to set. The attributes are enumerated in `spacy.typedefs`. |
| `value`     | <Abbr title="uint64_t">`attr_t`</Abbr> | The value to set.                                                                  |

### Lexeme.c_check_flag {#lexeme_c_check_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

Check the value of a binary flag attribute.

> #### Example
>
> ```python
> from spacy.attrs cimport IS_STOP
> from spacy.lexeme cimport Lexeme
>
> lexeme = doc.c[3].lex
> is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
> ```

| Name        | Type             | Description                                                                     |
| ----------- | ---------------- | ------------------------------------------------------------------------------- |
| `lexeme`    | `const LexemeC*` | A pointer to a `LexemeC` struct.                                                |
| `flag_id`   | `attr_id_t`      | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. |
| **RETURNS** | `bint`           | The boolean value of the flag.                                                  |

### Lexeme.c_set_flag {#lexeme_c_set_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

Set the value of a binary flag attribute.

> #### Example
>
> ```python
> from spacy.attrs cimport IS_STOP
> from spacy.lexeme cimport Lexeme
>
> lexeme = doc.c[3].lex
> Lexeme.c_set_flag(lexeme, IS_STOP, 0)
> ```

| Name      | Type             | Description                                                                 |
| --------- | ---------------- | --------------------------------------------------------------------------- |
| `lexeme`  | `const LexemeC*` | A pointer to a `LexemeC` struct.                                            |
| `flag_id` | `attr_id_t`      | The ID of the flag to set. The flag IDs are enumerated in `spacy.typedefs`. |
| `value`   | `bint`           | The value to set.                                                           |
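
Since `LexemeC.flags` is described as a bit-field, checking and setting a flag can be modelled with bitwise operations. The sketch below assumes that a flag ID selects the bit at that position, which is an assumption for illustration. The flag IDs shown are hypothetical placeholders, and unlike the Cython functions, which modify the struct in place, these helpers return a new integer.

```python
# Hypothetical flag IDs for this sketch; real IDs come from spacy.attrs.
IS_ALPHA_BIT = 1
IS_STOP_BIT = 5


def check_flag(flags, flag_id):
    """Return True if the bit at position flag_id is set."""
    return bool(flags & (1 << flag_id))


def set_flag(flags, flag_id, value):
    """Return flags with the bit at position flag_id set or cleared."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


flags = 0
flags = set_flag(flags, IS_STOP_BIT, True)
is_stop = check_flag(flags, IS_STOP_BIT)
is_alpha = check_flag(flags, IS_ALPHA_BIT)
cleared = set_flag(flags, IS_STOP_BIT, False)
```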