mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 01:48:04 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			254 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			254 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						|
title: Cython Structs
 | 
						|
teaser: C-language objects that let you group variables together
 | 
						|
next: /api/cython-classes
 | 
						|
menu:
 | 
						|
  - ['TokenC', 'tokenc']
 | 
						|
  - ['LexemeC', 'lexemec']
 | 
						|
---
 | 
						|
 | 
						|
## TokenC {#tokenc tag="C struct" source="spacy/structs.pxd"}
 | 
						|
 | 
						|
Cython data container for the `Token` object.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> token = &doc.c[3]
 | 
						|
> token_ptr = &doc.c[3]
 | 
						|
> ```
 | 
						|
 | 
						|
| Name         | Description                                                                                                                                                                                                                                                                                                                                 |
 | 
						|
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
						|
| `lex`        | A pointer to the lexeme for the token. ~~const LexemeC\*~~                                                                                                                                                                                                                                                                                  |
 | 
						|
| `morph`      | An ID allowing lookup of morphological attributes. ~~uint64_t~~                                                                                                                                                                                                                                                                             |
 | 
						|
| `pos`        | Coarse-grained part-of-speech tag. ~~univ_pos_t~~                                                                                                                                                                                                                                                                                           |
 | 
						|
| `spacy`      | A binary value indicating whether the token has trailing whitespace. ~~bint~~                                                                                                                                                                                                                                                               |
 | 
						|
| `tag`        | Fine-grained part-of-speech tag. ~~attr_t (uint64_t)~~                                                                                                                                                                                                                                                                                      |
 | 
						|
| `idx`        | The character offset of the token within the parent document. ~~int~~                                                                                                                                                                                                                                                                       |
 | 
						|
| `lemma`      | Base form of the token, with no inflectional suffixes. ~~attr_t (uint64_t)~~                                                                                                                                                                                                                                                                |
 | 
						|
| `sense`      | Space for storing a word sense ID, currently unused. ~~attr_t (uint64_t)~~                                                                                                                                                                                                                                                                  |
 | 
						|
| `head`       | Offset of the syntactic parent relative to the token. ~~int~~                                                                                                                                                                                                                                                                               |
 | 
						|
| `dep`        | Syntactic dependency relation. ~~attr_t (uint64_t)~~                                                                                                                                                                                                                                                                                        |
 | 
						|
| `l_kids`     | Number of left children. ~~uint32_t~~                                                                                                                                                                                                                                                                                                       |
 | 
						|
| `r_kids`     | Number of right children. ~~uint32_t~~                                                                                                                                                                                                                                                                                                      |
 | 
						|
| `l_edge`     | Offset of the leftmost token of this token's syntactic descendants. ~~uint32_t~~                                                                                                                                                                                                                                                            |
 | 
						|
| `r_edge`     | Offset of the rightmost token of this token's syntactic descendants. ~~uint32_t~~                                                                                                                                                                                                                                                           |
 | 
						|
| `sent_start` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. ~~int~~ |
 | 
						|
| `ent_iob`    | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `0` and `3` indicates `B`. ~~int~~                                                                                                                                                                                                            |
 | 
						|
| `ent_type`   | Named entity type. ~~attr_t (uint64_t)~~                                                                                                                                                                                                                                                                                                    |
 | 
						|
| `ent_id`     | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~attr_t (uint64_t)~~                                                                                                                                                                                                 |
 | 
						|
 | 
						|
### Token.get_struct_attr {#token_get_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
 | 
						|
 | 
						|
Get the value of an attribute from the `TokenC` struct by attribute ID.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport IS_ALPHA
 | 
						|
> from spacy.tokens cimport Token
 | 
						|
>
 | 
						|
> is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                                                          |
 | 
						|
| ----------- | ---------------------------------------------------------------------------------------------------- |
 | 
						|
| `token`     | A pointer to a `TokenC` struct. ~~const TokenC\*~~                                                   |
 | 
						|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| **RETURNS** | The value of the attribute. ~~attr_t (uint64_t)~~                                                    |
 | 
						|
 | 
						|
### Token.set_struct_attr {#token_set_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
 | 
						|
 | 
						|
Set the value of an attribute of the `TokenC` struct by attribute ID.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport TAG
 | 
						|
> from spacy.tokens cimport Token
 | 
						|
>
 | 
						|
> token = &doc.c[3]
 | 
						|
> Token.set_struct_attr(token, TAG, 0)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                                                          |
 | 
						|
| ----------- | ---------------------------------------------------------------------------------------------------- |
 | 
						|
| `token`     | A pointer to a `TokenC` struct. ~~const TokenC\*~~                                                   |
 | 
						|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| `value`     | The value to set. ~~attr_t (uint64_t)~~                                                              |
 | 
						|
 | 
						|
### token_by_start {#token_by_start tag="function" source="spacy/tokens/doc.pxd"}
 | 
						|
 | 
						|
Find a token in a `TokenC*` array by the offset of its first character.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.tokens.doc cimport Doc, token_by_start
 | 
						|
> from spacy.vocab cimport Vocab
 | 
						|
>
 | 
						|
> doc = Doc(Vocab(), words=["hello", "world"])
 | 
						|
> assert token_by_start(doc.c, doc.length, 6) == 1
 | 
						|
> assert token_by_start(doc.c, doc.length, 4) == -1
 | 
						|
> ```
 | 
						|
 | 
						|
| Name         | Description                                                       |
 | 
						|
| ------------ | ----------------------------------------------------------------- |
 | 
						|
| `tokens`     | A `TokenC*` array. ~~const TokenC\*~~                             |
 | 
						|
| `length`     | The number of tokens in the array. ~~int~~                        |
 | 
						|
| `start_char` | The start index to search for. ~~int~~                            |
 | 
						|
| **RETURNS**  | The index of the token in the array or `-1` if not found. ~~int~~ |
 | 
						|
 | 
						|
### token_by_end {#token_by_end tag="function" source="spacy/tokens/doc.pxd"}
 | 
						|
 | 
						|
Find a token in a `TokenC*` array by the offset of its final character.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.tokens.doc cimport Doc, token_by_end
 | 
						|
> from spacy.vocab cimport Vocab
 | 
						|
>
 | 
						|
> doc = Doc(Vocab(), words=["hello", "world"])
 | 
						|
> assert token_by_end(doc.c, doc.length, 5) == 0
 | 
						|
> assert token_by_end(doc.c, doc.length, 1) == -1
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                       |
 | 
						|
| ----------- | ----------------------------------------------------------------- |
 | 
						|
| `tokens`    | A `TokenC*` array. ~~const TokenC\*~~                             |
 | 
						|
| `length`    | The number of tokens in the array. ~~int~~                        |
 | 
						|
| `end_char`  | The end index to search for. ~~int~~                              |
 | 
						|
| **RETURNS** | The index of the token in the array or `-1` if not found. ~~int~~ |
 | 
						|
 | 
						|
### set_children_from_heads {#set_children_from_heads tag="function" source="spacy/tokens/doc.pxd"}
 | 
						|
 | 
						|
Set attributes that allow lookup of syntactic children on a `TokenC*` array.
 | 
						|
This function must be called after making changes to the `TokenC.head`
 | 
						|
attribute, in order to make the parse tree navigation consistent.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.tokens.doc cimport Doc, set_children_from_heads
 | 
						|
> from spacy.vocab cimport Vocab
 | 
						|
>
 | 
						|
> doc = Doc(Vocab(), words=["Baileys", "from", "a", "shoe"])
 | 
						|
> doc.c[0].head = 0
 | 
						|
> doc.c[1].head = 0
 | 
						|
> doc.c[2].head = 3
 | 
						|
> doc.c[3].head = 1
 | 
						|
> set_children_from_heads(doc.c, doc.length)
 | 
						|
> assert doc.c[3].l_kids == 1
 | 
						|
> ```
 | 
						|
 | 
						|
| Name     | Description                                |
 | 
						|
| -------- | ------------------------------------------ |
 | 
						|
| `tokens` | A `TokenC*` array. ~~const TokenC\*~~      |
 | 
						|
| `length` | The number of tokens in the array. ~~int~~ |
 | 
						|
 | 
						|
## LexemeC {#lexemec tag="C struct" source="spacy/structs.pxd"}
 | 
						|
 | 
						|
Struct holding information about a lexical type. `LexemeC` structs are usually
 | 
						|
owned by the `Vocab`, and accessed through a read-only pointer on the `TokenC`
 | 
						|
struct.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> lex = doc.c[3].lex
 | 
						|
> ```
 | 
						|
 | 
						|
| Name     | Description                                                                                                                                      |
 | 
						|
| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
 | 
						|
| `flags`  | Bit-field for binary lexical flag values. ~~flags_t (uint64_t)~~                                                                                 |
 | 
						|
| `id`     | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. ~~attr_t (uint64_t)~~ |
 | 
						|
| `length` | Number of unicode characters in the lexeme. ~~attr_t (uint64_t)~~                                                                                |
 | 
						|
| `orth`   | ID of the verbatim text content. ~~attr_t (uint64_t)~~                                                                                           |
 | 
						|
| `lower`  | ID of the lowercase form of the lexeme. ~~attr_t (uint64_t)~~                                                                                    |
 | 
						|
| `norm`   | ID of the lexeme's norm, i.e. a normalized form of the text. ~~attr_t (uint64_t)~~                                                               |
 | 
						|
| `shape`  | Transform of the lexeme's string, to show orthographic features. ~~attr_t (uint64_t)~~                                                           |
 | 
						|
| `prefix` | Length-N substring from the start of the lexeme. Defaults to `N=1`. ~~attr_t (uint64_t)~~                                                        |
 | 
						|
| `suffix` | Length-N substring from the end of the lexeme. Defaults to `N=3`. ~~attr_t (uint64_t)~~                                                          |
 | 
						|
 | 
						|
### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
 | 
						|
 | 
						|
Get the value of an attribute from the `LexemeC` struct by attribute ID.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport IS_ALPHA
 | 
						|
> from spacy.lexeme cimport Lexeme
 | 
						|
>
 | 
						|
> lexeme = doc.c[3].lex
 | 
						|
> is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                                                          |
 | 
						|
| ----------- | ---------------------------------------------------------------------------------------------------- |
 | 
						|
| `lex`       | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~                                                 |
 | 
						|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| **RETURNS** | The value of the attribute. ~~attr_t (uint64_t)~~                                                    |
 | 
						|
 | 
						|
### Lexeme.set_struct_attr {#lexeme_set_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
 | 
						|
 | 
						|
Set the value of an attribute of the `LexemeC` struct by attribute ID.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport NORM
 | 
						|
> from spacy.lexeme cimport Lexeme
 | 
						|
>
 | 
						|
> lexeme = doc.c[3].lex
 | 
						|
> Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                                                          |
 | 
						|
| ----------- | ---------------------------------------------------------------------------------------------------- |
 | 
						|
| `lex`       | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~                                                 |
 | 
						|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| `value`     | The value to set. ~~attr_t (uint64_t)~~                                                              |
 | 
						|
 | 
						|
### Lexeme.c_check_flag {#lexeme_c_check_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
 | 
						|
 | 
						|
Check the value of a binary flag attribute.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport IS_STOP
 | 
						|
> from spacy.lexeme cimport Lexeme
 | 
						|
>
 | 
						|
> lexeme = doc.c[3].lex
 | 
						|
> is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name        | Description                                                                                   |
 | 
						|
| ----------- | --------------------------------------------------------------------------------------------- |
 | 
						|
| `lexeme`    | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~                                          |
 | 
						|
| `flag_id`   | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| **RETURNS** | The boolean value of the flag. ~~bint~~                                                       |
 | 
						|
 | 
						|
### Lexeme.c_set_flag {#lexeme_c_set_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
 | 
						|
 | 
						|
Set the value of a binary flag attribute.
 | 
						|
 | 
						|
> #### Example
 | 
						|
>
 | 
						|
> ```python
 | 
						|
> from spacy.attrs cimport IS_STOP
 | 
						|
> from spacy.lexeme cimport Lexeme
 | 
						|
>
 | 
						|
> lexeme = doc.c[3].lex
 | 
						|
> Lexeme.c_set_flag(lexeme, IS_STOP, 0)
 | 
						|
> ```
 | 
						|
 | 
						|
| Name      | Description                                                                                   |
 | 
						|
| --------- | --------------------------------------------------------------------------------------------- |
 | 
						|
| `lexeme`  | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~                                          |
 | 
						|
| `flag_id` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
 | 
						|
| `value`   | The value to set. ~~bint~~                                                                    |
 |