mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 01:46:28 +03:00
Update docs [ci skip]
parent
fd20f84927
commit
46bc513a4e
|
@@ -417,20 +417,18 @@ network has an internal CNN Tok2Vec layer and uses attention.
|
|||
> nO = null
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
|
||||
| `width` | int | Output dimension of the feature encoding step. |
|
||||
| `embed_size` | int | Input dimension of the feature encoding step. |
|
||||
| `conv_depth` | int | Depth of the Tok2Vec layer. |
|
||||
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
|
||||
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
|
||||
| `dropout` | float | The dropout rate. |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. |
|
||||
|
||||
If the `nO` dimension is not set, the TextCategorizer component will set it when
|
||||
`begin_training` is called.
|
||||
| Name | Type | Description |
|
||||
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
|
||||
| `width` | int | Output dimension of the feature encoding step. |
|
||||
| `embed_size` | int | Input dimension of the feature encoding step. |
|
||||
| `conv_depth` | int | Depth of the Tok2Vec layer. |
|
||||
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
|
||||
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
|
||||
| `dropout` | float | The dropout rate. |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
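
To make the effect of `exclusive_classes` more concrete, the following is a minimal sketch in plain Python of the two output activations involved: a softmax for mutually exclusive categories and an element-wise logistic function otherwise. The function names and example scores are assumptions made for illustration only; this is not spaCy's or Thinc's implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores so the category probabilities sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def logistic(scores: List[float]) -> List[float]:
    """Squash every score independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-score)) for score in scores]


# Hypothetical raw scores for three categories.
scores = [2.0, 1.0, -1.0]

# Mutually exclusive categories: probabilities compete and sum to 1.
exclusive = softmax(scores)

# Non-exclusive categories: each probability is independent of the others.
independent = logistic(scores)
```

With mutually exclusive categories, raising one score lowers the probability of all others, whereas with independent categories several labels can receive a high probability at once.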
|
||||
|
||||
### spacy.TextCatCNN.v1 {#TextCatCNN}
|
||||
|
||||
|
@@ -457,14 +455,12 @@ A neural network model where token vectors are calculated using a CNN. The
|
|||
vectors are mean pooled and used as features in a feed-forward network. This
|
||||
architecture is usually less accurate than the ensemble, but runs faster.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------- | ------------------------------------------ | --------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. |
|
||||
|
||||
If the `nO` dimension is not set, the TextCategorizer component will set it when
|
||||
`begin_training` is called.
|
||||
| Name | Type | Description |
|
||||
| --------------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
|
||||
|
||||
### spacy.TextCatBOW.v1 {#TextCatBOW}
|
||||
|
||||
|
@@ -482,17 +478,17 @@ If the `nO` dimension is not set, the TextCategorizer component will set it when
|
|||
An ngram "bag-of-words" model. This architecture should run much faster than the
|
||||
others, but may not be as accurate, especially if texts are short.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
|
||||
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. |
|
||||
|
||||
If the `nO` dimension is not set, the TextCategorizer component will set it when
|
||||
`begin_training` is called.
|
||||
| Name | Type | Description |
|
||||
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
|
||||
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
|
||||
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
|
||||
| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
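
As an illustration of what `ngram_size` controls, the sketch below enumerates every n-gram up to a maximum length from a list of tokens. It is a simplified stand-in written in plain Python: the helper name `extract_ngrams_sketch` and the example tokens are hypothetical, and the real model works on hashed token attributes rather than on tuples of strings.

```python
from typing import List, Tuple


def extract_ngrams_sketch(tokens: List[str], ngram_size: int) -> List[Tuple[str, ...]]:
    """Collect all n-grams of length 1 up to and including ngram_size."""
    features = []
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start : start + n]))
    return features


tokens = ["spaCy", "is", "fast"]
features = extract_ngrams_sketch(tokens, ngram_size=3)
# Three tokens yield 3 unigrams, 2 bigrams and 1 trigram.
```

Short texts produce very few such features, which is consistent with the note that a bag-of-words model may be less accurate on short inputs.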
|
||||
|
||||
<!-- TODO:
|
||||
### spacy.TextCatLowData.v1 {#TextCatLowData}
|
||||
-->
|
||||
|
||||
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
|
||||
|
||||
|
|
|
@@ -340,7 +340,7 @@ See the [`Transformer`](/api/transformer) API reference and
|
|||
|
||||
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
|
||||
|
||||
<!-- TODO: intro and also describe signature of functions -->
|
||||
<!-- TODO: intro -->
|
||||
|
||||
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
|
||||
|
||||
|
@@ -361,19 +361,16 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
|
|||
> get_length = null
|
||||
> ```
|
||||
|
||||
<!-- TODO: complete table -->
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `tolerance` | float | |
|
||||
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
|
||||
| Name | Type | Description |
|
||||
| ------------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `seqs` | `Iterable[Any]` | The sequences to minibatch. |
|
||||
| `size` | `Iterable[int]` / int | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `tolerance` | float | The fraction of `size` by which a batch may exceed the target number of words. |
|
||||
| `discard_oversize` | bool | Whether to discard sequences that by themselves exceed the tolerated size. |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
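
The interaction between `size`, `tolerance` and `discard_oversize` can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the code in `spacy/gold/batchers.py`: the function name `batch_by_words_sketch` is hypothetical, `size` is assumed to be a single integer rather than a schedule, and a batch is simply closed as soon as the next sequence would push it past the tolerated size.

```python
from typing import Any, Callable, Iterable, Iterator, List


def batch_by_words_sketch(
    seqs: Iterable[Any],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
    get_length: Callable[[Any], int] = len,
) -> Iterator[List[Any]]:
    """Group sequences so each batch holds roughly `size` words."""
    max_size = size + size * tolerance  # the tolerated upper bound
    batch: List[Any] = []
    batch_words = 0
    for seq in seqs:
        n_words = get_length(seq)
        if n_words > max_size:
            # Too long even on its own: emit it alone or drop it.
            if not discard_oversize:
                yield [seq]
            continue
        if batch and batch_words + n_words > max_size:
            # Adding this sequence would exceed the bound, so close the batch.
            yield batch
            batch, batch_words = [], 0
        batch.append(seq)
        batch_words += n_words
    if batch:
        yield batch


# Hypothetical "documents" of 3, 4, 5, 20 and 2 words.
docs = [["a"] * 3, ["b"] * 4, ["c"] * 5, ["d"] * 20, ["e"] * 2]
kept = list(batch_by_words_sketch(docs, size=8, tolerance=0.25))
dropped = list(batch_by_words_sketch(docs, size=8, tolerance=0.25, discard_oversize=True))
```

In this example the tolerated size is 10 words, so the 20-word document ends up in a batch by itself unless `discard_oversize` is set, in which case it is skipped.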
|
||||
|
||||
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
|
||||
|
||||
<!-- TODO: -->
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
|
@@ -383,34 +380,37 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
|
|||
> get_length = null
|
||||
> ```
|
||||
|
||||
<!-- TODO: complete table -->
|
||||
Create a batcher that yields batches of the specified size.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `size` | `Iterable[int]` / int | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
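
Because `size` can be either a fixed integer or a schedule, a rough sketch of both cases may help. The code below is plain Python and makes several assumptions: `compounding_sketch` is a hypothetical stand-in that only mimics the general idea of a compounding schedule (a value that grows by a constant factor up to a limit), and `minibatch_sketch` is an illustrative helper, not the registered `batch_by_sequence.v1` function.

```python
import itertools
from typing import Any, Iterable, Iterator, List, Union


def compounding_sketch(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield a value that grows by a constant factor until it reaches `stop`."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(
    items: Iterable[Any], size: Union[int, Iterable[float]]
) -> Iterator[List[Any]]:
    """Split items into batches, taking each batch size from `size`."""
    # A plain integer is treated as a schedule that never changes.
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    stream = iter(items)
    for batch_size in sizes:
        batch = list(itertools.islice(stream, int(batch_size)))
        if not batch:
            return
        yield batch


fixed = list(minibatch_sketch(range(5), 2))
growing = list(minibatch_sketch(range(10), compounding_sketch(1, 4, 2)))
```

With the fixed size every batch has two items apart from the last, while the schedule starts with small batches and grows them up to the limit of four.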
|
||||
|
||||
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
|
||||
|
||||
<!-- TODO: -->
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [training.batcher]
|
||||
> @batchers = "batch_by_words.v1"
|
||||
> @batchers = "batch_by_padded.v1"
|
||||
> size = 100
|
||||
> buffer = TODO:
|
||||
> buffer = 256
|
||||
> discard_oversize = false
|
||||
> get_length = null
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `buffer` | int | |
|
||||
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
|
||||
Minibatch a sequence by the size of padded batches that would result, with
|
||||
sequences binned by length within a window. The padded size is defined as the
|
||||
maximum length of sequences within the batch multiplied by the number of
|
||||
sequences in the batch.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `size` | `Iterable[int]` / int | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||
| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
|
||||
| `discard_oversize` | bool | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
|
||||
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
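
The padded size described above (the longest sequence in the batch multiplied by the number of sequences) and the role of `buffer` can be sketched as follows. This is a simplified illustration in plain Python rather than spaCy's implementation: the names `padded_size` and `batch_by_padded_sketch` are hypothetical, `size` is assumed to be a single integer, and the input is read fully into memory for brevity.

```python
from typing import Any, Callable, Iterable, Iterator, List


def padded_size(batch: List[Any], get_length: Callable[[Any], int] = len) -> int:
    """Longest sequence in the batch times the number of sequences."""
    if not batch:
        return 0
    return max(get_length(seq) for seq in batch) * len(batch)


def batch_by_padded_sketch(
    seqs: Iterable[Any],
    size: int,
    buffer: int = 256,
    discard_oversize: bool = False,
    get_length: Callable[[Any], int] = len,
) -> Iterator[List[Any]]:
    """Sort each window of `buffer` sequences by length, then fill batches."""
    all_seqs = list(seqs)  # simplification: materialize the whole input
    for start in range(0, len(all_seqs), buffer):
        window = sorted(all_seqs[start : start + buffer], key=get_length)
        batch: List[Any] = []
        for seq in window:
            if get_length(seq) > size:
                # Longer than the largest padded batch on its own.
                if not discard_oversize:
                    yield [seq]
                continue
            if batch and padded_size(batch + [seq], get_length) > size:
                yield batch
                batch = []
            batch.append(seq)
        if batch:
            yield batch


# Hypothetical sequences of lengths 5, 1, 2, 4 and 2.
seqs = [[0] * 5, [0] * 1, [0] * 2, [0] * 4, [0] * 2]
batches = list(batch_by_padded_sketch(seqs, size=8, buffer=5))
lengths = [[len(seq) for seq in batch] for batch in batches]
```

Sorting within the window places sequences of similar length next to each other, so less of each batch is wasted on padding. A larger window sorts more thoroughly but also moves the batches further away from the original, more random order.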
|
||||
|
||||
## Training data and alignment {#gold source="spacy/gold"}
|
||||
|
||||
|
|