Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-07 20:14:31 +02:00
parent fd20f84927
commit 46bc513a4e
2 changed files with 53 additions and 57 deletions

@ -417,20 +417,18 @@ network has an internal CNN Tok2Vec layer and uses attention.
> nO = null
> ```
| Name | Type | Description |
| -------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width` | int | Output dimension of the feature encoding step. |
| `embed_size` | int | Input dimension of the feature encoding step. |
| `conv_depth` | int | Depth of the Tok2Vec layer. |
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout` | float | The dropout rate. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width` | int | Output dimension of the feature encoding step. |
| `embed_size` | int | Input dimension of the feature encoding step. |
| `conv_depth` | int | Depth of the Tok2Vec layer. |
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`                | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features.              |
| `dropout` | float | The dropout rate. |
| `nO`                        | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
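The same inference can be seen end to end. The snippet below is a minimal sketch, assuming the v3 `add_pipe` and `begin_training` API; the label names are placeholders:

```python
import spacy

# Sketch: nO can stay null in the config. Once labels are added and
# begin_training runs, the TextCategorizer infers the output dimension
# from the number of labels (here, 2).
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
optimizer = nlp.begin_training()
```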
### spacy.TextCatCNN.v1 {#TextCatCNN}
@ -457,14 +455,12 @@ A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.
| Name | Type | Description |
| ------------------- | ------------------------------------------ | --------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`                        | int                                        | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
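To make "mean pooled" concrete, here is a toy sketch with numpy; the shapes are illustrative only, not taken from the real model:

```python
import numpy

# 12 token vectors of width 96, as a CNN tok2vec layer might produce
token_vectors = numpy.random.rand(12, 96)
# Mean pooling: average over the token axis, yielding one vector per
# document that the feed-forward output layer consumes.
doc_vector = token_vectors.mean(axis=0)
assert doc_vector.shape == (96,)
```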
### spacy.TextCatBOW.v1 {#TextCatBOW}
@ -482,17 +478,17 @@ If the `nO` dimension is not set, the TextCategorizer component will set it when
An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.
| Name | Type | Description |
| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`        | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer`   | bool  | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`).                      |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`                | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features.              |
| `no_output_layer`           | bool  | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`).                                   |
| `nO`                        | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
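As a plain-Python illustration of what `ngram_size` controls (the `ngram_features` helper below is hypothetical, not part of spaCy):

```python
def ngram_features(tokens, ngram_size):
    # ngram_size=3 yields unigram, bigram and trigram features
    feats = []
    for n in range(1, ngram_size + 1):
        feats.extend(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return feats

print(ngram_features(["very", "good", "movie"], 3))
# [('very',), ('good',), ('movie',), ('very', 'good'), ('good', 'movie'),
#  ('very', 'good', 'movie')]
```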
<!-- TODO:
### spacy.TextCatLowData.v1 {#TextCatLowData}
-->
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

@ -340,7 +340,7 @@ See the [`Transformer`](/api/transformer) API reference and
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
<!-- TODO: intro and also describe signature of functions -->
<!-- TODO: intro -->
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
@ -361,19 +361,16 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
| Name | Type | Description |
| ------------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `seqs` | `Iterable[Any]` | The sequences to minibatch. |
| `size` | `Iterable[int]` / int | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | What percentage of the size to allow batches to exceed. |
| `discard_oversize` | bool | Whether to discard sequences that by themselves exceed the tolerated size. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
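The table above maps onto roughly the following logic. This is a simplified sketch, not the actual implementation in `spacy/gold/batchers.py`: it ignores schedule values for `size`:

```python
def batch_by_words_sketch(seqs, size, tolerance=0.2, discard_oversize=False,
                          get_length=len):
    # Batches may exceed the target size by up to `tolerance` percent.
    max_words = size + size * tolerance
    batch, n_words = [], 0
    for seq in seqs:
        n = get_length(seq)
        if n > max_words:
            # Oversized sequences are batched by themselves, or dropped
            # entirely if discard_oversize is set.
            if not discard_oversize:
                yield [seq]
            continue
        if n_words + n > max_words and batch:
            yield batch
            batch, n_words = [], 0
        batch.append(seq)
        n_words += n
    if batch:
        yield batch
```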
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
@ -383,34 +380,37 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
> get_length = null
> ```
<!-- TODO: complete table -->
Create a batcher that produces batches of the specified size.
| Name | Type | Description |
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
| Name | Type | Description |
| ------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
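In spirit this is plain fixed-size chunking, sketched below under the same simplification (no schedule support for `size`):

```python
from itertools import islice

def batch_by_sequence_sketch(seqs, size):
    # Yield consecutive batches of `size` items; the last batch may be smaller.
    it = iter(seqs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```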
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_words.v1"
> @batchers = "batch_by_padded.v1"
> size = 100
> buffer = TODO:
> buffer = 256
> discard_oversize = false
> get_length = null
> ```
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
Minibatch a sequence by the size of padded batches that would result, with
sequences binned by length within a window. The padded size is defined as the
maximum length of sequences within the batch multiplied by the number of
sequences in the batch.
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
| `discard_oversize` | bool | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
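A rough sketch of the binning logic described above, under the same simplifications (no schedules; the whole buffer is sorted at once; `batch_by_padded_sketch` and `_flush` are hypothetical names):

```python
def _flush(buf, size, discard_oversize, get_length):
    batch = []
    # Sort ascending, so the newest sequence is always the longest in the
    # batch: padded size = its length * number of sequences.
    for seq in sorted(buf, key=get_length):
        n = get_length(seq)
        if n > size:
            # Longer by itself than the largest padded batch size.
            if not discard_oversize:
                yield [seq]
            continue
        if batch and n * (len(batch) + 1) > size:
            yield batch
            batch = []
        batch.append(seq)
    if batch:
        yield batch

def batch_by_padded_sketch(seqs, size, buffer=256, discard_oversize=False,
                           get_length=len):
    # Accumulate `buffer` sequences before sorting, trading evenness of
    # batch sizes against randomness of iteration order.
    buf = []
    for seq in seqs:
        buf.append(seq)
        if len(buf) >= buffer:
            yield from _flush(buf, size, discard_oversize, get_length)
            buf = []
    if buf:
        yield from _flush(buf, size, discard_oversize, get_length)
```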
## Training data and alignment {#gold source="spacy/gold"}