Update docs [ci skip]

Ines Montani 2020-08-07 20:14:31 +02:00
parent fd20f84927
commit 46bc513a4e
2 changed files with 53 additions and 57 deletions


@@ -417,20 +417,18 @@ network has an internal CNN Tok2Vec layer and uses attention.

> nO = null
> ```

| Name                 | Type  | Description |
| -------------------- | ----- | ----------- |
| `exclusive_classes`  | bool  | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool  | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width`              | int   | Output dimension of the feature encoding step. |
| `embed_size`         | int   | Input dimension of the feature encoding step. |
| `conv_depth`         | int   | Depth of the Tok2Vec layer. |
| `window_size`        | int   | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout`            | float | The dropout rate. |
| `nO`                 | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
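To show how this architecture slots into a pipeline, here is a minimal config sketch that wires `spacy.TextCatEnsemble.v1` into a `textcat` component. The `[components.textcat]` block name, the `textcat` factory and the parameter values are illustrative, not prescriptive:

```ini
[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v1"
exclusive_classes = false
pretrained_vectors = false
width = 64
embed_size = 2000
conv_depth = 2
window_size = 1
ngram_size = 1
dropout = 0.0
nO = null
```

Leaving `nO = null` defers the output dimension to the [`TextCategorizer`](/api/textcategorizer) component, which fills it in when `begin_training` is called.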
### spacy.TextCatCNN.v1 {#TextCatCNN}
@@ -457,14 +455,12 @@ A neural network model where token vectors are calculated using a CNN. The

vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.

| Name                | Type                                       | Description |
| ------------------- | ------------------------------------------ | ----------- |
| `exclusive_classes` | bool                                       | Whether or not categories are mutually exclusive. |
| `tok2vec`           | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`                | int                                        | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
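Because `tok2vec` is a `Model` argument, it is configured with a nested block rather than a literal value. The sketch below assumes `spacy.HashEmbedCNN.v1` as the embedding layer; the block names and numeric values are illustrative only:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
```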
### spacy.TextCatBOW.v1 {#TextCatBOW}
@@ -482,17 +478,17 @@ If the `nO` dimension is not set, the TextCategorizer component will set it when

An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.

| Name                | Type | Description |
| ------------------- | ---- | ----------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`        | int  | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer`   | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO`                | int  | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
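Since only the model block selects the architecture, the bag-of-words model can be swapped in as a faster baseline without touching the rest of the component config. A minimal sketch with illustrative values:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null
```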
<!-- TODO:
### spacy.TextCatLowData.v1 {#TextCatLowData}
-->
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}


@@ -340,7 +340,7 @@ See the [`Transformer`](/api/transformer) API reference and

## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}

<!-- TODO: intro -->
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
@@ -361,19 +361,16 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument

> get_length = null
> ```

| Name               | Type                   | Description |
| ------------------ | ---------------------- | ----------- |
| `seqs`             | `Iterable[Any]`        | The sequences to minibatch. |
| `size`             | `Iterable[int]` / int  | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding), as shown in the sketch below. |
| `tolerance`        | float                  | What percentage of the size to allow batches to exceed. |
| `discard_oversize` | bool                   | Whether to discard sequences that by themselves exceed the tolerated size. |
| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
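As a sketch of the schedule-as-block pattern mentioned for `size`, the batcher can be combined with a `compounding` schedule so the word budget grows over training. The `start`, `stop` and `compound` values here are illustrative:

```ini
[training.batcher]
@batchers = "batch_by_words.v1"
tolerance = 0.2
discard_oversize = false
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```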
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
> #### Example config
>
> ```ini
@@ -383,34 +380,37 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument

> get_length = null
> ```

Create a batcher that produces batches of the specified size.

| Name         | Type                   | Description |
| ------------ | ---------------------- | ----------- |
| `size`       | `Iterable[int]` / int  | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_padded.v1"
> size = 100
> buffer = 256
> discard_oversize = false
> get_length = null
> ```
Minibatch a sequence by the size of padded batches that would result, with
sequences binned by length within a window. The padded size is defined as the
maximum length of sequences within the batch multiplied by the number of
sequences in the batch. For instance, a batch of 4 sequences whose longest
member is 25 tokens long has a padded size of 4 × 25 = 100.

| Name               | Type                   | Description |
| ------------------ | ---------------------- | ----------- |
| `size`             | `Iterable[int]` / int  | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer`           | int                    | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
| `discard_oversize` | bool                   | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
## Training data and alignment {#gold source="spacy/gold"}