Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-07 20:14:31 +02:00
parent fd20f84927
commit 46bc513a4e
2 changed files with 53 additions and 57 deletions

@ -417,20 +417,18 @@ network has an internal CNN Tok2Vec layer and uses attention.
> nO = null
> ```
| Name | Type | Description |
| -------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width` | int | Output dimension of the feature encoding step. |
| `embed_size` | int | Input dimension of the feature encoding step. |
| `conv_depth` | int | Depth of the Tok2Vec layer. |
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout` | float | The dropout rate. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width` | int | Output dimension of the feature encoding step. |
| `embed_size` | int | Input dimension of the feature encoding step. |
| `conv_depth` | int | Depth of the Tok2Vec layer. |
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`                | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features.              |
| `dropout` | float | The dropout rate. |
| `nO`                        | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
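The same inference can be seen end to end. The snippet below is a minimal sketch, assuming the v3 `add_pipe` and `begin_training` API; the label names are placeholders:

```python
import spacy

# Sketch: nO can stay null in the config. Once labels are added and
# begin_training runs, the TextCategorizer infers the output dimension
# from the number of labels (here, 2).
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
optimizer = nlp.begin_training()
```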
### spacy.TextCatCNN.v1 {#TextCatCNN}
@ -457,14 +455,12 @@ A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.
| Name | Type | Description |
| ------------------- | ------------------------------------------ | --------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`                        | int                                        | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
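To make "mean pooled" concrete, here is a toy sketch with numpy; the shapes are illustrative only, not taken from the real model:

```python
import numpy

# 12 token vectors of width 96, as a CNN tok2vec layer might produce
token_vectors = numpy.random.rand(12, 96)
# Mean pooling: average over the token axis, yielding one vector per
# document that the feed-forward output layer consumes.
doc_vector = token_vectors.mean(axis=0)
assert doc_vector.shape == (96,)
```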
### spacy.TextCatBOW.v1 {#TextCatBOW}
@ -482,17 +478,17 @@ If the `nO` dimension is not set, the TextCategorizer component will set it when
An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.
| Name | Type | Description |
| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`        | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer`   | bool  | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`).                      |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
| Name | Type | Description |
| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`                | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features.              |
| `no_output_layer`           | bool  | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`).                                   |
| `nO`                        | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
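As a plain-Python illustration of what `ngram_size` controls (the `ngram_features` helper below is hypothetical, not part of spaCy):

```python
def ngram_features(tokens, ngram_size):
    # ngram_size=3 yields unigram, bigram and trigram features
    feats = []
    for n in range(1, ngram_size + 1):
        feats.extend(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return feats

print(ngram_features(["very", "good", "movie"], 3))
# [('very',), ('good',), ('movie',), ('very', 'good'), ('good', 'movie'),
#  ('very', 'good', 'movie')]
```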
<!-- TODO:
### spacy.TextCatLowData.v1 {#TextCatLowData}
-->
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

@ -340,7 +340,7 @@ See the [`Transformer`](/api/transformer) API reference and
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
<!-- TODO: intro and also describe signature of functions -->
<!-- TODO: intro -->
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
@ -361,19 +361,16 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
| Name | Type | Description |
| ------------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `seqs` | `Iterable[Any]` | The sequences to minibatch. |
| `size` | `Iterable[int]` / int | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | What percentage of the size to allow batches to exceed. |
| `discard_oversize` | bool | Whether to discard sequences that by themselves exceed the tolerated size. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
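The table above maps onto roughly the following logic. This is a simplified sketch, not the actual implementation in `spacy/gold/batchers.py`: it ignores schedule values for `size`:

```python
def batch_by_words_sketch(seqs, size, tolerance=0.2, discard_oversize=False,
                          get_length=len):
    # Batches may exceed the target size by up to `tolerance` percent.
    max_words = size + size * tolerance
    batch, n_words = [], 0
    for seq in seqs:
        n = get_length(seq)
        if n > max_words:
            # Oversized sequences are batched by themselves, or dropped
            # entirely if discard_oversize is set.
            if not discard_oversize:
                yield [seq]
            continue
        if n_words + n > max_words and batch:
            yield batch
            batch, n_words = [], 0
        batch.append(seq)
        n_words += n
    if batch:
        yield batch
```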
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
@ -383,34 +380,37 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
> get_length = null
> ```
<!-- TODO: complete table -->
Create a batcher that produces batches of the specified size.
| Name | Type | Description |
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
| Name | Type | Description |
| ------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
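In spirit this is plain fixed-size chunking, sketched below under the same simplification (no schedule support for `size`):

```python
from itertools import islice

def batch_by_sequence_sketch(seqs, size):
    # Yield consecutive batches of `size` items; the last batch may be smaller.
    it = iter(seqs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```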
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_words.v1"
> @batchers = "batch_by_padded.v1"
> size = 100
> buffer = TODO:
> buffer = 256
> discard_oversize = false
> get_length = null
> ```
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
Minibatch a sequence by the size of padded batches that would result, with
sequences binned by length within a window. The padded size is defined as the
maximum length of sequences within the batch multiplied by the number of
sequences in the batch.
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
| `discard_oversize` | bool | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
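A rough sketch of the binning logic described above, under the same simplifications (no schedules; the whole buffer is sorted at once; `batch_by_padded_sketch` and `_flush` are hypothetical names):

```python
def _flush(buf, size, discard_oversize, get_length):
    batch = []
    # Sort ascending, so the newest sequence is always the longest in the
    # batch: padded size = its length * number of sequences.
    for seq in sorted(buf, key=get_length):
        n = get_length(seq)
        if n > size:
            # Longer by itself than the largest padded batch size.
            if not discard_oversize:
                yield [seq]
            continue
        if batch and n * (len(batch) + 1) > size:
            yield batch
            batch = []
        batch.append(seq)
    if batch:
        yield batch

def batch_by_padded_sketch(seqs, size, buffer=256, discard_oversize=False,
                           get_length=len):
    # Accumulate `buffer` sequences before sorting, trading evenness of
    # batch sizes against randomness of iteration order.
    buf = []
    for seq in seqs:
        buf.append(seq)
        if len(buf) >= buffer:
            yield from _flush(buf, size, discard_oversize, get_length)
            buf = []
    if buf:
        yield from _flush(buf, size, discard_oversize, get_length)
```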
## Training data and alignment {#gold source="spacy/gold"}