Docs for pretrain architectures (#6605)

* document pretraining architectures

* formatting

* bit more info

* small fixes
Sofie Van Landeghem 2021-01-06 06:12:30 +01:00 committed by GitHub
parent bf9096437e
commit 82ae95267a
2 changed files with 86 additions and 15 deletions


@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
- ['Tok2Vec', 'tok2vec-arch']
- ['Transformers', 'transformers']
- ['Pretraining', 'pretrain']
- ['Parser & NER', 'parser']
- ['Tagging', 'tagger']
- ['Text Classification', 'textcat']
@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize a
`Tok2Vec` layer in your pipeline with information from raw text. To this end,
additional layers are added to build a network for a temporary task that forces
the `Tok2Vec` layer to learn something about sentence structure and word
co-occurrence statistics. Two pretraining objectives are available, both of
which are variants of the cloze task
[Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for BERT.

For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
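
As a rough sketch of how the pieces fit together: the objective is plugged into
the `[pretraining.objective]` sub-block of the `[pretraining]` section of your
config. The surrounding settings shown here (`max_epochs`, `dropout`) are
illustrative values only:

```ini
[pretraining]
component = "tok2vec"
# Illustrative settings, adjust for your data and compute budget
max_epochs = 1000
dropout = 0.2

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
```
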
### spacy.PretrainVectors.v1 {#pretrain_vectors}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Predict the word's vector from a static embeddings table as the pretraining
objective for a `Tok2Vec` layer.

| Name            | Description                                                                                                                                                |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------|
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~                                                                             |
| `hidden_size`   | Size of the hidden layer of the model. ~~int~~                                                                                                             |
| `loss`          | The loss function can be either `"cosine"` or `"L2"`. We typically recommend using `"cosine"`. ~~str~~                                                     |
| **CREATES**     | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
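
Since this objective predicts rows of a static vectors table, the pipeline
needs word vectors to be available. A hedged sketch of how that is typically
wired up in the same config (the `[paths]` value and the `[initialize]` block
are assumptions about a standard setup, not part of this architecture itself):

```ini
[paths]
# Pipeline or package that provides the static word vectors (illustrative)
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}
```
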
### spacy.PretrainCharacters.v1 {#pretrain_chars}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```

Predict some number of leading and trailing UTF-8 bytes as the pretraining
objective for a `Tok2Vec` layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
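
The character window is symmetric, so larger values make the model reconstruct
more of each word's prefix and suffix. A variant of the example above with a
wider window (the value `6` is purely illustrative):

```ini
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
# Predict the first six and the last six characters of each word
n_characters = 6
```
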
## Parser & NER architectures {#parser}
### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}


@ -713,34 +713,39 @@ layer = "tok2vec"
#### Pretraining objectives {#pretraining-details}
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.
> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes for the words. For instance, with `n_characters = 2` the model will try
  to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector from a static
  embeddings table. This requires a word vectors model to be trained and
  loaded. The vectors objective can optimize either a cosine or an L2 loss.
  We've generally found cosine loss to perform better.
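
Once pretraining has finished, the learned weights can typically be loaded back
in when training the full pipeline via the `init_tok2vec` setting. A minimal
sketch, assuming a weights file written by `spacy pretrain` (the exact filename
depends on the epoch and output directory):

```ini
[paths]
# Weights file produced by `spacy pretrain` (filename is illustrative)
init_tok2vec = "pretrain/model999.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```
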
These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an