Docs for pretrain architectures (#6605)

* document pretraining architectures
* formatting
* bit more info
* small fixes

Commit 82ae95267a (parent bf9096437e)

@@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
  - ['Tok2Vec', 'tok2vec-arch']
  - ['Transformers', 'transformers']
  - ['Pretraining', 'pretrain']
  - ['Parser & NER', 'parser']
  - ['Tagging', 'tagger']
  - ['Text Classification', 'textcat']

@@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize a
`Tok2Vec` layer in your pipeline with information from raw text. To this end,
additional layers are added to build a network for a temporary task that forces
the `Tok2Vec` layer to learn something about sentence structure and word
cooccurrence statistics. Two pretraining objectives are available, both of
which are variants of the cloze task
[Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for BERT.

For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
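
To give a rough idea of how an objective slots into the training config, the
sketch below shows a `[pretraining]` block together with its objective. The
settings outside `[pretraining.objective]` are assumptions rather than values
documented on this page; the full block can typically be generated with the
`init fill-config` command and its `--pretraining` flag.

```ini
# Illustrative sketch only: the values outside [pretraining.objective] are
# assumptions, not defaults documented on this page.
[pretraining]
max_epochs = 1000
dropout = 0.2
component = "tok2vec"
layer = ""

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
```

Pretraining itself is then run with the `spacy pretrain` command, passing the
config file and an output directory for the learned weights.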

### spacy.PretrainVectors.v1 {#pretrain_vectors}

> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Predict the word's vector from a static embeddings table as a pretraining
objective for a Tok2Vec layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss` | The loss function, which can be either `"cosine"` or `"L2"`. We typically recommend using `"cosine"`. ~~str~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
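
Because this objective predicts rows of a static vectors table, the pipeline
needs word vectors to be available when pretraining starts. The sketch below is
one common way to wire this up; the vectors source `"en_core_web_lg"` is an
example assumption, not a value from this page.

```ini
# Sketch: provide static vectors for the "vectors" objective.
# "en_core_web_lg" is an example source; any installed pipeline or vectors
# package containing word vectors should work.
[initialize]
vectors = "en_core_web_lg"

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
```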

### spacy.PretrainCharacters.v1 {#pretrain_chars}

> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```

Predict some number of leading and trailing UTF-8 bytes as a pretraining
objective for a Tok2Vec layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The number of leading and trailing characters to predict - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
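
After pretraining finishes, the learned weights are typically fed back into
regular training through the `init_tok2vec` setting. The file and directory
names below are placeholders, not outputs documented on this page.

```ini
# Sketch: reuse pretrained tok2vec weights when training the pipeline.
# The path is a placeholder for whichever weights file pretraining produced.
[paths]
init_tok2vec = "pretrain-output/model999.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```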

## Parser & NER architectures {#parser}

### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}

@@ -713,34 +713,39 @@ layer = "tok2vec"

#### Pretraining objectives {#pretraining-details}

> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes of each word. For instance, with `n_characters = 2` the model will try
  to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector from a static embeddings
  table. This requires a word vectors model to be trained and loaded. The
  vectors objective can optimize either a cosine or an L2 loss. We've generally
  found cosine loss to perform better.
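
The objective is independent of which layer is being pretrained: the `component`
and `layer` settings in the `[pretraining]` block select the target. As a
hypothetical sketch (the `"ner"` component name is an assumption, not from this
page), pretraining could also target a tok2vec layer embedded inside another
component:

```ini
# Hypothetical sketch: pretrain the tok2vec sublayer embedded in an "ner"
# component instead of a standalone tok2vec pipe.
[pretraining]
component = "ner"
layer = "tok2vec"

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
```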

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an