Draft pretrain usage

This commit is contained in:
Matthew Honnibal 2020-09-17 18:17:03 +02:00
parent 1bb8b4f824
commit 6efb7688a6


@ -610,17 +610,83 @@ def MyCustomVectors(
## Pretraining {#pretraining}
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
</Infobox>
The `spacy pretrain` command lets you initialize your models with information
from raw text. Without pretraining, the models for your components will usually
be initialized randomly. The idea behind pretraining is simple: random probably
isn't optimal, so if we have some text to learn from, we can probably find
a way to get the model off to a better start. The impact of `spacy pretrain` varies,
but it will usually be worth trying if you're not using a transformer model and
you have relatively little training data (for instance, fewer than 5,000 sentences).
A good rule of thumb is that pretraining will generally give you a similar accuracy
improvement to using word vectors in your model. If word vectors have given you
a 10% error reduction, the `spacy pretrain` command might give you another 10%,
for a 20% error reduction in total.
<!--
- explain general concept and idea (short!)
- present it as a separate lightweight mechanism for pretraining the tok2vec
layer
- advantages (could also be pros/cons table)
- explain how it generates a separate file (!) and how it depends on the same
vectors
-->
The `spacy pretrain` command will take a specific subnetwork within one of your
components, and add additional layers to build a network for a temporary task
that forces the model to learn something about sentence structure and word
cooccurrence statistics. Pretraining produces a binary weights file that can be
loaded back in at the start of training. The weights file specifies an initial
set of weights. Training then proceeds as normal.
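
As a sketch of how this fits together, the weights file written by `spacy pretrain`
could be exposed to training through a path variable in the config. The setting
names and the file name below are assumptions for illustration and may differ in
your version of the config schema:

```ini
# Illustrative sketch only: the setting names (paths.init_tok2vec,
# training.init_tok2vec) and the epoch-numbered file name are assumptions.
[paths]
init_tok2vec = "pretrain_output/model99.bin"

[training]
init_tok2vec = ${paths.init_tok2vec}
```

Training would then initialize the pretrained layer from this file before
proceeding as usual.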
You can only pretrain one subnetwork from your pipeline at a time, and the subnetwork
must be typed `Model[List[Doc], List[Floats2d]]` (i.e., it has to be a "tok2vec" layer).
The most common workflow is to use the `Tok2Vec` component to create a shared
token-to-vector layer for several components of your pipeline, and apply
pretraining to its whole model.
The `spacy pretrain` command is configured using the `[pretraining]` section of
your config file. The `pretraining.component` and `pretraining.layer` settings
tell spaCy how to find the subnetwork to pretrain. The `pretraining.layer`
setting should be either the empty string (to use the whole model), or a
[node reference](https://thinc.ai/docs/usage-models#model-state). Most of spaCy's
built-in model architectures have a reference named `"tok2vec"` that will refer
to the right layer.
```ini
# Pretrain nlp.get_pipe("tok2vec").model
[pretraining]
component = "tok2vec"
layer = ""

# Pretrain nlp.get_pipe("textcat").model.get_ref("tok2vec")
[pretraining]
component = "textcat"
layer = "tok2vec"
```
Two pretraining objectives are available, both of which are variants of the cloze
task Devlin et al. (2018) introduced for BERT (a configuration sketch follows the
list below):

* The *characters* objective asks the model to predict some number of leading and
trailing UTF-8 bytes for the words. For instance, setting `n_characters=2`, the
model will try to predict the first two and last two characters of the word.
* The *vectors* objective asks the model to predict the word's vector, from
a static embeddings table. This requires a word vectors model to be trained
and loaded. The vectors objective can optimize either a cosine or an L2 loss.
We've generally found cosine loss to perform better.
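
For illustration, the two objectives above could be configured along these lines.
The exact schema is an assumption based on this draft (only `n_characters` and the
cosine/L2 loss choice are mentioned above), so treat the setting names as
placeholders:

```ini
# Sketch only: a real config would contain just one [pretraining.objective]
# block. The "type" and "loss" setting names are assumptions.

# Characters objective: predict leading and trailing characters of each word.
[pretraining.objective]
type = "characters"
n_characters = 4

# Vectors objective: predict each word's static vector, with a cosine loss.
[pretraining.objective]
type = "vectors"
loss = "cosine"
```
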
These pretraining objectives use a trick that we term _language modelling with
approximate outputs (LMAO)_. The motivation for the trick is that predicting
an exact word ID introduces a lot of incidental complexity. You need a large
output layer, and even then, the vocabulary is too large, which motivates
tokenization schemes that do not align to actual word boundaries. At the end of
training, the output layer will be thrown away regardless: we just want a task
that forces the network to model something about word cooccurrence statistics.
Predicting leading and trailing characters does that more than adequately, as
the exact word sequence could be recovered with high accuracy if the initial
and trailing characters are predicted accurately. With the vectors objective,
the pretraining uses the embedding space learned by an algorithm such as
GloVe or word2vec, allowing the model to focus on the contextual
modelling we actually care about.

The `[pretraining]` section has several configuration subsections that are
familiar from the training block: the `[pretraining.batcher]`,
`[pretraining.optimizer]` and `[pretraining.corpus]` sections all work the same
way and expect the same types of objects, although for pretraining your corpus
does not need to have any annotations, so you will often use a different reader,
such as `spacy.training.JsonlReader`.
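
For example, a corpus section that reads raw text from a JSONL file might look
like the following. The registered reader name and its parameters are assumptions
for illustration and may not match the final API:

```ini
# Illustrative sketch: read raw pretraining text from a JSONL file.
# The reader name and its parameters here are assumptions.
[pretraining.corpus]
@readers = "spacy.JsonlReader.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
```
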
> #### Raw text format
>