clarify how to connect pretraining to training (#9450)

* clarify how to connect pretraining to training

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
This commit is contained in:
Elia Robyn Lake (Robyn Speer) 2021-10-22 07:15:47 -04:00 committed by GitHub
parent b0b115ff39
commit fa70837f28

@@ -712,9 +712,11 @@ given you a 10% error reduction, pretraining with spaCy might give you another
The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific
subnetwork** within one of your components, and add additional layers to build a
network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics.

Pretraining produces a **binary weights file** that can be loaded back in at the
start of training, using the configuration option `initialize.init_tok2vec`.
The weights file specifies an initial set of weights. Training then proceeds as
normal.
You can only pretrain one subnetwork from your pipeline at a time, and the You can only pretrain one subnetwork from your pipeline at a time, and the
@@ -747,6 +749,40 @@ component = "textcat"
layer = "tok2vec"
```
#### Connecting pretraining to training {#pretraining-training}
To benefit from pretraining, your training step needs to initialize its
`tok2vec` component with the weights learned during the pretraining step.
You do this by setting `initialize.init_tok2vec` to the filename of the
`.bin` file that you want to use from pretraining.
A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
To make use of the final output, you could fill in this value in your config
file:
```ini
### config.cfg
[paths]
init_tok2vec = "pretrain/model4.bin"
[initialize]
init_tok2vec = ${paths.init_tok2vec}
```
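Putting the two steps together, the workflow can be sketched with the CLI
(the config filename and output directory here are illustrative, matching the
example above):

```shell
# Pretrain the tok2vec layer. With 5 epochs, this writes
# pretrain/model0.bin through pretrain/model4.bin, one file per epoch.
python -m spacy pretrain config.cfg pretrain/

# Train as usual; initialize.init_tok2vec picks up the pretrained weights.
# The path can also be overridden on the command line instead of being
# edited into the config:
python -m spacy train config.cfg --paths.init_tok2vec pretrain/model4.bin
```

Overriding `paths.init_tok2vec` on the command line is convenient when you want
to compare runs initialized from different pretraining epochs without editing
the config file each time.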
<Infobox variant="warning">
The outputs of `spacy pretrain` are not the same data format as the
pre-packaged static word vectors that would go into
[`initialize.vectors`](/api/data-formats#config-initialize).
The pretraining output consists of the weights that the `tok2vec`
component should start with in an existing pipeline, so it goes in
`initialize.init_tok2vec`.
</Infobox>
#### Pretraining objectives {#pretraining-objectives}
> ```ini