Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 17:36:30 +03:00
clarify how to connect pretraining to training (#9450)
* clarify how to connect pretraining to training

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Update website/docs/usage/embeddings-transformers.md
* Update website/docs/usage/embeddings-transformers.md
* Update website/docs/usage/embeddings-transformers.md
* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
parent b0b115ff39
commit fa70837f28
@@ -712,9 +712,11 @@ given you a 10% error reduction, pretraining with spaCy might give you another
 The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific
 subnetwork** within one of your components, and add additional layers to build a
 network for a temporary task that forces the model to learn something about
-sentence structure and word cooccurrence statistics. Pretraining produces a
-**binary weights file** that can be loaded back in at the start of training. The
-weights file specifies an initial set of weights. Training then proceeds as
+sentence structure and word cooccurrence statistics.
+
+Pretraining produces a **binary weights file** that can be loaded back in at the
+start of training, using the configuration option `initialize.init_tok2vec`.
+The weights file specifies an initial set of weights. Training then proceeds as
 normal.
 
 You can only pretrain one subnetwork from your pipeline at a time, and the
@@ -747,6 +749,40 @@ component = "textcat"
 layer = "tok2vec"
 ```
 
+#### Connecting pretraining to training {#pretraining-training}
+
+To benefit from pretraining, your training step needs to know to initialize
+its `tok2vec` component with the weights learned from the pretraining step.
+You do this by setting `initialize.init_tok2vec` to the filename of the
+`.bin` file that you want to use from pretraining.
+
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
+as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
+To make use of the final output, you could fill in this value in your config
+file:
+
+```ini
+### config.cfg
+
+[paths]
+init_tok2vec = "pretrain/model4.bin"
+
+[initialize]
+init_tok2vec = ${paths.init_tok2vec}
+```
+
+<Infobox variant="warning">
+
+The outputs of `spacy pretrain` are not the same data format as the
+pre-packaged static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize).
+The pretraining output consists of the weights that the `tok2vec`
+component should start with in an existing pipeline, so it goes in
+`initialize.init_tok2vec`.
+
+</Infobox>
+
+
 #### Pretraining objectives {#pretraining-objectives}
 
 > ```ini
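
The added docs pick `pretrain/model4.bin` by hand as the final output of a 5-epoch run. As a rough sketch of how that choice could be automated, the snippet below finds the highest-numbered `modelN.bin` in a pretraining output directory and prints a training command that overrides `paths.init_tok2vec`. The helper `latest_pretrain_weights` is hypothetical and not part of spaCy, and it assumes the output files follow the `modelN.bin` naming described in the diff; the demo uses a temporary directory with empty placeholder files rather than real weights.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical helper, not part of spaCy: it only inspects filenames.
MODEL_FILE = re.compile(r"^model(\d+)\.bin$")


def latest_pretrain_weights(output_dir: Path) -> Optional[Path]:
    """Return the highest-numbered modelN.bin in output_dir, or None if absent."""
    numbered = []
    for path in output_dir.iterdir():
        match = MODEL_FILE.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    if not numbered:
        return None
    # Compare epoch numbers as integers so model10.bin outranks model9.bin.
    return max(numbered, key=lambda item: item[0])[1]


# Demo on a throwaway directory that mimics a 5-epoch pretraining run.
with tempfile.TemporaryDirectory() as tmp:
    pretrain_dir = Path(tmp)
    for epoch in range(5):
        (pretrain_dir / f"model{epoch}.bin").touch()
    weights = latest_pretrain_weights(pretrain_dir)
    # The override mirrors the [paths] init_tok2vec setting shown in the diff.
    print(f"python -m spacy train config.cfg --paths.init_tok2vec pretrain/{weights.name}")
```

Sorting on the parsed integer rather than the filename avoids the lexicographic trap where `model9.bin` would sort after `model10.bin` on longer runs.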