Clarify how to fill in init_tok2vec after pretraining (#9639)

* Clarify how to fill in init_tok2vec after pretraining

* Ignore init_tok2vec arg in pretraining

* Update docs, config setting

* Remove obsolete note about not filling init_tok2vec early

This seems to have also caught some lines that needed cleanup.
Paul O'Leary McCann 2021-11-18 14:38:30 +00:00 committed by GitHub
parent 86fa37e8ba
commit f3981bd0c8
3 changed files with 19 additions and 20 deletions

@@ -31,6 +31,8 @@ def pretrain(
     allocator = config["training"]["gpu_allocator"]
     if use_gpu >= 0 and allocator:
         set_gpu_allocator(allocator)
+    # ignore in pretraining because we're creating it now
+    config["initialize"]["init_tok2vec"] = None
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
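As a minimal sketch of what the two added lines accomplish (the `config.cfg` path and the explicit `load_config` call here are illustrative, not part of the commit):

```python
from spacy.util import load_config, load_model_from_config

# Load the training config; "config.cfg" is a placeholder path.
config = load_config("config.cfg")
# Pretraining *produces* the init_tok2vec file, so any path already set
# in the config must not be loaded while the pipeline is being built.
config["initialize"]["init_tok2vec"] = None
nlp = load_model_from_config(config)
```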

@@ -248,7 +248,7 @@ Also see the usage guides on the
 | `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
-| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
 | `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
 | `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
 | `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |
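For the normal training case described in this row, the `${paths.init_tok2vec}` variable is filled in when the config is loaded; a hedged sketch with placeholder paths (equivalent to passing `--paths.init_tok2vec` on the command line):

```python
from spacy.util import load_config

# Fill in the paths.init_tok2vec variable that initialize.init_tok2vec
# defaults to; "pretrain/model4.bin" is a placeholder filename.
config = load_config(
    "config.cfg",
    overrides={"paths.init_tok2vec": "pretrain/model4.bin"},
)
```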

@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
 model doesn't seem to work feel free to open an
 [issue](https://github.com/explosion/spacy/issues). Additionally note that
 Transformers loaded in spaCy can only be used for tensors, and pretrained
-task-specific heads or text generation features cannot be used as part of
-the `transformer` pipeline component.
+task-specific heads or text generation features cannot be used as part of the
+`transformer` pipeline component.

 <Infobox variant="warning">
@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
 sentence structure and word cooccurrence statistics.

 Pretraining produces a **binary weights file** that can be loaded back in at the
-start of training, using the configuration option `initialize.init_tok2vec`.
-The weights file specifies an initial set of weights. Training then proceeds as
+start of training, using the configuration option `initialize.init_tok2vec`. The
+weights file specifies an initial set of weights. Training then proceeds as
 normal.

 You can only pretrain one subnetwork from your pipeline at a time, and the
@@ -751,15 +751,14 @@ layer = "tok2vec"

 #### Connecting pretraining to training {#pretraining-training}

-To benefit from pretraining, your training step needs to know to initialize
-its `tok2vec` component with the weights learned from the pretraining step.
-You do this by setting `initialize.init_tok2vec` to the filename of the
-`.bin` file that you want to use from pretraining.
+To benefit from pretraining, your training step needs to know to initialize its
+`tok2vec` component with the weights learned from the pretraining step. You do
+this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
+that you want to use from pretraining.

-A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
-as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
-To make use of the final output, you could fill in this value in your config
-file:
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
+an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
+make use of the final output, you could fill in this value in your config file:

 ```ini
 ### config.cfg
@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}

 <Infobox variant="warning">

-The outputs of `spacy pretrain` are not the same data format as the
-pre-packaged static word vectors that would go into
-[`initialize.vectors`](/api/data-formats#config-initialize).
-The pretraining output consists of the weights that the `tok2vec`
-component should start with in an existing pipeline, so it goes in
-`initialize.init_tok2vec`.
+The outputs of `spacy pretrain` are not the same data format as the pre-packaged
+static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
+output consists of the weights that the `tok2vec` component should start with in
+an existing pipeline, so it goes in `initialize.init_tok2vec`.

 </Infobox>

 #### Pretraining objectives {#pretraining-objectives}

 > ```ini
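Putting the documentation change together, a hedged end-to-end sketch of the training half (the `train` helper is the one spaCy v3.2 exposes for running training from Python; all paths and the choice of `model4.bin` are illustrative):

```python
from spacy.cli.train import train

# A 5-epoch pretraining run with output path pretrain/ writes
# pretrain/model0.bin through pretrain/model4.bin; point training
# at the final file via the paths.init_tok2vec variable.
train(
    "config.cfg",
    output_path="training/",
    overrides={"paths.init_tok2vec": "pretrain/model4.bin"},
)
```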