From f3981bd0c87b5f686593e51a53825b2c718eac6e Mon Sep 17 00:00:00 2001
From: Paul O'Leary McCann
Date: Thu, 18 Nov 2021 14:38:30 +0000
Subject: [PATCH] Clarify how to fill in init_tok2vec after pretraining (#9639)

* Clarify how to fill in init_tok2vec after pretraining

* Ignore init_tok2vec arg in pretraining

* Update docs, config setting

* Remove obsolete note about not filling init_tok2vec early

This seems to have also caught some lines that needed cleanup.
---
 spacy/training/pretrain.py                    |  2 ++
 website/docs/api/data-formats.md              |  2 +-
 website/docs/usage/embeddings-transformers.md | 35 +++++++++----------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/spacy/training/pretrain.py b/spacy/training/pretrain.py
index 465406a49..52af84aaf 100644
--- a/spacy/training/pretrain.py
+++ b/spacy/training/pretrain.py
@@ -31,6 +31,8 @@ def pretrain(
     allocator = config["training"]["gpu_allocator"]
     if use_gpu >= 0 and allocator:
         set_gpu_allocator(allocator)
+    # ignore in pretraining because we're creating it now
+    config["initialize"]["init_tok2vec"] = None
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 001455f33..c6cd92799 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -248,7 +248,7 @@ Also see the usage guides on the
 | `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
-| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
 | `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
 | `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
 | `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |
diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md
index febed6f2f..708cdd8bf 100644
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
 model doesn't seem to work feel free to open an
 [issue](https://github.com/explosion/spacy/issues). Additionally note that
 Transformers loaded in spaCy can only be used for tensors, and pretrained
-task-specific heads or text generation features cannot be used as part of
-the `transformer` pipeline component.
+task-specific heads or text generation features cannot be used as part of the
+`transformer` pipeline component.
@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
 sentence structure and word cooccurrence statistics.
 
 Pretraining produces a **binary weights file** that can be loaded back in at the
-start of training, using the configuration option `initialize.init_tok2vec`.
-The weights file specifies an initial set of weights. Training then proceeds as
+start of training, using the configuration option `initialize.init_tok2vec`. The
+weights file specifies an initial set of weights. Training then proceeds as
 normal.
 
 You can only pretrain one subnetwork from your pipeline at a time, and the
@@ -751,15 +751,14 @@ layer = "tok2vec"
 ```
 
 #### Connecting pretraining to training {#pretraining-training}
 
-To benefit from pretraining, your training step needs to know to initialize
-its `tok2vec` component with the weights learned from the pretraining step.
-You do this by setting `initialize.init_tok2vec` to the filename of the
-`.bin` file that you want to use from pretraining.
+To benefit from pretraining, your training step needs to know to initialize its
+`tok2vec` component with the weights learned from the pretraining step. You do
+this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
+that you want to use from pretraining.
 
-A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
-as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
-To make use of the final output, you could fill in this value in your config
-file:
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
+an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
+make use of the final output, you could fill in this value in your config file:
 
 ```ini
 ### config.cfg
@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}
 
-The outputs of `spacy pretrain` are not the same data format as the
-pre-packaged static word vectors that would go into
-[`initialize.vectors`](/api/data-formats#config-initialize).
-The pretraining output consists of the weights that the `tok2vec`
-component should start with in an existing pipeline, so it goes in
-`initialize.init_tok2vec`.
+The outputs of `spacy pretrain` are not the same data format as the pre-packaged
+static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
+output consists of the weights that the `tok2vec` component should start with in
+an existing pipeline, so it goes in `initialize.init_tok2vec`.
 
-
 #### Pretraining objectives {#pretraining-objectives}
 
 > ```ini
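
As context for reviewers, the behavior the `pretrain.py` change introduces can
be sketched outside spaCy using `thinc.api.Config`, the config system spaCy is
built on. This is an illustrative sketch, not part of the patch; the config
text and the `pretrain/model4.bin` path are taken from the docs example above:

```python
from thinc.api import Config

# Mirrors the docs example: training normally loads pretrained tok2vec
# weights from the path stored in the paths.init_tok2vec variable.
CONFIG_STR = """
[paths]
init_tok2vec = "pretrain/model4.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
"""

config = Config().from_str(CONFIG_STR)  # variables are interpolated on load
assert config["initialize"]["init_tok2vec"] == "pretrain/model4.bin"

# The guard added in spacy/training/pretrain.py: while *running* pretraining,
# the weights file is being created rather than consumed, so the configured
# path is cleared before the nlp object is built from the config.
config["initialize"]["init_tok2vec"] = None
```

During training itself no such override is applied, so the same config loads
`pretrain/model4.bin` into the `tok2vec` component as described in the updated
docs.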