Clarify how to fill in init_tok2vec after pretraining (#9639)

* Clarify how to fill in init_tok2vec after pretraining

* Ignore init_tok2vec arg in pretraining

* Update docs, config setting

* Remove obsolete note about not filling init_tok2vec early

This seems to have also caught some lines that needed cleanup.
Paul O'Leary McCann 2021-11-18 14:38:30 +00:00 committed by GitHub
parent 86fa37e8ba
commit f3981bd0c8
3 changed files with 19 additions and 20 deletions


@@ -31,6 +31,8 @@ def pretrain(
     allocator = config["training"]["gpu_allocator"]
     if use_gpu >= 0 and allocator:
         set_gpu_allocator(allocator)
+    # ignore in pretraining because we're creating it now
+    config["initialize"]["init_tok2vec"] = None
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
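For context, a minimal config sketch of the setting this change neutralizes (the weights path here is hypothetical): with the override above, `spacy pretrain` now treats `initialize.init_tok2vec` as `null` even if the config points at an existing weights file, while `spacy train` still reads it as before.

```ini
### config.cfg (excerpt)
[initialize]
# read by `spacy train`, but reset to null internally by `spacy pretrain`,
# since pretraining is what produces this file in the first place
init_tok2vec = "pretrain/model4.bin"
```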


@@ -248,7 +248,7 @@ Also see the usage guides on the
 | `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
-| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
 | `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
 | `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
 | `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |
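As a rough illustration of the default wiring described in the `init_tok2vec` row above (the exact defaults are an assumption based on the variable named in the table), the setting usually points at a `[paths]` variable that stays `null` until a pretrained weights file exists:

```ini
### config.cfg (excerpt)
[paths]
# remains null until `spacy pretrain` has produced a weights file
init_tok2vec = null

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```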


@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
 model doesn't seem to work feel free to open an
 [issue](https://github.com/explosion/spacy/issues). Additionally note that
 Transformers loaded in spaCy can only be used for tensors, and pretrained
-task-specific heads or text generation features cannot be used as part of
-the `transformer` pipeline component.
+task-specific heads or text generation features cannot be used as part of the
+`transformer` pipeline component.
 
 <Infobox variant="warning">
@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
 sentence structure and word cooccurrence statistics.
 
 Pretraining produces a **binary weights file** that can be loaded back in at the
-start of training, using the configuration option `initialize.init_tok2vec`.
-The weights file specifies an initial set of weights. Training then proceeds as
+start of training, using the configuration option `initialize.init_tok2vec`. The
+weights file specifies an initial set of weights. Training then proceeds as
 normal.
 
 You can only pretrain one subnetwork from your pipeline at a time, and the
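The subnetwork to pretrain is selected in the config's `[pretraining]` block. A brief sketch (values are illustrative, not taken from this diff): `component` names the pipeline component that owns the embedding layer, and `layer` can point at a sublayer such as `"tok2vec"` when the embeddings live inside another component.

```ini
### config.cfg (excerpt)
[pretraining]
# pretrain the standalone tok2vec component; leave `layer` empty in that case
component = "tok2vec"
layer = ""
```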
@@ -751,15 +751,14 @@ layer = "tok2vec"
 #### Connecting pretraining to training {#pretraining-training}
 
-To benefit from pretraining, your training step needs to know to initialize
-its `tok2vec` component with the weights learned from the pretraining step.
-You do this by setting `initialize.init_tok2vec` to the filename of the
-`.bin` file that you want to use from pretraining.
+To benefit from pretraining, your training step needs to know to initialize its
+`tok2vec` component with the weights learned from the pretraining step. You do
+this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
+that you want to use from pretraining.
 
-A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
-as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
-To make use of the final output, you could fill in this value in your config
-file:
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
+an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
+make use of the final output, you could fill in this value in your config file:
 
 ```ini
 ### config.cfg
@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}
 <Infobox variant="warning">
 
-The outputs of `spacy pretrain` are not the same data format as the
-pre-packaged static word vectors that would go into
-[`initialize.vectors`](/api/data-formats#config-initialize).
-The pretraining output consists of the weights that the `tok2vec`
-component should start with in an existing pipeline, so it goes in
-`initialize.init_tok2vec`.
+The outputs of `spacy pretrain` are not the same data format as the pre-packaged
+static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
+output consists of the weights that the `tok2vec` component should start with in
+an existing pipeline, so it goes in `initialize.init_tok2vec`.
 
 </Infobox>
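To make the distinction in the infobox concrete, a hedged sketch (the package name and path are placeholders): `initialize.vectors` expects a pipeline or package containing static word vectors, while `initialize.init_tok2vec` expects the `.bin` weights file written by `spacy pretrain`.

```ini
### config.cfg (excerpt)
[initialize]
# a pipeline packaged with static vectors, e.g. built with `spacy init vectors`
vectors = "en_vectors_lg"
# binary tok2vec weights produced by `spacy pretrain`
init_tok2vec = "pretrain/model4.bin"
```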
 
 #### Pretraining objectives {#pretraining-objectives}
 
 > ```ini