Clarify how to fill in init_tok2vec after pretraining (#9639)

* Clarify how to fill in init_tok2vec after pretraining

* Ignore init_tok2vec arg in pretraining

* Update docs, config setting

* Remove obsolete note about not filling init_tok2vec early

This seems to have also caught some lines that needed cleanup.
Paul O'Leary McCann 2021-11-18 14:38:30 +00:00 committed by GitHub
parent 86fa37e8ba
commit f3981bd0c8
3 changed files with 19 additions and 20 deletions

@@ -31,6 +31,8 @@ def pretrain(
     allocator = config["training"]["gpu_allocator"]
     if use_gpu >= 0 and allocator:
         set_gpu_allocator(allocator)
+    # ignore in pretraining because we're creating it now
+    config["initialize"]["init_tok2vec"] = None
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
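As a minimal sketch of what the two added lines accomplish (the `config.cfg` path and the explicit `load_config` call here are illustrative, not part of the commit):

```python
from spacy.util import load_config, load_model_from_config

# Load the training config; "config.cfg" is a placeholder path.
config = load_config("config.cfg")
# Pretraining *produces* the init_tok2vec file, so any path already set
# in the config must not be loaded while the pipeline is being built.
config["initialize"]["init_tok2vec"] = None
nlp = load_model_from_config(config)
```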

@@ -248,7 +248,7 @@ Also see the usage guides on the
 | `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
-| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
 | `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
 | `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
 | `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |
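For the normal training case described in this row, the `${paths.init_tok2vec}` variable is filled in when the config is loaded; a hedged sketch with placeholder paths (equivalent to passing `--paths.init_tok2vec` on the command line):

```python
from spacy.util import load_config

# Fill in the paths.init_tok2vec variable that initialize.init_tok2vec
# defaults to; "pretrain/model4.bin" is a placeholder filename.
config = load_config(
    "config.cfg",
    overrides={"paths.init_tok2vec": "pretrain/model4.bin"},
)
```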

@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
 model doesn't seem to work feel free to open an
 [issue](https://github.com/explosion/spacy/issues). Additionally note that
 Transformers loaded in spaCy can only be used for tensors, and pretrained
-task-specific heads or text generation features cannot be used as part of
-the `transformer` pipeline component.
+task-specific heads or text generation features cannot be used as part of the
+`transformer` pipeline component.

 <Infobox variant="warning">
@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
 sentence structure and word cooccurrence statistics.

 Pretraining produces a **binary weights file** that can be loaded back in at the
-start of training, using the configuration option `initialize.init_tok2vec`.
-The weights file specifies an initial set of weights. Training then proceeds as
+start of training, using the configuration option `initialize.init_tok2vec`. The
+weights file specifies an initial set of weights. Training then proceeds as
 normal.

 You can only pretrain one subnetwork from your pipeline at a time, and the
@@ -751,15 +751,14 @@ layer = "tok2vec"

 #### Connecting pretraining to training {#pretraining-training}

-To benefit from pretraining, your training step needs to know to initialize
-its `tok2vec` component with the weights learned from the pretraining step.
-You do this by setting `initialize.init_tok2vec` to the filename of the
-`.bin` file that you want to use from pretraining.
+To benefit from pretraining, your training step needs to know to initialize its
+`tok2vec` component with the weights learned from the pretraining step. You do
+this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
+that you want to use from pretraining.

-A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
-as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
-To make use of the final output, you could fill in this value in your config
-file:
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
+an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
+make use of the final output, you could fill in this value in your config file:

 ```ini
 ### config.cfg
@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}

 <Infobox variant="warning">

-The outputs of `spacy pretrain` are not the same data format as the
-pre-packaged static word vectors that would go into
-[`initialize.vectors`](/api/data-formats#config-initialize).
-The pretraining output consists of the weights that the `tok2vec`
-component should start with in an existing pipeline, so it goes in
-`initialize.init_tok2vec`.
+The outputs of `spacy pretrain` are not the same data format as the pre-packaged
+static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
+output consists of the weights that the `tok2vec` component should start with in
+an existing pipeline, so it goes in `initialize.init_tok2vec`.

 </Infobox>

 #### Pretraining objectives {#pretraining-objectives}

 > ```ini
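Putting the documentation change together, a hedged end-to-end sketch of the training half (the `train` helper is the one spaCy v3.2 exposes for running training from Python; all paths and the choice of `model4.bin` are illustrative):

```python
from spacy.cli.train import train

# A 5-epoch pretraining run with output path pretrain/ writes
# pretrain/model0.bin through pretrain/model4.bin; point training
# at the final file via the paths.init_tok2vec variable.
train(
    "config.cfg",
    output_path="training/",
    overrides={"paths.init_tok2vec": "pretrain/model4.bin"},
)
```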