Draft pretrain usage (commit 6efb7688a6, parent 1bb8b4f824)

## Pretraining {#pretraining}

The `spacy pretrain` command lets you initialize your models with information
from raw text. Without pretraining, the models for your components will usually
be initialized randomly. The idea behind pretraining is simple: random probably
isn't optimal, so if we have some text to learn from, we can probably find a
way to get the model off to a better start. The impact of `spacy pretrain`
varies, but it will usually be worth trying if you're not using a transformer
model and you have relatively little training data (for instance, fewer than
5,000 sentences). A good rule of thumb is that pretraining will generally give
you a similar accuracy improvement to using word vectors in your model. If word
vectors have given you a 10% error reduction, the `spacy pretrain` command
might give you another 10%, for a 20% error reduction in total.

The `spacy pretrain` command will take a specific subnetwork within one of your
components, and add additional layers to build a network for a temporary task
that forces the model to learn something about sentence structure and word
cooccurrence statistics. Pretraining produces a binary weights file that can be
loaded back in at the start of training. The weights file specifies an initial
set of weights. Training then proceeds as normal.

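As a minimal sketch of that last step, the weights file written by pretraining
can be referenced when you start training. The `[initialize]` section, the
`init_tok2vec` setting and the `model-last.bin` filename below are illustrative
assumptions rather than part of the draft above; check the config schema for
your version.

```ini
# Sketch only: load the binary weights produced by `spacy pretrain` when
# training begins. The section, setting and file names are assumptions.
[initialize]
init_tok2vec = "pretrain_output/model-last.bin"
```
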
You can only pretrain one subnetwork from your pipeline at a time, and the
subnetwork must be typed `Model[List[Doc], List[Floats2d]]` (i.e., it has to be
a "tok2vec" layer). The most common workflow is to use the `Tok2Vec` component
to create a shared token-to-vector layer for several components of your
pipeline, and apply pretraining to its whole model.

The `spacy pretrain` command is configured using the `[pretraining]` section of
your config file. The `pretraining.component` and `pretraining.layer` settings
tell spaCy how to find the subnetwork to pretrain. The `pretraining.layer`
setting should be either the empty string (to use the whole model), or a
[node reference](https://thinc.ai/docs/usage-models#model-state). Most of
spaCy's built-in model architectures have a reference named `"tok2vec"` that
will refer to the right layer.

```ini
# Pretrain nlp.get_pipe("tok2vec").model
[pretraining]
component = "tok2vec"
layer = ""

# Pretrain nlp.get_pipe("textcat").model.get_ref("tok2vec")
[pretraining]
component = "textcat"
layer = "tok2vec"
```

Two pretraining objectives are available, both of which are variants of the
cloze task Devlin et al. (2018) introduced for BERT; a config sketch for
choosing between them follows the list below.

* The *characters* objective asks the model to predict some number of leading
  and trailing UTF-8 bytes for the words. For instance, if `n_characters=2`,
  the model will try to predict the first two and last two characters of the
  word.

* The *vectors* objective asks the model to predict the word's vector, from a
  static embeddings table. This requires a word vectors model to be trained and
  loaded. The vectors objective can optimize either a cosine or an L2 loss.
  We've generally found cosine loss to perform better.

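To make the two objectives concrete, here's a rough sketch of how they could be
selected in the config. The `[pretraining.objective]` block and its `type` key
are illustrative assumptions, not settings confirmed by the draft above;
`n_characters` and the cosine/L2 loss choice come from the list.

```ini
# Sketch only: choose the pretraining objective. The exact schema is an
# assumption; consult the documentation for your version.

# Characters objective: predict leading and trailing characters of each word.
[pretraining.objective]
type = "characters"
n_characters = 4

# Vectors objective: predict each word's static vector instead.
# [pretraining.objective]
# type = "vectors"
# loss = "cosine"
```
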
These pretraining objectives use a trick that we term _language modelling with
approximate outputs (LMAO)_. The motivation for the trick is that predicting an
exact word ID introduces a lot of incidental complexity. You need a large
output layer, and even then, the vocabulary is too large, which motivates
tokenization schemes that do not align to actual word boundaries. At the end of
training, the output layer will be thrown away regardless: we just want a task
that forces the network to model something about word cooccurrence statistics.
Predicting leading and trailing characters does that more than adequately, as
the exact word sequence could be recovered with high accuracy if the initial
and trailing characters are predicted accurately. With the vectors objective,
the pretraining uses the embedding space learned by an algorithm such as GloVe
or word2vec, allowing the model to focus on the contextual modelling we
actually care about.

The `[pretraining]` section has several configuration subsections that are
familiar from the training block: the `[pretraining.batcher]`,
`[pretraining.optimizer]` and `[pretraining.corpus]` sections all work the same
way and expect the same types of objects. For pretraining, however, your corpus
does not need to have any annotations, so you will often use a different
reader, such as `spacy.training.JsonlReader`.

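As a rough illustration, the pretraining corpus can be as simple as a JSONL
file of raw text. The registered reader name and the settings below are
assumptions and may differ between versions:

```ini
# Sketch only: read raw, unannotated text for pretraining from a JSONL file.
# The registered reader name and its settings are assumptions.
[pretraining.corpus]
@readers = "spacy.JsonlReader.v1"
path = "corpus/raw_text.jsonl"
```
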
> #### Raw text format
>