Docs for pretrain architectures (#6605)

* document pretraining architectures

* formatting

* bit more info

* small fixes
Sofie Van Landeghem 2021-01-06 06:12:30 +01:00 committed by GitHub
parent bf9096437e
commit 82ae95267a
2 changed files with 86 additions and 15 deletions


@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
- ['Tok2Vec', 'tok2vec-arch']
- ['Transformers', 'transformers']
- ['Pretraining', 'pretrain']
- ['Parser & NER', 'parser']
- ['Tagging', 'tagger']
- ['Text Classification', 'textcat']
@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize a
`Tok2Vec` layer in your pipeline with information from raw text. To this end,
additional layers are added to build a network for a temporary task that forces
the `Tok2Vec` layer to learn something about sentence structure and word
co-occurrence statistics. Two pretraining objectives are available, both of
which are variants of the cloze task
[Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for BERT.

For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
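
As a rough sketch of how the pieces fit together: the objective is plugged into
the `[pretraining.objective]` sub-block of the `[pretraining]` section of your
config. The surrounding settings shown here (`max_epochs`, `dropout`) are
illustrative values only:

```ini
[pretraining]
component = "tok2vec"
# Illustrative settings, adjust for your data and compute budget
max_epochs = 1000
dropout = 0.2

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
```
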
### spacy.PretrainVectors.v1 {#pretrain_vectors}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Predict the word's vector from a static embeddings table as the pretraining
objective for a `Tok2Vec` layer.

| Name            | Description                                                                                                                                                |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------|
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~                                                                             |
| `hidden_size`   | Size of the hidden layer of the model. ~~int~~                                                                                                             |
| `loss`          | The loss function can be either `"cosine"` or `"L2"`. We typically recommend using `"cosine"`. ~~str~~                                                     |
| **CREATES**     | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
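
Since this objective predicts rows of a static vectors table, the pipeline
needs word vectors to be available. A hedged sketch of how that is typically
wired up in the same config (the `[paths]` value and the `[initialize]` block
are assumptions about a standard setup, not part of this architecture itself):

```ini
[paths]
# Pipeline or package that provides the static word vectors (illustrative)
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}
```
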
### spacy.PretrainCharacters.v1 {#pretrain_chars}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```

Predict some number of leading and trailing UTF-8 bytes as the pretraining
objective for a `Tok2Vec` layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
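
The character window is symmetric, so larger values make the model reconstruct
more of each word's prefix and suffix. A variant of the example above with a
wider window (the value `6` is purely illustrative):

```ini
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
# Predict the first six and the last six characters of each word
n_characters = 6
```
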
## Parser & NER architectures {#parser}
### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}


@ -713,34 +713,39 @@ layer = "tok2vec"
#### Pretraining objectives {#pretraining-details}
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.
> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes for the words. For instance, with `n_characters = 2` the model will try
  to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector from a static
  embeddings table. This requires a word vectors model to be trained and
  loaded. The vectors objective can optimize either a cosine or an L2 loss.
  We've generally found cosine loss to perform better.
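
Once pretraining has finished, the learned weights can typically be loaded back
in when training the full pipeline via the `init_tok2vec` setting. A minimal
sketch, assuming a weights file written by `spacy pretrain` (the exact filename
depends on the epoch and output directory):

```ini
[paths]
# Weights file produced by `spacy pretrain` (filename is illustrative)
init_tok2vec = "pretrain/model999.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```
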
These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an