Docs for pretrain architectures (#6605)

* document pretraining architectures
* formatting
* bit more info
* small fixes

Commit 82ae95267a (parent bf9096437e)

@@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
  - ['Tok2Vec', 'tok2vec-arch']
  - ['Transformers', 'transformers']
  - ['Pretraining', 'pretrain']
  - ['Parser & NER', 'parser']
  - ['Tagging', 'tagger']
  - ['Text Classification', 'textcat']

@@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize a
`Tok2Vec` layer in your pipeline with information from raw text. To this end,
additional layers are added to build a network for a temporary task that forces
the `Tok2Vec` layer to learn something about sentence structure and word
cooccurrence statistics. Two pretraining objectives are available, both of
which are variants of the cloze task
[Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for BERT.

For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
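
To give a rough idea of how an objective slots into the training config, the
sketch below shows a `[pretraining]` block together with its objective. The
settings outside `[pretraining.objective]` are assumptions rather than values
documented on this page; the full block can typically be generated with the
`init fill-config` command and its `--pretraining` flag.

```ini
# Illustrative sketch only: the values outside [pretraining.objective] are
# assumptions, not defaults documented on this page.
[pretraining]
max_epochs = 1000
dropout = 0.2
component = "tok2vec"
layer = ""

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
```

Pretraining itself is then run with the `spacy pretrain` command, passing the
config file and an output directory for the learned weights.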

### spacy.PretrainVectors.v1 {#pretrain_vectors}

> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Predict the word's vector from a static embeddings table as a pretraining
objective for a Tok2Vec layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss` | The loss function, which can be either `"cosine"` or `"L2"`. We typically recommend using `"cosine"`. ~~str~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
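
Because this objective predicts rows of a static vectors table, the pipeline
needs word vectors to be available when pretraining starts. The sketch below is
one common way to wire this up; the vectors source `"en_core_web_lg"` is an
example assumption, not a value from this page.

```ini
# Sketch: provide static vectors for the "vectors" objective.
# "en_core_web_lg" is an example source; any installed pipeline or vectors
# package containing word vectors should work.
[initialize]
vectors = "en_core_web_lg"

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
```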

### spacy.PretrainCharacters.v1 {#pretrain_chars}

> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```

Predict some number of leading and trailing UTF-8 bytes as a pretraining
objective for a Tok2Vec layer.

| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The number of leading and trailing characters to predict - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
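
After pretraining finishes, the learned weights are typically fed back into
regular training through the `init_tok2vec` setting. The file and directory
names below are placeholders, not outputs documented on this page.

```ini
# Sketch: reuse pretrained tok2vec weights when training the pipeline.
# The path is a placeholder for whichever weights file pretraining produced.
[paths]
init_tok2vec = "pretrain-output/model999.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```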

## Parser & NER architectures {#parser}

### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}

@@ -713,34 +713,39 @@ layer = "tok2vec"

#### Pretraining objectives {#pretraining-details}

> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes of each word. For instance, with `n_characters = 2` the model will try
  to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector from a static embeddings
  table. This requires a word vectors model to be trained and loaded. The
  vectors objective can optimize either a cosine or an L2 loss. We've generally
  found cosine loss to perform better.
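
The objective is independent of which layer is being pretrained: the `component`
and `layer` settings in the `[pretraining]` block select the target. As a
hypothetical sketch (the `"ner"` component name is an assumption, not from this
page), pretraining could also target a tok2vec layer embedded inside another
component:

```ini
# Hypothetical sketch: pretrain the tok2vec sublayer embedded in an "ner"
# component instead of a standalone tok2vec pipe.
[pretraining]
component = "ner"
layer = "tok2vec"

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
```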

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an