Update docs [ci skip]

Ines Montani 2020-08-22 16:47:03 +02:00
parent 8685229891
commit 37ebff6997
2 changed files with 40 additions and 45 deletions

@ -9,23 +9,23 @@ menu:
next: /usage/training
---
spaCy supports a number of transfer and multi-task learning workflows that can
often help improve your pipeline's efficiency or accuracy. Transfer learning
spaCy supports a number of **transfer and multi-task learning** workflows that
can often help improve your pipeline's efficiency or accuracy. Transfer learning
refers to techniques such as word vector tables and language model pretraining.
These techniques can be used to import knowledge from raw text into your
pipeline, so that your models are able to generalize better from your
annotated examples.
pipeline, so that your models are able to generalize better from your annotated
examples.
You can convert word vectors from popular tools like FastText and Gensim, or
you can load in any pretrained transformer model if you install our
`spacy-transformers` integration. You can also do your own language model pretraining
via the `spacy pretrain` command. You can even share your transformer or other
contextual embedding model across multiple components, which can make long
pipelines several times more efficient.
In order to use transfer learning, you'll need to have at least a few annotated
examples for all of the classes you're trying to predict. If you don't, you
could try using a "one-shot learning" approach using
You can convert **word vectors** from popular tools like
[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim),
or you can load in any pretrained **transformer model** if you install
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can
also do your own language model pretraining via the
[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your
transformer or other contextual embedding model across multiple components,
which can make long pipelines several times more efficient. To use transfer
learning, you'll need at least a few annotated examples for what you're trying
to predict. Otherwise, you could try using a "one-shot learning" approach using
[vectors and similarity](/usage/linguistic-features#vectors-similarity).
<Accordion title="What's the difference between word vectors and language models?" id="vectors-vs-language-models">
@ -70,35 +70,13 @@ of performance.
## Shared embedding layers {#embedding-layers}
You can share a single token-to-vector embedding model between multiple
components using the `Tok2Vec` component. Other components in
your pipeline can "connect" to the `Tok2Vec` component by including a _listener layer_
within their model. At the beginning of training, the `Tok2Vec` component will
grab a reference to the relevant listener layers in the rest of your pipeline.
Then, when the `Tok2Vec` component processes a batch of documents, it will pass
forward its predictions to the listeners, allowing the listeners to reuse the
predictions when they are eventually called. A similar mechanism is used to
pass gradients from the listeners back to the `Tok2Vec` model. The
`Transformer` component and `TransformerListener` layer do the same thing for
transformer models, making it easy to share a single transformer model across
your whole pipeline.
Training a single transformer or other embedding layer for use with multiple
components is termed _multi-task learning_. Multi-task learning is sometimes
less consistent, and the results are generally harder to reason about (as there's
more going on). You'll usually want to compare your accuracy against a single-task
approach to understand whether the weight-sharing is impacting your accuracy,
and whether you can address the problem by adjusting the hyper-parameters. We
are not currently aware of any foolproof recipe.
The main disadvantage of sharing weights between components is flexibility.
If your components are independent, you can train pipelines separately and
merge them together much more easily. Shared weights also make it more
difficult to resume training of only part of your pipeline. If you train only
part of your pipeline, you risk hurting the accuracy of the other components,
as you'll be changing the shared embedding layer those components are relying
on. <!-- TODO: Once rehearsal is tested, mention it here. -->
spaCy lets you share a single embedding layer and reuse it across multiple
components. This is also known as **multi-task learning**. Sharing weights
between components can make your pipeline run a lot faster and result in a much
smaller model size, as you only need a single copy of the embeddings. However,
it can make the pipeline less modular and make it more difficult to swap
components or retrain parts of the pipeline, since all components depend on the
same weights.
![Pipeline components using a shared embedding component vs. independent embedding layers](../images/tok2vec.svg)
@ -107,10 +85,27 @@ on. <!-- TODO: Once rehearsal is tested, mention it here. -->
| ✅ **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component |
| ✅ **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component |
| ❌ **less composable:** all components require the same embedding component in the pipeline | ✅ **modular:** components can be moved and swapped freely |
| ?? **accuracy:** weight sharing may increase or decrease accuracy, depending on your task and data, but usually the impact is small |
You can share a single token-to-vector embedding model between multiple components using the
[`Tok2Vec`](/api/tok2vec) component. Other components in your pipeline can
"connect" to this component by including a **listener layer** like
[Tok2VecListener](/api/architectures#Tok2VecListener) within their model.
![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab
a reference to the relevant listener layers in the rest of your pipeline. When
it processes a batch of documents, it will pass forward its predictions to the
listeners, allowing the listeners to **reuse the predictions** when they are
eventually called. A similar mechanism is used to pass gradients from the
listeners back to the model. The [`Transformer`](/api/transformer) component and
[TransformerListener](/api/architectures#TransformerListener) layer do the same
thing for transformer models, making it easy to share a single transformer model
across your whole pipeline.
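The listener mechanism described above can be illustrated with a minimal, library-free sketch. This is not spaCy's implementation: `ToyTok2Vec`, `ToyListener`, and the fake `embed` function are hypothetical names invented for illustration, and gradient passing is left out. The sketch only shows the forward direction, where the shared component runs once per batch and hands the same predictions to every listener.

```python
from typing import List, Optional

Vector = List[float]
DocVectors = List[Vector]


class ToyListener:
    """Hypothetical listener layer: it never embeds text itself."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._cached: Optional[List[DocVectors]] = None

    def receive(self, outputs: List[DocVectors]) -> None:
        # Called by the shared component when it processes a batch.
        self._cached = outputs

    def predict(self) -> List[DocVectors]:
        # Reuse the predictions that were passed forward earlier.
        if self._cached is None:
            raise RuntimeError("the shared embedding component has not run yet")
        return self._cached


class ToyTok2Vec:
    """Hypothetical shared embedding component (assumption, not spaCy's Tok2Vec)."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.listeners: List[ToyListener] = []
        self.calls = 0  # number of times a batch was actually embedded

    def add_listener(self, listener: ToyListener) -> None:
        # Keep a reference to each listener in the rest of the pipeline.
        self.listeners.append(listener)

    def embed(self, token: str) -> Vector:
        # Deterministic fake embedding, standing in for a real model.
        return [float((len(token) + i) % 7) for i in range(self.width)]

    def predict(self, batch: List[List[str]]) -> None:
        # Embed the batch once, then pass the result forward to all listeners.
        self.calls += 1
        outputs = [[self.embed(token) for token in doc] for doc in batch]
        for listener in self.listeners:
            listener.receive(outputs)


# Two downstream components share one embedding component.
tok2vec = ToyTok2Vec(width=3)
tagger = ToyListener("tagger")
ner = ToyListener("ner")
tok2vec.add_listener(tagger)
tok2vec.add_listener(ner)

tok2vec.predict([["hello", "world"], ["hi"]])
tagger_vectors = tagger.predict()
ner_vectors = ner.predict()
```

Both listeners end up holding the same object, so the batch is embedded once no matter how many components listen. A real implementation would also need to route gradients from each listener back to the shared model during training, which this sketch omits.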
<!-- TODO: show example of implementation via config, side by side -->
<!-- TODO: Once rehearsal is tested, mention it here. -->
## Using transformer models {#transformers}

@ -1550,7 +1550,7 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
### Adding word vectors {#adding-vectors}
Custom word vectors can be trained using a number of open-source libraries, such
as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc),
as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
or Tomas Mikolov's original
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
word vector libraries output an easy-to-read text-based format, where each line