diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md
index 04b79007c..d535982d7 100644
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -9,23 +9,23 @@ menu:
 next: /usage/training
 ---
 
-spaCy supports a number of transfer and multi-task learning workflows that can
-often help improve your pipeline's efficiency or accuracy. Transfer learning
+spaCy supports a number of **transfer and multi-task learning** workflows that
+can often help improve your pipeline's efficiency or accuracy. Transfer learning
 refers to techniques such as word vector tables and language model pretraining.
 These techniques can be used to import knowledge from raw text into your
-pipeline, so that your models are able to generalize better from your
-annotated examples.
+pipeline, so that your models are able to generalize better from your annotated
+examples.
 
-You can convert word vectors from popular tools like FastText and Gensim, or
-you can load in any pretrained transformer model if you install our
-`spacy-transformers` integration. You can also do your own language model pretraining
-via the `spacy pretrain` command. You can even share your transformer or other
-contextual embedding model across multiple components, which can make long
-pipelines several times more efficient.
-
-In order to use transfer learning, you'll need to have at least a few annotated
-examples for all of the classes you're trying to predict. If you don't, you
-could try using a "one-shot learning" approach using
+You can convert **word vectors** from popular tools like
+[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim),
+or you can load in any pretrained **transformer model** if you install
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can
+also do your own language model pretraining via the
+[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your
+transformer or other contextual embedding model across multiple components,
+which can make long pipelines several times more efficient. To use transfer
+learning, you'll need at least a few annotated examples for what you're trying
+to predict. Otherwise, you could try using a "one-shot learning" approach using
 [vectors and similarity](/usage/linguistic-features#vectors-similarity).
@@ -70,35 +70,13 @@ of performance.
 
 ## Shared embedding layers {#embedding-layers}
 
-You can share a single token-to-vector embedding model between multiple
-components using the `Tok2Vec` component. Other components in
-your pipeline can "connect" to the `Tok2Vec` component by including a _listener layer_
-within their model. At the beginning of training, the `Tok2Vec` component will
-grab a reference to the relevant listener layers in the rest of your pipeline.
-Then, when the `Tok2Vec` component processes a batch of documents, it will pass
-forward its predictions to the listeners, allowing the listeners to reuse the
-predictions when they are eventually called. A similar mechanism is used to
-pass gradients from the listeners back to the `Tok2Vec` model. The
-`Transformer` component and `TransformerListener` layer do the same thing for
-transformer models, making it easy to share a single transformer model across
-your whole pipeline.
-
-Training a single transformer or other embedding layer for use with multiple
-components is termed _multi-task learning_. Multi-task learning is sometimes
-less consistent, and the results are generally harder to reason about (as there's
-more going on). You'll usually want to compare your accuracy against a single-task
-approach to understand whether the weight-sharing is impacting your accuracy,
-and whether you can address the problem by adjusting the hyper-parameters. We
-are not currently aware of any foolproof recipe.
-
-The main disadvantage of sharing weights between components is flexibility.
-If your components are independent, you can train pipelines separately and
-merge them together much more easily. Shared weights also make it more
-difficult to resume training of only part of your pipeline. If you train only
-part of your pipeline, you risk hurting the accuracy of the other components,
-as you'll be changing the shared embedding layer those components are relying
-on.
-
+spaCy lets you share a single embedding layer and reuse it across multiple
+components. This is also known as **multi-task learning**. Sharing weights
+between components can make your pipeline run a lot faster and result in a much
+smaller model size, as you only need a single copy of the embeddings. However,
+it can make the pipeline less modular and more difficult to swap components or
+retrain parts of the pipeline, since all components depend on the same weights.
 
 ![Pipeline components using a shared embedding component vs. independent embedding layers](../images/tok2vec.svg)
 
@@ -107,10 +85,27 @@ on.
 | ✅ **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component |
 | ✅ **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component |
 | ❌ **less composable:** all components require the same embedding component in the pipeline | ✅ **modular:** components can be moved and swapped freely |
-| ?? **accuracy:** weight sharing may increase or decrease accuracy, depending on your task and data, but usually the impact is small |
+
+You can share a single token-to-vector embedding model between multiple
+components using the [`Tok2Vec`](/api/tok2vec) component. Other components in
+your pipeline can "connect" to this component by including a **listener layer**
+like [`Tok2VecListener`](/api/architectures#Tok2VecListener) within their model.
 
 ![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
 
+At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab
+a reference to the relevant listener layers in the rest of your pipeline. When
+it processes a batch of documents, it will pass forward its predictions to the
+listeners, allowing the listeners to **reuse the predictions** when they are
+eventually called. A similar mechanism is used to pass gradients from the
+listeners back to the model. The [`Transformer`](/api/transformer) component
+and [`TransformerListener`](/api/architectures#TransformerListener) layer do
+the same thing for transformer models, making it easy to share a single
+transformer model across your whole pipeline.
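+
+As a rough sketch of what this looks like in practice, the training config
+excerpt below wires a `tagger` up to a shared `tok2vec` component through a
+[`Tok2VecListener`](/api/architectures#Tok2VecListener) layer. This example is
+illustrative rather than a recommended setup: the widths, rows and attrs are
+placeholder values, and the architecture version suffixes may differ depending
+on your spaCy version.
+
+```ini
+### config.cfg (excerpt)
+[components.tok2vec]
+factory = "tok2vec"
+
+# The shared embedding model: a hash embedding table followed by a CNN
+# encoder. All listeners receive the output of this layer.
+[components.tok2vec.model]
+@architectures = "spacy.Tok2Vec.v2"
+
+[components.tok2vec.model.embed]
+@architectures = "spacy.MultiHashEmbed.v2"
+width = 96
+attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
+rows = [5000, 2500, 2500, 2500]
+include_static_vectors = false
+
+[components.tok2vec.model.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v2"
+width = 96
+depth = 4
+window_size = 1
+maxout_pieces = 3
+
+[components.tagger]
+factory = "tagger"
+
+[components.tagger.model]
+@architectures = "spacy.Tagger.v2"
+
+# Instead of owning a full embedding model, the tagger only includes a
+# listener that connects to the shared tok2vec component above.
+[components.tagger.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+```
+
+Because the listener's `width` is interpolated from the shared encoder's
+setting, its output dimensionality always matches the layer it listens to.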
+
+
+
+
 ## Using transformer models {#transformers}
 
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index f52c2b2ad..f2ec48d63 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -1550,7 +1550,7 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
 
 ### Adding word vectors {#adding-vectors}
 
 Custom word vectors can be trained using a number of open-source libraries, such
-as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc),
+as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
 or Tomas Mikolov's original
 [Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
 word vector libraries output an easy-to-read text-based format, where each line