diff --git a/website/docs/api/architectures.mdx b/website/docs/api/architectures.mdx
index 268c04a07..cb217af15 100644
--- a/website/docs/api/architectures.mdx
+++ b/website/docs/api/architectures.mdx
@@ -481,6 +481,318 @@ The other arguments are shared between all versions.
+## Curated transformer architectures {id="curated-trf",source="https://github.com/explosion/spacy-curated-transformers/blob/main/spacy_curated_transformers/models/architectures.py"}
+
+The following architectures are provided by the package
+[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). See the
+[usage documentation](/usage/embeddings-transformers#transformers) for how to
+integrate the architectures into your training config.
+
+
+
+Note that in order to use these architectures in your config, you need to
+install the
+[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers)
+package. See the
+[installation docs](/usage/embeddings-transformers#transformers-installation)
+for details and system requirements.
+
+
+
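+For illustration, a config excerpt along the following lines plugs one of the
+transformer architectures below into a curated transformer pipe. This is a
+minimal sketch rather than a definitive recipe: the factory name, the
+hyperparameter values and the `WithStridedSpans.v1` span generator are
+assumptions based on a typical `spacy-curated-transformers` setup; see the
+[usage documentation](/usage/embeddings-transformers#transformers) and the
+[curated transformer API docs](/api/curated-transformer) for complete examples.
+
+```ini
+[components.transformer]
+factory = "curated_transformer"
+
+[components.transformer.model]
+@architectures = "spacy-curated-transformers.XlmrTransformer.v1"
+# Placeholder values; they must match the pretrained checkpoint that is
+# loaded through the model loaders in the [initialize] block.
+vocab_size = 250002
+hidden_width = 768
+num_hidden_layers = 12
+
+[components.transformer.model.with_spans]
+@architectures = "spacy-curated-transformers.WithStridedSpans.v1"
+
+[components.transformer.model.piece_encoder]
+@architectures = "spacy-curated-transformers.XlmrSentencepieceEncoder.v1"
+```
+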
+### spacy-curated-transformers.AlbertTransformer.v1
+
+Construct an ALBERT transformer model.
+
+| Name | Description |
+|--------------------------------|-----------------------------------------------------------------------------|
+| `vocab_size` | Vocabulary size. ~~int~~ |
+| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~ |
+| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~ |
+| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
+| `embedding_width` | Width of the embedding representations. ~~int~~ |
+| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
+| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
+| `hidden_width` | Width of the final representations. ~~int~~ |
+| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
+| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
+| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
+| `model_max_length` | Maximum length of model inputs. ~~int~~ |
+| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
+| `num_hidden_groups` | Number of layer groups whose constituents share parameters. ~~int~~ |
+| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
+| `padding_idx` | Index of the padding meta-token. ~~int~~ |
+| `type_vocab_size` | Type vocabulary size. ~~int~~ |
+| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**                    | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |
+
+### spacy-curated-transformers.BertTransformer.v1
+
+Construct a BERT transformer model.
+
+| Name | Description |
+|--------------------------------|-----------------------------------------------------------------------------|
+| `vocab_size` | Vocabulary size. ~~int~~ |
+| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~ |
+| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~ |
+| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
+| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
+| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
+| `hidden_width` | Width of the final representations. ~~int~~ |
+| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
+| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
+| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
+| `model_max_length` | Maximum length of model inputs. ~~int~~ |
+| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
+| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
+| `padding_idx` | Index of the padding meta-token. ~~int~~ |
+| `type_vocab_size` | Type vocabulary size. ~~int~~ |
+| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**                    | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |
+
+### spacy-curated-transformers.CamembertTransformer.v1
+
+Construct a CamemBERT transformer model.
+
+| Name | Description |
+|--------------------------------|-----------------------------------------------------------------------------|
+| `vocab_size` | Vocabulary size. ~~int~~ |
+| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~ |
+| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~ |
+| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
+| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
+| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
+| `hidden_width` | Width of the final representations. ~~int~~ |
+| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
+| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
+| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
+| `model_max_length` | Maximum length of model inputs. ~~int~~ |
+| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
+| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
+| `padding_idx` | Index of the padding meta-token. ~~int~~ |
+| `type_vocab_size` | Type vocabulary size. ~~int~~ |
+| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**                    | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |
+
+### spacy-curated-transformers.RobertaTransformer.v1
+
+Construct a RoBERTa transformer model.
+
+| Name | Description |
+|--------------------------------|-----------------------------------------------------------------------------|
+| `vocab_size` | Vocabulary size. ~~int~~ |
+| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~ |
+| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~ |
+| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
+| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
+| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
+| `hidden_width` | Width of the final representations. ~~int~~ |
+| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
+| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
+| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
+| `model_max_length` | Maximum length of model inputs. ~~int~~ |
+| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
+| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
+| `padding_idx` | Index of the padding meta-token. ~~int~~ |
+| `type_vocab_size` | Type vocabulary size. ~~int~~ |
+| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**                    | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |
+
+
+### spacy-curated-transformers.XlmrTransformer.v1
+
+Construct an XLM-RoBERTa transformer model.
+
+| Name | Description |
+|--------------------------------|-----------------------------------------------------------------------------|
+| `vocab_size` | Vocabulary size. ~~int~~ |
+| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~ |
+| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~ |
+| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
+| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
+| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
+| `hidden_width` | Width of the final representations. ~~int~~ |
+| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
+| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
+| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
+| `model_max_length` | Maximum length of model inputs. ~~int~~ |
+| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
+| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
+| `padding_idx` | Index of the padding meta-token. ~~int~~ |
+| `type_vocab_size` | Type vocabulary size. ~~int~~ |
+| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**                    | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |
+
+
+### spacy-curated-transformers.ScalarWeight.v1
+
+Construct a model that accepts a list of transformer layer outputs and returns
+a weighted representation of those outputs.
+
+| Name | Description |
+|----------------------|-------------------------------------------------------------------------------|
+| `num_layers` | Number of transformer hidden layers. ~~int~~ |
+| `dropout_prob` | Dropout probability. ~~float~~ |
+| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
+| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
+| **CREATES**          | The model using the architecture. ~~Model[ScalarWeightInT, ScalarWeightOutT]~~ |
+
+### spacy-curated-transformers.TransformerLayersListener.v1
+
+Construct a listener layer that communicates with one or more upstream Transformer
+components. This layer extracts the output of the last transformer layer and performs
+pooling over the individual pieces of each Doc token, returning their corresponding
+representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component.
+
+In almost all cases, the wildcard string will suffice as there'll only be one
+upstream Transformer component. But in certain situations, e.g. if you have
+disjoint datasets for certain tasks, or if you'd like to use a pre-trained
+pipeline but a downstream task requires its own token representations, you
+could end up with more than one Transformer component in the pipeline.
+
+
+| Name | Description |
+|-----------------|------------------------------------------------------------------------------------------------------------------------|
+| `layers`        | The number of layers produced by the upstream transformer component, excluding the embedding layer. ~~int~~            |
+| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
+| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
+| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
+| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
+| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
+
+
+### spacy-curated-transformers.LastTransformerLayerListener.v1
+
+Construct a listener layer that communicates with one or more upstream Transformer
+components. This layer extracts the output of the last transformer layer and performs
+pooling over the individual pieces of each Doc token, returning their corresponding
+representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component.
+
+In almost all cases, the wildcard string will suffice as there'll only be one
+upstream Transformer component. But in certain situations, e.g. if you have
+disjoint datasets for certain tasks, or if you'd like to use a pre-trained
+pipeline but a downstream task requires its own token representations, you
+could end up with more than one Transformer component in the pipeline.
+
+| Name | Description |
+|-----------------|------------------------------------------------------------------------------------------------------------------------|
+| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
+| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
+| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
+| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
+| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
+
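+As a rough sketch, a downstream component would typically embed this listener
+in its `tok2vec` sub-network as below. The component name, the width and the
+`reduce_mean.v1` pooling layer are illustrative assumptions;
+`TransformerLayersListener.v1` is wired up analogously, with the additional
+`layers` argument.
+
+```ini
+[components.tagger.model.tok2vec]
+@architectures = "spacy-curated-transformers.LastTransformerLayerListener.v1"
+width = 768
+upstream_name = "*"
+grad_factor = 1.0
+
+[components.tagger.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+```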
+
+### spacy-curated-transformers.ScalarWeightingListener.v1
+
+Construct a listener layer that communicates with one or more upstream Transformer
+components. This layer calculates a weighted representation of all transformer layer
+outputs and performs pooling over the individual pieces of each Doc token, returning
+their corresponding representations.
+
+Requires its upstream Transformer components to return all layer outputs from
+their models. The upstream name should either be the wildcard string '*', or the name of the Transformer component.
+
+In almost all cases, the wildcard string will suffice as there'll only be one
+upstream Transformer component. But in certain situations, e.g. if you have
+disjoint datasets for certain tasks, or if you'd like to use a pre-trained
+pipeline but a downstream task requires its own token representations, you
+could end up with more than one Transformer component in the pipeline.
+
+| Name | Description |
+|-----------------|------------------------------------------------------------------------------------------------------------------------|
+| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
+| `weighting` | Model that is used to perform the weighting of the different layer outputs. ~~Model~~ |
+| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
+| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
+| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
+| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
+
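+A hedged sketch of how the scalar weighting pieces might fit together, assuming
+a 12-layer upstream transformer that is configured to return all of its layer
+outputs and a `textcat` component whose model takes a `tok2vec` sub-network
+(all names and values here are illustrative):
+
+```ini
+[components.textcat.model.tok2vec]
+@architectures = "spacy-curated-transformers.ScalarWeightingListener.v1"
+width = 768
+upstream_name = "*"
+grad_factor = 1.0
+
+[components.textcat.model.tok2vec.weighting]
+@architectures = "spacy-curated-transformers.ScalarWeight.v1"
+num_layers = 12
+dropout_prob = 0.1
+
+[components.textcat.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+```
+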
+### spacy-curated-transformers.BertWordpieceEncoder.v1
+
+Construct a WordPiece piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers. This encoder also splits each token
+on punctuation characters, as expected by most BERT models.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.ByteBpeEncoder.v1
+
+Construct a Byte-BPE piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.CamembertSentencepieceEncoder.v1
+
+Construct a SentencePiece piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers with CamemBERT post-processing applied.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.CharEncoder.v1
+
+Construct a character piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.SentencepieceEncoder.v1
+
+Construct a SentencePiece piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.WordpieceEncoder.v1
+
+Construct a WordPiece piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers. This encoder also splits each token
+on punctuation characters, as expected by most BERT models.
+
+This model must be separately initialized using an appropriate
+loader.
+
+### spacy-curated-transformers.XlmrSentencepieceEncoder.v1
+
+Construct a SentencePiece piece encoder model that accepts a list
+of token sequences or documents and returns a corresponding list
+of piece identifiers with XLM-RoBERTa post-processing applied.
+
+This model must be separately initialized using an appropriate
+loader.
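+
+Since these encoders are all constructed without a vocabulary, each of them has
+to be initialized from a pretrained model through a loader callback, for
+example the `HFPieceEncoderLoader.v1` described in the
+[curated transformer API docs](/api/curated-transformer). The sketch below is
+an assumption about how such a loader is typically attached to a pipe named
+`transformer` in the `[initialize]` block; the `piecer_loader` key and the
+`model_loaders` registry name should be checked against the pipe's
+initialization settings.
+
+```ini
+[initialize.components.transformer]
+
+[initialize.components.transformer.piecer_loader]
+@model_loaders = "spacy-curated-transformers.HFPieceEncoderLoader.v1"
+name = "xlm-roberta-base"
+```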
+
+
+
+
## Pretraining architectures {id="pretrain",source="spacy/ml/models/multi_task.py"}
The spacy `pretrain` command lets you initialize a `Tok2Vec` layer in your
@@ -519,7 +831,7 @@ objective for a Tok2Vec layer. To use this objective, make sure that the
vectors.
| Name | Description |
-| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss` | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". ~~~str~~ |
diff --git a/website/docs/api/cli.mdx b/website/docs/api/cli.mdx
index 5b4bca1ce..86a5d026d 100644
--- a/website/docs/api/cli.mdx
+++ b/website/docs/api/cli.mdx
@@ -1018,6 +1018,54 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. |
+### debug pieces {id="debug-pieces",version="3.6",tag="command"}
+
+Analyze word- or sentencepiece stats.
+
+```bash
+$ python -m spacy debug pieces [config_path] [code_path] [transformer_name]
+```
+
+| Name | Description |
+|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `config_path` | Path to config file. ~~Union[Path, str] (positional)~~ |
+| `code_path` | Path to Python file with additional code (registered functions) to be imported. ~~Union[Path, str] (option)~~ |
+| `transformer_name` | Name of the transformer pipe to gather piece statistics for (default: first transformer pipe). ~~str (option)~~ |
+| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
+| **PRINTS** | Debugging information. |
+
+
+
+
+```bash
+$ python -m spacy debug pieces ./config.cfg
+```
+
+```
+========================= Training corpus statistics =========================
+Median token length: 1.0
+Mean token length: 1.54
+Token length range: [1, 13]
+
+======================= Development corpus statistics =======================
+Median token length: 1.0
+Mean token length: 1.44
+Token length range: [1, 8]
+```
+
+
+## quantize {id="quantize",tag="command",version="3.6"}
+
+Quantize a curated transformers model to reduce its size.
+
+| Name | Description |
+|----------------|---------------------------------------------------------------------|
+| `model_path` | Model to quantize. ~~Path (positional)~~ |
+| `output_path` | Output directory to store quantized model in. ~~Path (positional)~~ |
+| `max_mse_loss` | Maximum MSE loss of quantized parameters. ~~float (option)~~ |
+| `skip_embeds` | Do not quantize embeddings. ~~bool (option)~~ |
+| `skip_linear` | Do not quantize linear layers. ~~bool (option)~~ |
+
## train {id="train",tag="command"}
Train a pipeline. Expects data in spaCy's
diff --git a/website/docs/api/curated-transformer.mdx b/website/docs/api/curated-transformer.mdx
index 6ac44cd80..2ba03a4c8 100644
--- a/website/docs/api/curated-transformer.mdx
+++ b/website/docs/api/curated-transformer.mdx
@@ -483,7 +483,6 @@ Construct a callback that initializes a character piece encoder model.
| `eos_piece` | Piece used as a end-of-sentence token. Defaults to `"[EOS]"`. ~~str~~ |
| `unk_piece` | Piece used as a stand-in for unknown tokens. Defaults to `"[UNK]"`. ~~str~~ |
| `normalize` | Unicode normalization form to use. Defaults to `"NFKC"`. ~~str~~ |
-| `vocab` | The shared vocabulary to use. ~~Optional[Vocab]~~ |
### HFPieceEncoderLoader.v1 {id="hf_pieceencoder_loader",tag="registered_function"}
@@ -531,3 +530,17 @@ Construct a callback that initializes a supported transformer model with weights
|--------|------------------------------------------|
| `path` | Path to the PyTorch checkpoint. ~~Path~~ |
+## Callbacks
+
+### gradual_transformer_unfreezing.v1 {id="gradual_transformer_unfreezing",tag="registered_function"}
+
+Construct a callback that can be used to gradually unfreeze the weights of one
+or more Transformer components during training. This can help prevent
+catastrophic forgetting during fine-tuning.
+
+
+| Name | Description |
+|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `target_pipes` | A dictionary whose keys and values correspond to the names of Transformer components and the training step at which they should be unfrozen respectively. ~~Dict[str, int]~~ |
+
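+A hedged example of how this callback might be attached to the training loop,
+assuming it is registered in the `callbacks` registry and hooked into
+`[training.before_update]` (the pipe name and step count are illustrative):
+
+```ini
+[training.before_update]
+@callbacks = "spacy-curated-transformers.gradual_transformer_unfreezing.v1"
+
+[training.before_update.target_pipes]
+transformer = 500
+```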
+