diff --git a/website/docs/api/architectures.mdx b/website/docs/api/architectures.mdx index 86151c3fc..6dda7d095 100644 --- a/website/docs/api/architectures.mdx +++ b/website/docs/api/architectures.mdx @@ -484,9 +484,9 @@ The other arguments are shared between all versions. ## Curated transformer architectures {id="curated-trf",source="https://github.com/explosion/spacy-curated-transformers/blob/main/spacy_curated_transformers/models/architectures.py"} The following architectures are provided by the package -[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). See the -[usage documentation](/usage/embeddings-transformers#transformers) for how to -integrate the architectures into your training config. +[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). +See the [usage documentation](/usage/embeddings-transformers#transformers) for +how to integrate the architectures into your training config. @@ -503,11 +503,10 @@ for details and system requirements. Construct an ALBERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `embedding_width` | Width of the embedding representations. ~~int~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | @@ -533,11 +532,10 @@ Construct an ALBERT transformer model. Construct a BERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -561,11 +559,10 @@ Construct a BERT transformer model. Construct a CamemBERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. 
~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -589,11 +586,10 @@ Construct a CamemBERT transformer model. Construct a RoBERTa transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -612,17 +608,15 @@ Construct a RoBERTa transformer model. | `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ | | **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ | - ### spacy-curated-transformers.XlmrTransformer.v1 Construct a XLM-RoBERTa transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -641,13 +635,13 @@ Construct a XLM-RoBERTa transformer model. | `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ | | **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ | - ### spacy-curated-transformers.ScalarWeight.v1 -Construct a model that accepts a list of transformer layer outputs and returns a weighted representation of the same. +Construct a model that accepts a list of transformer layer outputs and returns a +weighted representation of the same. | Name | Description | -|----------------------|-------------------------------------------------------------------------------| +| -------------------- | ----------------------------------------------------------------------------- | | `num_layers` | Number of transformer hidden layers. ~~int~~ | | `dropout_prob` | Dropout probability. ~~float~~ | | `mixed_precision` | Use mixed-precision training. 
~~bool~~ | @@ -656,137 +650,130 @@ Construct a model that accepts a list of transformer layer outputs and returns a ### spacy-curated-transformers.TransformerLayersListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer extracts the output of the last transformer layer and performs -pooling over the individual pieces of each Doc token, returning their corresponding -representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer extracts the output of the last transformer +layer and performs pooling over the individual pieces of each Doc token, +returning their corresponding representations. The upstream name should either +be the wildcard string '\*', or the name of the Transformer component. In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. - +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. | Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `layers` | The the number of layers produced by the upstream transformer component, excluding the embedding layer. ~~int~~ | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | - ### spacy-curated-transformers.LastTransformerLayerListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer extracts the output of the last transformer layer and performs -pooling over the individual pieces of each Doc token, returning their corresponding -representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer extracts the output of the last transformer +layer and performs pooling over the individual pieces of each Doc token, +returning their corresponding representations. The upstream name should either +be the wildcard string '\*', or the name of the Transformer component. 
In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. | Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | - ### spacy-curated-transformers.ScalarWeightingListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer calculates a weighted representation of all transformer layer -outputs and performs pooling over the individual pieces of each Doc token, returning -their corresponding representations. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer calculates a weighted representation of all +transformer layer outputs and performs pooling over the individual pieces of +each Doc token, returning their corresponding representations. Requires its upstream Transformer components to return all layer outputs from -their models. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +their models. The upstream name should either be the wildcard string '\*', or +the name of the Transformer component. In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. 
| Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `weighting` | Model that is used to perform the weighting of the different layer outputs. ~~Model~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy-curated-transformers.BertWordpieceEncoder.v1 -Construct a WordPiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. This encoder also splits each token -on punctuation characters, as expected by most BERT models. +Construct a WordPiece piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. This encoder +also splits each token on punctuation characters, as expected by most BERT +models. -This model must be separately initialized using an appropriate -loader. +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.ByteBpeEncoder.v1 -Construct a Byte-BPE piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. +Construct a Byte-BPE piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. -This model must be separately initialized using an appropriate -loader. +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.CamembertSentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with CamemBERT post-processing applied. -This model must be separately initialized using an appropriate -loader. +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with CamemBERT post-processing applied. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.CharEncoder.v1 -Construct a character piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. -This model must be separately initialized using an appropriate -loader. +Construct a character piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. + +This model must be separately initialized using an appropriate loader. 
### spacy-curated-transformers.SentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with CamemBERT post-processing applied. -This model must be separately initialized using an appropriate -loader. +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with CamemBERT post-processing applied. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.WordpieceEncoder.v1 -Construct a WordPiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. This encoder also splits each token -on punctuation characters, as expected by most BERT models. -This model must be separately initialized using an appropriate -loader. +Construct a WordPiece piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. This encoder +also splits each token on punctuation characters, as expected by most BERT +models. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.XlmrSentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with XLM-RoBERTa post-processing applied. - -This model must be separately initialized using an appropriate -loader. - +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with XLM-RoBERTa post-processing applied. +This model must be separately initialized using an appropriate loader. ## Pretraining architectures {id="pretrain",source="spacy/ml/models/multi_task.py"} @@ -826,7 +813,7 @@ objective for a Tok2Vec layer. To use this objective, make sure that the vectors. | Name | Description | -|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ | | `hidden_size` | Size of the hidden layer of the model. ~~int~~ | | `loss` | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". ~~~str~~ |
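
As a companion to the tables above, here is a minimal sketch of how one of the
curated transformer architectures is typically wired up in a training config,
with `piece_encoder` and `with_spans` supplied as nested blocks. This excerpt is
illustrative and not part of the diff: the `curated_transformer` factory name
and the `WithStridedSpans.v1` span callback are assumptions about the
spacy-curated-transformers package, and the remaining arguments from the tables
above would be set in the same block.

```ini
# Illustrative config.cfg excerpt -- assumed names, not taken from the diff.
[components.transformer]
factory = "curated_transformer"

[components.transformer.model]
@architectures = "spacy-curated-transformers.XlmrTransformer.v1"
# Vocabulary size of the XLM-RoBERTa base checkpoint.
vocab_size = 250002

[components.transformer.model.with_spans]
# Callback that constructs the span generator model.
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"

[components.transformer.model.piece_encoder]
# Piece encoder used to segment input tokens; as noted above, it must be
# initialized separately with an appropriate loader.
@architectures = "spacy-curated-transformers.XlmrSentencepieceEncoder.v1"
```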
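
A downstream component then consumes the transformer output through one of the
listener layers documented above. Another illustrative sketch, assuming the
built-in `tagger` factory, the `spacy.Tagger.v2` head and Thinc's
`reduce_mean.v1` pooling layer; the listener arguments (`width`, `pooling`,
`upstream_name`, `grad_factor`) are the ones listed in the table.

```ini
# Illustrative wiring of a downstream tagger onto the transformer output.
[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"

[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.LastTransformerLayerListener.v1"
# Should match the hidden width of the upstream transformer.
width = 768
# '*' matches the single upstream transformer component in the pipeline.
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```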
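
For the scalar-weighting variant, the `weighting` slot is filled with the
`ScalarWeight.v1` model described above, and, as noted, the upstream transformer
component has to be configured to return all of its layer outputs. Again a
hedged sketch rather than a definitive setup; it slots into a downstream
component's `tok2vec` block in the same way as the previous example.

```ini
# Illustrative: listener that mixes all transformer layer outputs with
# learned scalar weights before pooling.
[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.ScalarWeightingListener.v1"
width = 768
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.weighting]
@architectures = "spacy-curated-transformers.ScalarWeight.v1"
# The number of transformer hidden layers of the upstream model.
num_layers = 12
dropout_prob = 0.1

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```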