diff --git a/website/docs/api/architectures.mdx b/website/docs/api/architectures.mdx index 86151c3fc..6dda7d095 100644 --- a/website/docs/api/architectures.mdx +++ b/website/docs/api/architectures.mdx @@ -484,9 +484,9 @@ The other arguments are shared between all versions. ## Curated transformer architectures {id="curated-trf",source="https://github.com/explosion/spacy-curated-transformers/blob/main/spacy_curated_transformers/models/architectures.py"} The following architectures are provided by the package -[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). See the -[usage documentation](/usage/embeddings-transformers#transformers) for how to -integrate the architectures into your training config. +[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). +See the [usage documentation](/usage/embeddings-transformers#transformers) for +how to integrate the architectures into your training config. @@ -503,11 +503,10 @@ for details and system requirements. Construct an ALBERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `embedding_width` | Width of the embedding representations. ~~int~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | @@ -533,11 +532,10 @@ Construct an ALBERT transformer model. Construct a BERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -561,11 +559,10 @@ Construct a BERT transformer model. Construct a CamemBERT transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. 
~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -589,11 +586,10 @@ Construct a CamemBERT transformer model. Construct a RoBERTa transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -612,17 +608,15 @@ Construct a RoBERTa transformer model. | `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ | | **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ | - ### spacy-curated-transformers.XlmrTransformer.v1 Construct a XLM-RoBERTa transformer model. | Name | Description | -|--------------------------------|-----------------------------------------------------------------------------| +| ------------------------------ | --------------------------------------------------------------------------- | | `vocab_size` | Vocabulary size. ~~int~~ | | `with_spans` | Callback that constructs a span generator model. ~~Callable~~ | -| `with_spans` | piece_encoder (Model) ~~Callable~~ | -| `with_spans` | The piece encoder to segment input tokens. ~~Callable~~ | +| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ | | `attention_probs_dropout_prob` | Dropout probabilty of the self-attention layers. ~~float~~ | | `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ | | `hidden_dropout_prob` | Dropout probabilty of the point-wise feed-forward and ~~float~~ | @@ -641,13 +635,13 @@ Construct a XLM-RoBERTa transformer model. | `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ | | **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ | - ### spacy-curated-transformers.ScalarWeight.v1 -Construct a model that accepts a list of transformer layer outputs and returns a weighted representation of the same. +Construct a model that accepts a list of transformer layer outputs and returns a +weighted representation of the same. | Name | Description | -|----------------------|-------------------------------------------------------------------------------| +| -------------------- | ----------------------------------------------------------------------------- | | `num_layers` | Number of transformer hidden layers. ~~int~~ | | `dropout_prob` | Dropout probability. ~~float~~ | | `mixed_precision` | Use mixed-precision training. 
~~bool~~ | @@ -656,137 +650,130 @@ Construct a model that accepts a list of transformer layer outputs and returns a ### spacy-curated-transformers.TransformerLayersListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer extracts the output of the last transformer layer and performs -pooling over the individual pieces of each Doc token, returning their corresponding -representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer extracts the output of the last transformer +layer and performs pooling over the individual pieces of each Doc token, +returning their corresponding representations. The upstream name should either +be the wildcard string '\*', or the name of the Transformer component. In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. - +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. | Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `layers` | The the number of layers produced by the upstream transformer component, excluding the embedding layer. ~~int~~ | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | - ### spacy-curated-transformers.LastTransformerLayerListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer extracts the output of the last transformer layer and performs -pooling over the individual pieces of each Doc token, returning their corresponding -representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer extracts the output of the last transformer +layer and performs pooling over the individual pieces of each Doc token, +returning their corresponding representations. The upstream name should either +be the wildcard string '\*', or the name of the Transformer component. 
In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. | Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | - ### spacy-curated-transformers.ScalarWeightingListener.v1 -Construct a listener layer that communicates with one or more upstream Transformer -components. This layer calculates a weighted representation of all transformer layer -outputs and performs pooling over the individual pieces of each Doc token, returning -their corresponding representations. +Construct a listener layer that communicates with one or more upstream +Transformer components. This layer calculates a weighted representation of all +transformer layer outputs and performs pooling over the individual pieces of +each Doc token, returning their corresponding representations. Requires its upstream Transformer components to return all layer outputs from -their models. The upstream name should either be the wildcard string '*', or the name of the Transformer component. +their models. The upstream name should either be the wildcard string '\*', or +the name of the Transformer component. In almost all cases, the wildcard string will suffice as there'll only be one -upstream Transformer component. But in certain situations, e.g: you have disjoint -datasets for certain tasks, or you'd like to use a pre-trained pipeline but a -downstream task requires its own token representations, you could end up with -more than one Transformer component in the pipeline. +upstream Transformer component. But in certain situations, e.g: you have +disjoint datasets for certain tasks, or you'd like to use a pre-trained pipeline +but a downstream task requires its own token representations, you could end up +with more than one Transformer component in the pipeline. 
| Name | Description | -|-----------------|------------------------------------------------------------------------------------------------------------------------| +| --------------- | ---------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ | | `weighting` | Model that is used to perform the weighting of the different layer outputs. ~~Model~~ | | `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ | -| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | +| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ | | `grad_factor` | Factor to multiply gradients with. ~~float~~ | | **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy-curated-transformers.BertWordpieceEncoder.v1 -Construct a WordPiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. This encoder also splits each token -on punctuation characters, as expected by most BERT models. +Construct a WordPiece piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. This encoder +also splits each token on punctuation characters, as expected by most BERT +models. -This model must be separately initialized using an appropriate -loader. +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.ByteBpeEncoder.v1 -Construct a Byte-BPE piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. +Construct a Byte-BPE piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. -This model must be separately initialized using an appropriate -loader. +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.CamembertSentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with CamemBERT post-processing applied. -This model must be separately initialized using an appropriate -loader. +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with CamemBERT post-processing applied. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.CharEncoder.v1 -Construct a character piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. -This model must be separately initialized using an appropriate -loader. +Construct a character piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. + +This model must be separately initialized using an appropriate loader. 
### spacy-curated-transformers.SentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with CamemBERT post-processing applied. -This model must be separately initialized using an appropriate -loader. +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with CamemBERT post-processing applied. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.WordpieceEncoder.v1 -Construct a WordPiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers. This encoder also splits each token -on punctuation characters, as expected by most BERT models. -This model must be separately initialized using an appropriate -loader. +Construct a WordPiece piece encoder model that accepts a list of token sequences +or documents and returns a corresponding list of piece identifiers. This encoder +also splits each token on punctuation characters, as expected by most BERT +models. + +This model must be separately initialized using an appropriate loader. ### spacy-curated-transformers.XlmrSentencepieceEncoder.v1 -Construct a SentencePiece piece encoder model that accepts a list -of token sequences or documents and returns a corresponding list -of piece identifiers with XLM-RoBERTa post-processing applied. - -This model must be separately initialized using an appropriate -loader. - +Construct a SentencePiece piece encoder model that accepts a list of token +sequences or documents and returns a corresponding list of piece identifiers +with XLM-RoBERTa post-processing applied. +This model must be separately initialized using an appropriate loader. ## Pretraining architectures {id="pretrain",source="spacy/ml/models/multi_task.py"} @@ -826,7 +813,7 @@ objective for a Tok2Vec layer. To use this objective, make sure that the vectors. | Name | Description | -|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ | | `hidden_size` | Size of the hidden layer of the model. ~~int~~ | | `loss` | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". ~~~str~~ |
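
As a companion to the tables above, here is a minimal sketch of how one of the
curated transformer architectures is typically wired up in a training config,
with `piece_encoder` and `with_spans` supplied as nested blocks. This excerpt is
illustrative and not part of the diff: the `curated_transformer` factory name
and the `WithStridedSpans.v1` span callback are assumptions about the
spacy-curated-transformers package, and the remaining arguments from the tables
above would be set in the same block.

```ini
# Illustrative config.cfg excerpt -- assumed names, not taken from the diff.
[components.transformer]
factory = "curated_transformer"

[components.transformer.model]
@architectures = "spacy-curated-transformers.XlmrTransformer.v1"
# Vocabulary size of the XLM-RoBERTa base checkpoint.
vocab_size = 250002

[components.transformer.model.with_spans]
# Callback that constructs the span generator model.
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"

[components.transformer.model.piece_encoder]
# Piece encoder used to segment input tokens; as noted above, it must be
# initialized separately with an appropriate loader.
@architectures = "spacy-curated-transformers.XlmrSentencepieceEncoder.v1"
```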
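
A downstream component then consumes the transformer output through one of the
listener layers documented above. Another illustrative sketch, assuming the
built-in `tagger` factory, the `spacy.Tagger.v2` head and Thinc's
`reduce_mean.v1` pooling layer; the listener arguments (`width`, `pooling`,
`upstream_name`, `grad_factor`) are the ones listed in the table.

```ini
# Illustrative wiring of a downstream tagger onto the transformer output.
[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"

[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.LastTransformerLayerListener.v1"
# Should match the hidden width of the upstream transformer.
width = 768
# '*' matches the single upstream transformer component in the pipeline.
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```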
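
For the scalar-weighting variant, the `weighting` slot is filled with the
`ScalarWeight.v1` model described above, and, as noted, the upstream transformer
component has to be configured to return all of its layer outputs. Again a
hedged sketch rather than a definitive setup; it slots into a downstream
component's `tok2vec` block in the same way as the previous example.

```ini
# Illustrative: listener that mixes all transformer layer outputs with
# learned scalar weights before pooling.
[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.ScalarWeightingListener.v1"
width = 768
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.weighting]
@architectures = "spacy-curated-transformers.ScalarWeight.v1"
# The number of transformer hidden layers of the upstream model.
num_layers = 12
dropout_prob = 0.1

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```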