initial documentation run

This commit is contained in:
vinit 2023-05-26 11:46:34 +02:00
parent 1cbad4f3c9
commit a633b88ef2
3 changed files with 375 additions and 2 deletions

View File

@ -481,6 +481,318 @@ The other arguments are shared between all versions.
</Accordion>
## Curated transformer architectures {id="curated-trf",source="https://github.com/explosion/spacy-curated-transformers/blob/main/spacy_curated_transformers/models/architectures.py"}
The following architectures are provided by the package
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers). See the
[usage documentation](/usage/embeddings-transformers#transformers) for how to
integrate the architectures into your training config.
<Infobox variant="warning">
Note that in order to use these architectures in your config, you need to
install the
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers) package. See the
[installation docs](/usage/embeddings-transformers#transformers-installation)
for details and system requirements.
</Infobox>
### spacy-curated-transformers.AlbertTransformer.v1
Construct an ALBERT transformer model.
| Name | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~                |
| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~                         |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~                  |
| `embedding_width`              | Width of the embedding representations. ~~int~~                              |
| `hidden_act`                   | Activation used by the point-wise feed-forward layers. ~~str~~               |
| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width`                 | Width of the final representations. ~~int~~                                  |
| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_groups` | Number of layer groups whose constituents share parameters. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ |
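To make the integration concrete, the following is a minimal sketch of how such
an architecture can be referenced from a training config. The
`curated_transformer` factory name, the `WithStridedSpans.v1` span generator,
the choice of piece encoder and the hyperparameter values shown here are
illustrative assumptions; see the usage documentation linked above for the
canonical setup.
```ini
# Minimal sketch only: factory name, span generator, piece encoder and values are assumptions
[components.transformer]
factory = "curated_transformer"

[components.transformer.model]
@architectures = "spacy-curated-transformers.AlbertTransformer.v1"
vocab_size = 30000
hidden_width = 768
# ... remaining hyperparameters from the table above ...

[components.transformer.model.with_spans]
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"

[components.transformer.model.piece_encoder]
@architectures = "spacy-curated-transformers.SentencepieceEncoder.v1"
```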
### spacy-curated-transformers.BertTransformer.v1
Construct a BERT transformer model.
| Name | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~                |
| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~                         |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~                  |
| `hidden_act`                   | Activation used by the point-wise feed-forward layers. ~~str~~               |
| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width`                 | Width of the final representations. ~~int~~                                  |
| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ |
### spacy-curated-transformers.CamembertTransformer.v1
Construct a CamemBERT transformer model.
| Name | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~                |
| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~                         |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~                  |
| `hidden_act`                   | Activation used by the point-wise feed-forward layers. ~~str~~               |
| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width`                 | Width of the final representations. ~~int~~                                  |
| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ |
### spacy-curated-transformers.RobertaTransformer.v1
Construct a RoBERTa transformer model.
| Name | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~                |
| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~                         |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~                  |
| `hidden_act`                   | Activation used by the point-wise feed-forward layers. ~~str~~               |
| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width`                 | Width of the final representations. ~~int~~                                  |
| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ |
### spacy-curated-transformers.XlmrTransformer.v1
Construct an XLM-RoBERTa transformer model.
| Name | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans`                   | Callback that constructs a span generator model. ~~Callable~~                |
| `piece_encoder`                | The piece encoder to segment input tokens. ~~Model~~                         |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~                  |
| `hidden_act`                   | Activation used by the point-wise feed-forward layers. ~~str~~               |
| `hidden_dropout_prob`          | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width`                 | Width of the final representations. ~~int~~                                  |
| `intermediate_width`           | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[TransformerInT, TransformerOutT]~~ |
### spacy-curated-transformers.ScalarWeight.v1
Construct a model that accepts a list of transformer layer outputs and returns a weighted representation of the same.
| Name | Description |
|----------------------|-------------------------------------------------------------------------------|
| `num_layers` | Number of transformer hidden layers. ~~int~~ |
| `dropout_prob` | Dropout probability. ~~float~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture ~~Model[ScalarWeightInT, ScalarWeightOutT]~~ |
### spacy-curated-transformers.TransformerLayersListener.v1
Construct a listener layer that communicates with one or more upstream Transformer
components. This layer extracts the output of the last transformer layer and performs
pooling over the individual pieces of each Doc token, returning their corresponding
representations. The upstream name should either be the wildcard string '*' or the name of the Transformer component.
In almost all cases, the wildcard string will suffice, as there will usually be
only one upstream Transformer component. But in certain situations, e.g. when
you have disjoint datasets for certain tasks, or when you'd like to use a
pre-trained pipeline but a downstream task requires its own token
representations, you could end up with more than one Transformer component in
the pipeline.
| Name | Description |
|-----------------|------------------------------------------------------------------------------------------------------------------------|
| `layers`        | The number of layers produced by the upstream transformer component, excluding the embedding layer. ~~int~~             |
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy-curated-transformers.LastTransformerLayerListener.v1
Construct a listener layer that communicates with one or more upstream Transformer
components. This layer extracts the output of the last transformer layer and performs
pooling over the individual pieces of each Doc token, returning their corresponding
representations. The upstream name should either be the wildcard string '*' or the name of the Transformer component.
In almost all cases, the wildcard string will suffice, as there will usually be
only one upstream Transformer component. But in certain situations, e.g. when
you have disjoint datasets for certain tasks, or when you'd like to use a
pre-trained pipeline but a downstream task requires its own token
representations, you could end up with more than one Transformer component in
the pipeline.
| Name | Description |
|-----------------|------------------------------------------------------------------------------------------------------------------------|
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
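As an illustration, a downstream component such as a tagger can use this
listener as its tok2vec sub-model. The following is a minimal sketch, assuming
a curated transformer component is already present in the pipeline and that the
generic `reduce_mean.v1` layer is used for pooling; the values shown are
illustrative, not recommendations.
```ini
# Sketch: a tagger whose tok2vec layer listens to an upstream curated transformer.
# The pooling layer and all values shown are illustrative assumptions.
[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"

[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.LastTransformerLayerListener.v1"
width = 768
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```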
### spacy-curated-transformers.ScalarWeightingListener.v1
Construct a listener layer that communicates with one or more upstream Transformer
components. This layer calculates a weighted representation of all transformer layer
outputs and performs pooling over the individual pieces of each Doc token, returning
their corresponding representations.
Requires its upstream Transformer components to return all layer outputs from
their models. The upstream name should either be the wildcard string '*' or the name of the Transformer component.
In almost all cases, the wildcard string will suffice, as there will usually be
only one upstream Transformer component. But in certain situations, e.g. when
you have disjoint datasets for certain tasks, or when you'd like to use a
pre-trained pipeline but a downstream task requires its own token
representations, you could end up with more than one Transformer component in
the pipeline.
| Name | Description |
|-----------------|------------------------------------------------------------------------------------------------------------------------|
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `weighting` | Model that is used to perform the weighting of the different layer outputs. ~~Model~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |
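The `weighting` sub-model is typically the `ScalarWeight.v1` architecture
described earlier on this page. A sketch of just the listener block, with
illustrative values, might look as follows; as noted above, this also assumes
the upstream transformer is configured to return all of its layer outputs.
```ini
# Sketch of the scalar-weighting listener; values are illustrative assumptions
[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.ScalarWeightingListener.v1"
width = 768
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.weighting]
@architectures = "spacy-curated-transformers.ScalarWeight.v1"
num_layers = 12
dropout_prob = 0.1

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```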
### spacy-curated-transformers.BertWordpieceEncoder.v1
Construct a WordPiece piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers. This encoder also splits each token
on punctuation characters, as expected by most BERT models.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.ByteBpeEncoder.v1
Construct a Byte-BPE piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.CamembertSentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers with CamemBERT post-processing applied.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.CharEncoder.v1
Construct a character piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.SentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.WordpieceEncoder.v1
Construct a WordPiece piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers. This encoder also splits each token
on punctuation characters, as expected by most BERT models.
This model must be separately initialized using an appropriate
loader.
### spacy-curated-transformers.XlmrSentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list
of token sequences or documents and returns a corresponding list
of piece identifiers with XLM-RoBERTa post-processing applied.
This model must be separately initialized using an appropriate
loader.
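In practice, "separately initialized using an appropriate loader" means wiring
a loader into the `[initialize]` block of the config. The following sketch
assumes the `HFPieceEncoderLoader.v1` registered function documented further
down in this commit; the `piecer_loader` key and the model name are
illustrative assumptions.
```ini
# Sketch: initializing a piece encoder from a Hugging Face tokenizer.
# The "piecer_loader" key, registry name and model name are assumptions.
[initialize.components.transformer.piecer_loader]
@model_loaders = "spacy-curated-transformers.HFPieceEncoderLoader.v1"
name = "xlm-roberta-base"
```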
## Pretraining architectures {id="pretrain",source="spacy/ml/models/multi_task.py"}
The `spacy pretrain` command lets you initialize a `Tok2Vec` layer in your
@ -519,7 +831,7 @@ objective for a Tok2Vec layer. To use this objective, make sure that the
vectors.
| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss`          | The loss function can be either "cosine" or "L2". We typically recommend using "cosine". ~~str~~                                                           |

View File

@ -1018,6 +1018,54 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. |
### debug pieces {id="debug-pieces",version="3.6",tag="command"}
Analyze word- or sentencepiece stats.
```bash
$ python -m spacy debug pieces [config_path] [code_path] [transformer_name]
```
| Name | Description |
|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `config_path` | Path to config file. ~~Union[Path, str] (positional)~~ |
| `code_path` | Path to Python file with additional code (registered functions) to be imported. ~~Union[Path, str] (option)~~ |
| `transformer_name` | Name of the transformer pipe to gather piece statistics for (default: first transformer pipe). ~~str (option)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. |
<Accordion title="Example outputs" spaced>
```bash
$ python -m spacy debug pieces ./config.cfg
```
```
========================= Training corpus statistics =========================
Median token length: 1.0
Mean token length: 1.54
Token length range: [1, 13]
======================= Development corpus statistics =======================
Median token length: 1.0
Mean token length: 1.44
Token length range: [1, 8]
```
</Accordion>
## quantize {id="quantize",tag="command",version="3.6"}
Quantize a curated transformers model to reduce its size.
| Name | Description |
|----------------|---------------------------------------------------------------------|
| `model_path` | Model to quantize. ~~Path (positional)~~ |
| `output_path` | Output directory to store quantized model in. ~~Path (positional)~~ |
| `max_mse_loss` | Maximum MSE loss of quantized parameters. ~~float (option)~~ |
| `skip_embeds` | Do not quantize embeddings. ~~bool (option)~~ |
| `skip_linear` | Do not quantize linear layers. ~~bool (option)~~ |
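For consistency with the other CLI entries on this page, the invocation would
look roughly as follows; the exact option spellings are inferred from the
parameter table above rather than confirmed.
```bash
$ python -m spacy quantize [model_path] [output_path] [--max-mse-loss] [--skip-embeds] [--skip-linear]
```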
## train {id="train",tag="command"}
Train a pipeline. Expects data in spaCy's

View File

@ -483,7 +483,6 @@ Construct a callback that initializes a character piece encoder model.
| `eos_piece`  | Piece used as an end-of-sentence token. Defaults to `"[EOS]"`. ~~str~~       |
| `unk_piece` | Piece used as a stand-in for unknown tokens. Defaults to `"[UNK]"`. ~~str~~ |
| `normalize` | Unicode normalization form to use. Defaults to `"NFKC"`. ~~str~~ |
| `vocab` | The shared vocabulary to use. ~~Optional[Vocab]~~ |
### HFPieceEncoderLoader.v1 {id="hf_pieceencoder_loader",tag="registered_function"}
@ -531,3 +530,17 @@ Construct a callback that initializes a supported transformer model with weights
|--------|------------------------------------------|
| `path` | Path to the PyTorch checkpoint. ~~Path~~ |
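As with the piece encoder loaders, this checkpoint loader is wired into the
`[initialize]` block. A minimal sketch; the fully qualified registry string
(`spacy-curated-transformers.PyTorchCheckpointLoader.v1`) and the
`encoder_loader` key are assumptions.
```ini
# Sketch: loading encoder weights from a local PyTorch checkpoint.
# The registry string and the "encoder_loader" key are assumptions.
[initialize.components.transformer.encoder_loader]
@model_loaders = "spacy-curated-transformers.PyTorchCheckpointLoader.v1"
path = "/path/to/checkpoint/pytorch_model.bin"
```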
## Callbacks
### gradual_transformer_unfreezing.v1 {id="gradual_transformer_unfreezing",tag="registered_function"}
Construct a callback that can be used to gradually unfreeze the
weights of one or more Transformer components during training. This
helps prevent catastrophic forgetting during fine-tuning.
| Name | Description |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `target_pipes` | A dictionary whose keys and values correspond to the names of Transformer components and the training step at which they should be unfrozen respectively. ~~Dict[str, int]~~ |
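For example, unfreezing a single component named `transformer` after a fixed
number of steps might be configured as in the following sketch; the fully
qualified registry string and the step count are assumptions.
```ini
# Sketch: gradually unfreeze the "transformer" component after 1000 training steps.
# The registry string and step value are illustrative assumptions.
[training.callbacks]
@callbacks = "spacy-curated-transformers.gradual_transformer_unfreezing.v1"

[training.callbacks.target_pipes]
transformer = 1000
```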