From a633b88ef236fd99c26d3ea7aefb2f7809a01fbf Mon Sep 17 00:00:00 2001
From: vinit
Date: Fri, 26 May 2023 11:46:34 +0200
Subject: [PATCH] initial documentation run

---
 website/docs/api/architectures.mdx       | 314 ++++++++++++++++++++++-
 website/docs/api/cli.mdx                 |  48 ++++
 website/docs/api/curated-transformer.mdx |  15 +-
 3 files changed, 375 insertions(+), 2 deletions(-)

diff --git a/website/docs/api/architectures.mdx b/website/docs/api/architectures.mdx
index 268c04a07..cb217af15 100644
--- a/website/docs/api/architectures.mdx
+++ b/website/docs/api/architectures.mdx
@@ -481,6 +481,318 @@ The other arguments are shared between all versions.

## Curated transformer architectures {id="curated-trf",source="https://github.com/explosion/spacy-curated-transformers/blob/main/spacy_curated_transformers/models/architectures.py"}

The following architectures are provided by the package
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers).
See the [usage documentation](/usage/embeddings-transformers#transformers) for
how to integrate the architectures into your training config.

Note that in order to use these architectures in your config, you need to
install the
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers)
package. See the
[installation docs](/usage/embeddings-transformers#transformers-installation)
for details and system requirements.

### spacy-curated-transformers.AlbertTransformer.v1

Construct an ALBERT transformer model.

| Name | Description |
| --- | --- |
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans` | Callback that constructs a span generator model. ~~Callable~~ |
| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
| `embedding_width` | Width of the embedding representations. ~~int~~ |
| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
| `hidden_dropout_prob` | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width` | Width of the final representations. ~~int~~ |
| `intermediate_width` | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_groups` | Number of layer groups whose constituents share parameters. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |

### spacy-curated-transformers.BertTransformer.v1

Construct a BERT transformer model.
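As a sketch of how this architecture might be wired into a training config: the hyperparameter values below are illustrative (roughly those of a BERT-base checkpoint) rather than prescribed defaults, the component name `transformer` is an assumption, and the sub-sections pair the `BertWordpieceEncoder.v1` encoder from this package with a `WithStridedSpans.v1` span generator (also assumed to be provided by the package).

```ini
# Illustrative hyperparameters, not prescribed defaults.
[components.transformer.model]
@architectures = "spacy-curated-transformers.BertTransformer.v1"
vocab_size = 30522
hidden_width = 768
intermediate_width = 3072
num_attention_heads = 12
num_hidden_layers = 12
attention_probs_dropout_prob = 0.1
hidden_dropout_prob = 0.1
hidden_act = "gelu"
layer_norm_eps = 1e-12
max_position_embeddings = 512
model_max_length = 512
padding_idx = 0
type_vocab_size = 2
torchscript = false
mixed_precision = false

[components.transformer.model.piece_encoder]
@architectures = "spacy-curated-transformers.BertWordpieceEncoder.v1"

[components.transformer.model.with_spans]
@architectures = "spacy-curated-transformers.WithStridedSpans.v1"
```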

| Name | Description |
| --- | --- |
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans` | Callback that constructs a span generator model. ~~Callable~~ |
| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
| `hidden_dropout_prob` | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width` | Width of the final representations. ~~int~~ |
| `intermediate_width` | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |

### spacy-curated-transformers.CamembertTransformer.v1

Construct a CamemBERT transformer model.

| Name | Description |
| --- | --- |
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans` | Callback that constructs a span generator model. ~~Callable~~ |
| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
| `hidden_dropout_prob` | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width` | Width of the final representations. ~~int~~ |
| `intermediate_width` | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |

### spacy-curated-transformers.RobertaTransformer.v1

Construct a RoBERTa transformer model.

| Name | Description |
| --- | --- |
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans` | Callback that constructs a span generator model. ~~Callable~~ |
| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
| `hidden_dropout_prob` | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width` | Width of the final representations. ~~int~~ |
| `intermediate_width` | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |

### spacy-curated-transformers.XlmrTransformer.v1

Construct an XLM-RoBERTa transformer model.

| Name | Description |
| --- | --- |
| `vocab_size` | Vocabulary size. ~~int~~ |
| `with_spans` | Callback that constructs a span generator model. ~~Callable~~ |
| `piece_encoder` | The piece encoder to segment input tokens. ~~Model~~ |
| `attention_probs_dropout_prob` | Dropout probability of the self-attention layers. ~~float~~ |
| `hidden_act` | Activation used by the point-wise feed-forward layers. ~~str~~ |
| `hidden_dropout_prob` | Dropout probability of the point-wise feed-forward and embedding layers. ~~float~~ |
| `hidden_width` | Width of the final representations. ~~int~~ |
| `intermediate_width` | Width of the intermediate projection layer in the point-wise feed-forward layer. ~~int~~ |
| `layer_norm_eps` | Epsilon for layer normalization. ~~float~~ |
| `max_position_embeddings` | Maximum length of position embeddings. ~~int~~ |
| `model_max_length` | Maximum length of model inputs. ~~int~~ |
| `num_attention_heads` | Number of self-attention heads. ~~int~~ |
| `num_hidden_layers` | Number of hidden layers. ~~int~~ |
| `padding_idx` | Index of the padding meta-token. ~~int~~ |
| `type_vocab_size` | Type vocabulary size. ~~int~~ |
| `torchscript` | Set to `True` when loading TorchScript models, `False` otherwise. ~~bool~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[TransformerInT, TransformerOutT]~~ |

### spacy-curated-transformers.ScalarWeight.v1

Construct a model that accepts a list of transformer layer outputs and returns
a weighted representation of the same.

| Name | Description |
| --- | --- |
| `num_layers` | Number of transformer hidden layers. ~~int~~ |
| `dropout_prob` | Dropout probability. ~~float~~ |
| `mixed_precision` | Use mixed-precision training. ~~bool~~ |
| `grad_scaler_config` | Configuration passed to the PyTorch gradient scaler. ~~dict~~ |
| **CREATES** | The model using the architecture. ~~Model[ScalarWeightInT, ScalarWeightOutT]~~ |

### spacy-curated-transformers.TransformerLayersListener.v1

Construct a listener layer that communicates with one or more upstream
Transformer components. This layer extracts the output of the last transformer
layer and performs pooling over the individual pieces of each Doc token,
returning their corresponding representations. The upstream name should either
be the wildcard string '*' or the name of the Transformer component.

In almost all cases, the wildcard string will suffice as there'll only be one
upstream Transformer component. But in certain situations, e.g. when you have
disjoint datasets for certain tasks, or when you'd like to use a pre-trained
pipeline but a downstream task requires its own token representations, you
could end up with more than one Transformer component in the pipeline.

| Name | Description |
| --- | --- |
| `layers` | The number of layers produced by the upstream transformer component, excluding the embedding layer. ~~int~~ |
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |

### spacy-curated-transformers.LastTransformerLayerListener.v1

Construct a listener layer that communicates with one or more upstream
Transformer components. This layer extracts the output of the last transformer
layer and performs pooling over the individual pieces of each Doc token,
returning their corresponding representations. The upstream name should either
be the wildcard string '*' or the name of the Transformer component.

In almost all cases, the wildcard string will suffice as there'll only be one
upstream Transformer component. But in certain situations, e.g. when you have
disjoint datasets for certain tasks, or when you'd like to use a pre-trained
pipeline but a downstream task requires its own token representations, you
could end up with more than one Transformer component in the pipeline.
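As a sketch of how a downstream component might consume this listener, assuming a `tagger` component, a base-sized (768-width) upstream transformer and mean pooling over the wordpieces via Thinc's `reduce_mean.v1` layer:

```ini
[components.tagger.model.tok2vec]
@architectures = "spacy-curated-transformers.LastTransformerLayerListener.v1"
width = 768
upstream_name = "*"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```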

| Name | Description |
| --- | --- |
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |

### spacy-curated-transformers.ScalarWeightingListener.v1

Construct a listener layer that communicates with one or more upstream
Transformer components. This layer calculates a weighted representation of all
transformer layer outputs and performs pooling over the individual pieces of
each Doc token, returning their corresponding representations.

Requires its upstream Transformer components to return all layer outputs from
their models. The upstream name should either be the wildcard string '*' or
the name of the Transformer component.

In almost all cases, the wildcard string will suffice as there'll only be one
upstream Transformer component. But in certain situations, e.g. when you have
disjoint datasets for certain tasks, or when you'd like to use a pre-trained
pipeline but a downstream task requires its own token representations, you
could end up with more than one Transformer component in the pipeline.

| Name | Description |
| --- | --- |
| `width` | The width of the vectors produced by the upstream transformer component. ~~int~~ |
| `weighting` | Model that is used to perform the weighting of the different layer outputs. ~~Model~~ |
| `pooling` | Model that is used to perform pooling over the piece representations. ~~Model~~ |
| `upstream_name` | A string to identify the 'upstream' Transformer component to communicate with. ~~str~~ |
| `grad_factor` | Factor to multiply gradients with. ~~float~~ |
| **CREATES** | A model that returns the relevant vectors from an upstream transformer component. ~~Model[List[Doc], List[Floats2d]]~~ |

### spacy-curated-transformers.BertWordpieceEncoder.v1

Construct a WordPiece piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers.
This encoder also splits each token on punctuation characters, as expected by
most BERT models.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.ByteBpeEncoder.v1

Construct a Byte-BPE piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.CamembertSentencepieceEncoder.v1

Construct a SentencePiece piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers
with CamemBERT post-processing applied.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.CharEncoder.v1

Construct a character piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.SentencepieceEncoder.v1

Construct a SentencePiece piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.WordpieceEncoder.v1

Construct a WordPiece piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers.
This encoder also splits each token on punctuation characters, as expected by
most BERT models.

This model must be separately initialized using an appropriate loader.

### spacy-curated-transformers.XlmrSentencepieceEncoder.v1

Construct a SentencePiece piece encoder model that accepts a list of token
sequences or documents and returns a corresponding list of piece identifiers
with XLM-RoBERTa post-processing applied.

This model must be separately initialized using an appropriate loader.
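As an example of how one of the encoders above might be plugged into a curated transformer model and initialized, the sketch below pairs the CamemBERT encoder with the `HFPieceEncoderLoader.v1` documented in `curated-transformer.mdx` later in this patch. The component name `transformer`, the `piecer_loader` key and the `model_loaders` registry name are assumptions for illustration rather than values taken from this patch.

```ini
[components.transformer.model.piece_encoder]
@architectures = "spacy-curated-transformers.CamembertSentencepieceEncoder.v1"

# Assumed initialization block: the loader pulls the sentencepiece model
# for "camembert-base" from the Hugging Face Hub.
[initialize.components.transformer.piecer_loader]
@model_loaders = "spacy-curated-transformers.HFPieceEncoderLoader.v1"
name = "camembert-base"
```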

## Pretraining architectures {id="pretrain",source="spacy/ml/models/multi_task.py"}

The spacy `pretrain` command lets you initialize a `Tok2Vec` layer in your

@@ -519,7 +831,7 @@ objective for a Tok2Vec layer. To use this objective, make sure that the
 vectors.

 | Name | Description |
-| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
 | `hidden_size`   | Size of the hidden layer of the model. ~~int~~ |
 | `loss`          | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". ~~~str~~ |

diff --git a/website/docs/api/cli.mdx b/website/docs/api/cli.mdx
index 5b4bca1ce..86a5d026d 100644
--- a/website/docs/api/cli.mdx
+++ b/website/docs/api/cli.mdx
@@ -1018,6 +1018,54 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
 | overrides  | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
 | **PRINTS** | Debugging information. |

### debug pieces {id="debug-pieces",version="3.6",tag="command"}

Analyze word- or sentencepiece stats.

```bash
$ python -m spacy debug pieces [config_path] [code_path] [transformer_name]
```

| Name | Description |
| --- | --- |
| `config_path` | Path to config file. ~~Union[Path, str] (positional)~~ |
| `code_path` | Path to Python file with additional code (registered functions) to be imported. ~~Union[Path, str] (option)~~ |
| `transformer_name` | Name of the transformer pipe to gather piece statistics for (default: first transformer pipe). ~~str (option)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. |

```bash
$ python -m spacy debug pieces ./config.cfg
```

```
========================= Training corpus statistics =========================
Median token length: 1.0
Mean token length: 1.54
Token length range: [1, 13]

======================= Development corpus statistics =======================
Median token length: 1.0
Mean token length: 1.44
Token length range: [1, 8]
```

## quantize {id="quantize",tag="command",version="3.6"}

Quantize a curated transformer model to reduce its size.

| Name | Description |
| --- | --- |
| `model_path` | Model to quantize. ~~Path (positional)~~ |
| `output_path` | Output directory to store quantized model in. ~~Path (positional)~~ |
| `max_mse_loss` | Maximum MSE loss of quantized parameters. ~~float (option)~~ |
| `skip_embeds` | Do not quantize embeddings. ~~bool (option)~~ |
| `skip_linear` | Do not quantize linear layers. ~~bool (option)~~ |

## train {id="train",tag="command"}

Train a pipeline. Expects data in spaCy's

diff --git a/website/docs/api/curated-transformer.mdx b/website/docs/api/curated-transformer.mdx
index 6ac44cd80..2ba03a4c8 100644
--- a/website/docs/api/curated-transformer.mdx
+++ b/website/docs/api/curated-transformer.mdx
@@ -483,7 +483,6 @@ Construct a callback that initializes a character piece encoder model.
 | `eos_piece` | Piece used as a end-of-sentence token. Defaults to `"[EOS]"`. ~~str~~ |
 | `unk_piece` | Piece used as a stand-in for unknown tokens. Defaults to `"[UNK]"`. ~~str~~ |
 | `normalize` | Unicode normalization form to use. Defaults to `"NFKC"`. ~~str~~ |
-| `vocab`     | The shared vocabulary to use. ~~Optional[Vocab]~~ |

 ### HFPieceEncoderLoader.v1 {id="hf_pieceencoder_loader",tag="registered_function"}

@@ -531,3 +530,17 @@ Construct a callback that initializes a supported transformer model with weights
 |--------|------------------------------------------|
 | `path` | Path to the PyTorch checkpoint. ~~Path~~ |

## Callbacks

### gradual_transformer_unfreezing.v1 {id="gradual_transformer_unfreezing",tag="registered_function"}

Construct a callback that can be used to gradually unfreeze the weights of one
or more Transformer components during training. This helps prevent
catastrophic forgetting during fine-tuning.

| Name | Description |
| --- | --- |
| `target_pipes` | A dictionary whose keys and values correspond to the names of Transformer components and the training step at which they should be unfrozen, respectively. ~~Dict[str, int]~~ |
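
A sketch of how this callback might be attached in a training config. The `[training.before_update]` hook and the fully qualified registered name used below are assumptions for illustration rather than values taken from this patch; here a pipeline with a single `transformer` component is unfrozen after 6000 steps:

```ini
# Assumed wiring: the exact hook and registered name may differ.
[training.before_update]
@callbacks = "spacy-curated-transformers.gradual_transformer_unfreezing.v1"

[training.before_update.target_pipes]
transformer = 6000
```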