Set curated transformers API version to 3.7

shadeMe 2023-08-08 13:14:28 +02:00
parent fa809443de
commit 0d2be9e96c


---
title: CuratedTransformer
teaser: Pipeline component for multi-task learning with transformer models
tag: class
source: github.com/explosion/spacy-transformers/blob/master/spacy_curated_transformers/pipeline_component.py
version: 3.7
api_base_class: /api/pipe
api_string_name: transformer
---

<Infobox title="Important note" variant="warning">

This component is available via the extension package
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers).
It exposes the component via entry points, so if you have the package installed,
using `factory = "curated_transformer"` in your
[training config](/usage/training#config) or
`nlp.add_pipe("curated_transformer")` will work out-of-the-box.

</Infobox>
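
A minimal sketch of the entry-point route described above. It assumes a blank
English pipeline; in practice the component also needs its model and tokenizer
loaders configured before training or inference:

```python
import spacy

nlp = spacy.blank("en")
# The "curated_transformer" factory becomes available via entry points
# once spacy-curated-transformers is installed.
trf = nlp.add_pipe("curated_transformer")
```
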
This Python package provides a curated set of transformer models for spaCy. It
is focused on deep integration into spaCy and will support deployment-focused
features such as distillation and quantization in the future. spaCy curated
transformers currently supports the following model types:

- ALBERT
- BERT
- CamemBERT
- RoBERTa
- XLM-RoBERTa

You will usually connect downstream components to a shared curated transformer
using one of the curated transformer listener layers. This works similarly to
spaCy's [Tok2Vec](/api/tok2vec) component and its
[Tok2VecListener](/api/architectures/#Tok2VecListener) sublayer.

Supporting a wide variety of transformer models is a non-goal. If you want to
use another type of model, use [spacy-transformers](/api/spacy-transformers),
which allows you to use Hugging Face transformers models with spaCy.

The component assigns the output of the transformer to the `Doc`'s extension
attributes. We also calculate an alignment between the word-piece tokens and
the spaCy tokenization.

For more details, see the
[usage documentation](/usage/embeddings-transformers).

## Assigned Attributes {id="assigned-attributes"}

The component sets the following
[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):

| Location         | Value                                                                                 |
| ---------------- | ------------------------------------------------------------------------------------- |
| `Doc._.trf_data` | CuratedTransformer tokens and outputs for the `Doc` object. ~~DocTransformerOutput~~ |
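
As an illustration, the attribute can be read off a processed `Doc` like any
other custom extension, assuming a fully configured pipeline:

```python
doc = nlp("This is a sentence.")
trf_data = doc._.trf_data  # DocTransformerOutput with tokens and outputs
```
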
## Config and implementation {id="config"}

See the [model architectures](/api/architectures) documentation for details on
the transformer architectures and their arguments and hyperparameters.

| Setting             | Description                                                                                                                                                                                                          |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [CuratedTransformerModel](/api/architectures#CuratedTransformerModel). ~~Model[List[Doc], FullCuratedTransformerBatch]~~    |
| `frozen` | If `True`, the model's weights are frozen and no backpropagation is performed. ~~bool~~ |
| `all_layer_outputs` | If `True`, the model returns the outputs of all the layers. Otherwise, only the output of the last layer is returned. This must be set to `True` if any of the pipe's downstream listeners require the outputs of all transformer layers. ~~bool~~ |
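
A hedged sketch of overriding these settings when adding the component; the
values shown are simply the defaults from the table above:

```python
trf = nlp.add_pipe(
    "curated_transformer",
    config={"frozen": False, "all_layer_outputs": False},
)
```
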

> #### Example
>
> ```python
> trf = CuratedTransformer(nlp.vocab, model)
> ```

Construct a `CuratedTransformer` component. One or more subsequent spaCy
components can use the transformer outputs as features in their models, with
gradients backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#assigned-attributes) extension
attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).

| Name                | Description |
| ------------------- | ----------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | One of the supported pre-trained transformer models. ~~Model~~ |
| _keyword-only_ | |

## CuratedTransformer.initialize {id="initialize",tag="method"}

Initialize the component for training. This method is typically called by
[`Language.initialize`](/api/language#initialize).

| Name             | Description |
| ---------------- | ----------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. Must contain at least one `Example`. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
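
A short usage sketch, assuming `examples` is an iterable of `Example` objects:

```python
trf = nlp.add_pipe("curated_transformer")
trf.initialize(lambda: examples, nlp=nlp)
```
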
## CuratedTransformer.set_annotations {id="set_annotations",tag="method"}
Assign the extracted features to the `Doc` objects. By default, the
[`DocTransformerOutput`](/api/curated-transformer#doctransformeroutput) object
is written to the [`Doc._.trf_data`](#assigned-attributes) attribute. Your
`set_extra_annotations` callback is then called, if provided.

> #### Example
>
> ```python
> scores = trf.predict(docs)
> trf.set_annotations(docs, scores)
> ```

| Name | Description |
| -------- | ------------------------------------------------------------ |
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `CuratedTransformer.predict`. |
## CuratedTransformer.update {id="update",tag="method"}
Prepare for an update to the transformer.
Like the [`Tok2Vec`](/api/tok2vec) component, the `CuratedTransformer` component
is unusual in that it does not receive "gold standard" annotations to calculate
a weight update. The optimal output of the transformer data is unknown; it's a
hidden layer inside the network that is updated by backpropagating from output
layers.

The `CuratedTransformer` component therefore does not perform a weight update
during its own `update` method. Instead, it runs its transformer model and
communicates the output and the backpropagation callback to any downstream
components that have been connected to it via the TransformerListener sublayer.
If there are multiple listeners, the last layer will actually backprop to the
transformer and call the optimizer, while the others simply increment the
gradients.
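
A minimal sketch of the resulting training step, following the generic pipe
API (variable names are illustrative):

```python
optimizer = nlp.initialize()
losses = trf.update(examples, drop=0.1, sgd=optimizer)
```
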
## CuratedTransformer.from_disk {id="from_disk",tag="method"}

Load the pipe from disk. Modifies the object in place and returns it.

| Name           | Description |
| -------------- | ----------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The modified `CuratedTransformer` object. ~~CuratedTransformer~~                                |
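
A usage sketch (the path is illustrative):

```python
trf = nlp.add_pipe("curated_transformer")
trf.from_disk("/path/to/transformer")
```
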
## CuratedTransformer.to_bytes {id="to_bytes",tag="method"}

Serialize the pipe to a bytestring.

| Name           | Description                                                                                   |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The serialized form of the `CuratedTransformer` object. ~~bytes~~                             |
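
A usage sketch:

```python
trf = nlp.add_pipe("curated_transformer")
trf_bytes = trf.to_bytes()
```
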
## CuratedTransformer.from_bytes {id="from_bytes",tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

| Name           | Description |
| -------------- | ----------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The `CuratedTransformer` object. ~~CuratedTransformer~~                                       |
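
A usage sketch, round-tripping the bytestring from the previous section:

```python
trf_bytes = trf.to_bytes()
trf = nlp.add_pipe("curated_transformer")
trf.from_bytes(trf_bytes)
```
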
## Serialization fields {id="serialization-fields"}

During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.

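For example, assuming `"vocab"` is one of the exported fields, it can be
skipped during serialization like this:

```python
data = trf.to_bytes(exclude=["vocab"])
```
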
## DocTransformerOutput {id="transformerdata",tag="dataclass"}

CuratedTransformer tokens and outputs for one `Doc` object. The transformer
models return tensors that refer to a whole padded batch of documents. These
tensors are wrapped into the
[FullCuratedTransformerBatch](/api/transformer#fulltransformerbatch) object. The
`FullCuratedTransformerBatch` then splits out the per-document data, which is
handled by this class. Instances of this class are typically assigned to the
[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.

| Name           | Description |
| -------------- | ----------- |
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
| `model_output` | The model output from the transformer model, determined by the model and transformer config. New in `spacy-transformers` v1.1.0. ~~transformers.file_utils.ModelOutput~~ |
| `tensors` | The `model_output` in the earlier `transformers` tuple format converted using [`ModelOutput.to_tuple()`](https://huggingface.co/transformers/main_classes/output.html#transformers.file_utils.ModelOutput.to_tuple). Returns `Tuple` instead of `List` as of `spacy-transformers` v1.1.0. ~~Tuple[Union[FloatsXd, List[FloatsXd]]]~~ |
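
A sketch of inspecting these fields on a processed `Doc`; the attribute names
are taken from the table above:

```python
doc = nlp("Inspect the transformer output.")
out = doc._.trf_data
print(out.tokens)   # tokenizer output slice for this Doc
print(out.tensors)  # tuple of per-document output tensors
```
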
Create an empty `DocTransformerOutput` container.
| Name | Description |
| ----------- | --------------------------------------- |
| **RETURNS** | The container. ~~DocTransformerOutput~~ |
## Span getters {id="span_getters",source="github.com/explosion/spacy-transformers/blob/master/spacy_curated_transformers/span_getters.py"}

Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a list of [`Span`](/api/span) objects for each doc to be processed by
the transformer. This is used to manage long documents by cutting them into
smaller sequences before running the transformer. The spans are allowed to
overlap, and you can also omit sections of the `Doc` if they are not relevant.
Span getters can be referenced in the
`[components.transformer.model.with_spans]` block of the config to customize
the sequences processed by the transformer.

| Name | Description |
| ----------- | ------------------------------------------------------------- |
| `docs` | A batch of `Doc` objects. ~~Iterable[Doc]~~ |
| **RETURNS** | The spans to process by the transformer. ~~List[List[Span]]~~ |
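
For illustration, a custom span getter that processes one span per sentence
might look as follows. This is a sketch: the decorator is assumed to be
available as `spacy.registry.span_getters` once the extension package is
installed, and `"custom_sent_spans"` is a hypothetical name:

```python
import spacy

@spacy.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    def get_sent_spans(docs):
        # One list of sentence spans per Doc; requires sentence boundaries.
        return [list(doc.sents) for doc in docs]
    return get_sent_spans
```
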
### WithStridedSpans.v1 {id="strided_spans",tag="registered function"}

## Tokenizer Loaders

### ByteBPELoader.v1 {id="bytebpe_loader",tag="registered_function"}

Construct a callback that initializes a Byte-BPE piece encoder model.
| Name | Description |
| ------------- | ------------------------------------- |
| `vocab_path` | Path to the vocabulary file. ~~Path~~ |
| `merges_path` | Path to the merges file. ~~Path~~ |

### CharEncoderLoader.v1 {id="char_encoder_loader",tag="registered_function"}

Construct a callback that initializes a character piece encoder model.
| Name | Description |
| ----------- | --------------------------------------------------------------------------- |
| `path` | Path to the serialized character model. ~~Path~~ |
| `bos_piece` | Piece used as a beginning-of-sentence token. Defaults to `"[BOS]"`. ~~str~~ |
| `eos_piece` | Piece used as an end-of-sentence token. Defaults to `"[EOS]"`. ~~str~~ |
| `unk_piece` | Piece used as a stand-in for unknown tokens. Defaults to `"[UNK]"`. ~~str~~ |
| `normalize` | Unicode normalization form to use. Defaults to `"NFKC"`. ~~str~~ |
### HFPieceEncoderLoader.v1 {id="hf_pieceencoder_loader",tag="registered_function"}
Construct a callback that initializes a HuggingFace piece encoder model. Used in
conjunction with the HuggingFace model loader.
| Name | Description |
| ---------- | ------------------------------------------ |
| `name` | Name of the HuggingFace model. ~~str~~ |
| `revision` | Name of the model revision/branch. ~~str~~ |
### SentencepieceLoader.v1 {id="sentencepiece_loader",tag="registered_function"}
Construct a callback that initializes a SentencePiece piece encoder model.
| Name | Description |
| ------ | ---------------------------------------------------- |
| `path` | Path to the serialized SentencePiece model. ~~Path~~ |
### WordpieceLoader.v1 {id="wordpiece_loader",tag="registered_function"}
Construct a callback that initializes a WordPiece piece encoder model.
| Name | Description |
| ------ | ------------------------------------------------ |
| `path` | Path to the serialized WordPiece model. ~~Path~~ |
## Model Loaders
### HFTransformerEncoderLoader.v1 {id="hf_trfencoder_loader",tag="registered_function"}
Construct a callback that initializes a supported transformer model with weights
from a corresponding HuggingFace model.
| Name | Description |
| ---------- | ------------------------------------------ |
| `name` | Name of the HuggingFace model. ~~str~~ |
| `revision` | Name of the model revision/branch. ~~str~~ |
### PyTorchCheckpointLoader.v1 {id="pytorch_checkpoint_loader",tag="registered_function"}
Construct a callback that initializes a supported transformer model with weights
from a PyTorch checkpoint.
| Name | Description |
| ------ | ---------------------------------------- |
| `path` | Path to the PyTorch checkpoint. ~~Path~~ |
## Callbacks
### gradual_transformer_unfreezing.v1 {id="gradual_transformer_unfreezing",tag="registered_function"}
Construct a callback that can be used to gradually unfreeze the weights of one
or more Transformer components during training. This can be used to prevent
catastrophic forgetting during fine-tuning.
| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `target_pipes` | A dictionary whose keys and values correspond to the names of Transformer components and the training step at which they should be unfrozen respectively. ~~Dict[str, int]~~ |