Fill in the DocTransformerOutput section

shadeMe 2023-08-14 13:48:45 +02:00
parent 921be30331
commit 0d9aa48865


@ -398,32 +398,52 @@ serialization by passing in the string names via the `exclude` argument.
| `cfg` | The config file. You usually don't want to exclude this. | | `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. | | `model` | The binary model data. You usually don't want to exclude this. |
## DocTransformerOutput {id="transformerdata",tag="dataclass"} ## DocTransformerOutput {id="doctransformeroutput",tag="dataclass"}
CuratedTransformer tokens and outputs for one `Doc` object. The transformer Curated Transformer outputs for one `Doc` object. Stores the dense
models return tensors that refer to a whole padded batch of documents. These representations generated by the transformer for each piece identifier. Piece
tensors are wrapped into the identifiers are grouped by token. Instances of this class are typically assigned
[FullCuratedTransformerBatch](/api/curatedtransformer#fulltransformerbatch)
object. The `FullCuratedTransformerBatch` then splits out the per-document data,
which is handled by this class. Instances of this class are typically assigned
to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension
attribute. attribute.
| Name | Description | | Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ | | `all_outputs` | List of `Ragged` tensors that correspond to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `model_output` | The model output from the transformer model, determined by the model and transformer config. New in `spacy-transformers` v1.1.0. ~~transformers.file_utils.ModelOutput~~ | | `last_layer_only` | Whether only the last transformer layer's outputs are preserved. ~~bool~~ |
| `tensors` | The `model_output` in the earlier `transformers` tuple format converted using [`ModelOutput.to_tuple()`](https://huggingface.co/transformers/main_classes/output.html#transformers.file_utils.ModelOutput.to_tuple). Returns `Tuple` instead of `List` as of `spacy-transformers` v1.1.0. ~~Tuple[Union[FloatsXd, List[FloatsXd]]]~~ |
| `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
| `width` | The width of the last hidden layer. ~~int~~ |
### DocTransformerOutput.empty {id="transformerdata-empty",tag="classmethod"} ### DocTransformerOutput.embedding_layer {id="doctransformeroutput-embeddinglayer",tag="property"}
Create an empty `DocTransformerOutput` container. Return the output of the transformer's embedding layer or `None` if
`last_layer_only` is `True`.
| Name | Description | | Name | Description |
| ----------- | --------------------------------------- | | ----------- | -------------------------------------------- |
| **RETURNS** | The container. ~~DocTransformerOutput~~ | | **RETURNS** | Embedding layer output. ~~Optional[Ragged]~~ |
### DocTransformerOutput.last_hidden_layer_state {id="doctransformeroutput-lasthiddenlayerstate",tag="property"}
Return the output of the transformer's last hidden layer.
| Name | Description |
| ----------- | ------------------------------------ |
| **RETURNS** | Last hidden layer output. ~~Ragged~~ |
### DocTransformerOutput.all_hidden_layer_states {id="doctransformeroutput-allhiddenlayerstates",tag="property"}
Return the outputs of all transformer layers (excluding the embedding layer).
| Name | Description |
| ----------- | -------------------------------------- |
| **RETURNS** | Hidden layer outputs. ~~List[Ragged]~~ |
### DocTransformerOutput.num_outputs {id="doctransformeroutput-numoutputs",tag="property"}
Return the number of layer outputs stored in the `DocTransformerOutput` instance
(including the embedding layer).
| Name | Description |
| ----------- | -------------------------- |
| **RETURNS** | Number of outputs. ~~int~~ |
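
The four properties above can be illustrated with a small stand-in class. This is only a sketch of the documented behaviour, not the library's implementation: `SketchDocOutput` and the `Layer` alias are hypothetical names, plain nested lists stand in for `Ragged` tensors, and the handling of `all_hidden_layer_states` when `last_layer_only` is `True` is an assumption that the section does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-in for a Ragged tensor: one vector (list of floats) per piece.
Layer = List[List[float]]


@dataclass
class SketchDocOutput:
    """Illustrative stand-in for DocTransformerOutput (not the real class)."""

    all_outputs: List[Layer]
    last_layer_only: bool

    @property
    def embedding_layer(self) -> Optional[Layer]:
        # Documented: None when only the last layer's output was kept.
        return None if self.last_layer_only else self.all_outputs[0]

    @property
    def last_hidden_layer_state(self) -> Layer:
        # The last stored output is the last hidden layer in both modes.
        return self.all_outputs[-1]

    @property
    def all_hidden_layer_states(self) -> List[Layer]:
        # Assumption: with last_layer_only=True the single stored output is
        # returned as-is; otherwise the embedding layer (index 0) is skipped.
        return self.all_outputs if self.last_layer_only else self.all_outputs[1:]

    @property
    def num_outputs(self) -> int:
        # Counts every stored output, including the embedding layer if present.
        return len(self.all_outputs)


# Two pieces, two-dimensional representations, embedding + two hidden layers.
embedding = [[0.0, 0.0], [0.1, 0.1]]
hidden_1 = [[1.0, 1.0], [1.1, 1.1]]
hidden_2 = [[2.0, 2.0], [2.1, 2.1]]

full = SketchDocOutput([embedding, hidden_1, hidden_2], last_layer_only=False)
last_only = SketchDocOutput([hidden_2], last_layer_only=True)
```

In this sketch, `full.num_outputs` counts the embedding layer while `full.all_hidden_layer_states` leaves it out, which mirrors the distinction drawn in the property descriptions above.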
## Span getters {id="span_getters",source="github.com/explosion/spacy-transformers/blob/master/spacy_curated_transformers/span_getters.py"} ## Span getters {id="span_getters",source="github.com/explosion/spacy-transformers/blob/master/spacy_curated_transformers/span_getters.py"}