Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Matthew Honnibal 2020-08-16 20:30:01 +02:00
commit 61dfdd9fbd

@ -377,54 +377,64 @@ serialization by passing in the string names via the `exclude` argument.
## TransformerData {#transformerdata tag="dataclass"}

Transformer tokens and outputs for one `Doc` object. The transformer models
return tensors that refer to a whole padded batch of documents. These tensors
are wrapped into the
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object. The
`FullTransformerBatch` then splits out the per-document data, which is handled
by this class. Instances of this class are typically assigned to the
[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.

| Name      | Type                                               | Description |
| --------- | -------------------------------------------------- | ----------- |
| `tokens`  | `Dict`                                             | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. |
| `tensors` | `List[FloatsXd]`                                   | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. |
| `align`   | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. |
| `width`   | `int`                                              | The width of the last hidden layer. |
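
The ragged layout described for `align` can be illustrated with a small, self-contained sketch. `SimpleRagged` and its `row` method are hypothetical stand-ins written for this example, not thinc's `Ragged` API, and the wordpieces are made up; the sketch only shows how a flat index array plus a `lengths` array encodes a variable number of wordpieces per token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleRagged:
    """Hypothetical stand-in for a ragged array: flat data plus row lengths."""

    data: List[int]  # wordpiece indices for all tokens, concatenated
    lengths: List[int]  # lengths[i] = number of wordpieces for token i

    def row(self, i: int) -> List[int]:
        # Row i starts where all previous rows end.
        start = sum(self.lengths[:i])
        return self.data[start : start + self.lengths[i]]


# Made-up wordpieces for the two tokens "Playing" and "chess".
wordpieces = ["[CLS]", "play", "##ing", "chess", "[SEP]"]

# Token 0 aligns to wordpieces 1 and 2, token 1 aligns to wordpiece 3.
align = SimpleRagged(data=[1, 2, 3], lengths=[2, 1])

playing_pieces = [wordpieces[j] for j in align.row(0)]
chess_pieces = [wordpieces[j] for j in align.row(1)]
```

Storing the indices in one flat array keeps the structure compact even when tokens map to different numbers of wordpieces, which is why a ragged type is a natural fit here.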
### TransformerData.empty {#transformerdata-emoty tag="classmethod"}

Create an empty `TransformerData` container.

| Name        | Type              | Description    |
| ----------- | ----------------- | -------------- |
| **RETURNS** | `TransformerData` | The container. |
## FullTransformerBatch {#fulltransformerbatch tag="dataclass"}

Holds a batch of input and output objects for a transformer model. The data can
then be split into a list of [`TransformerData`](/api/transformer#transformerdata)
objects to associate the outputs with each [`Doc`](/api/doc) in the batch.

| Name       | Type | Description |
| ---------- | ---- | ----------- |
| `spans`    | `List[List[Span]]` | The batch of input spans. The outer list refers to the `Doc` objects in the batch, and the inner list contains the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. |
| `tokens`   | [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) | The output of the tokenizer. |
| `tensors`  | `List[torch.Tensor]` | The output of the transformer model. |
| `align`    | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. |
| `doc_data` | `List[TransformerData]` | The outputs, split per `Doc` object. |
### FullTransformerBatch.unsplit_by_doc {#fulltransformerbatch-unsplit_by_doc tag="method"}

Return a new `FullTransformerBatch` from a split batch of activations, using the
current object's spans, tokens and alignment. This is used during the backward
pass, in order to construct the gradients to pass back into the transformer
model.

| Name        | Type                   | Description                     |
| ----------- | ---------------------- | ------------------------------- |
| `arrays`    | `List[List[Floats3d]]` | The split batch of activations. |
| **RETURNS** | `FullTransformerBatch` | The transformer batch.          |
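
The idea of rebuilding one batch from per-document pieces can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `unsplit_rows` is a hypothetical helper, it handles a single tensor rather than a list of tensors per `Doc`, and it assumes the batch holds one row per span with each document's rows stored contiguously in document order.

```python
from typing import List


def unsplit_rows(per_doc: List[List[list]]) -> List[list]:
    """Concatenate per-doc chunks back into one batch, preserving doc order."""
    batch = []
    for doc_rows in per_doc:
        batch.extend(doc_rows)
    return batch


# Made-up gradients: two docs with 2 and 1 span rows respectively.
d_per_doc = [
    [[0.1, 0.2], [0.3, 0.4]],  # doc 0: two spans
    [[0.5, 0.6]],  # doc 1: one span
]

d_batch = unsplit_rows(d_per_doc)
```

Because each document owns a contiguous block of rows, the inverse of splitting is a simple concatenation, and the result has the same row order as the batch that went into the model.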
### FullTransformerBatch.split_by_doc {#fulltransformerbatch-split_by_doc tag="method"}

Split a `TransformerData` object that represents a batch into a list with one
`TransformerData` per `Doc`.

| Name        | Type                    | Description      |
| ----------- | ----------------------- | ---------------- |
| **RETURNS** | `List[TransformerData]` | The split batch. |
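
As a rough illustration of per-document splitting, the sketch below cuts a batch into one contiguous chunk per document. `split_rows_by_doc` is a hypothetical helper, not part of the library, and the one-row-per-span layout is an assumption made for the example; the string labels stand in for rows of a tensor.

```python
from typing import List, Sequence


def split_rows_by_doc(rows: Sequence, spans_per_doc: List[int]) -> List[list]:
    """Cut a batch with one row per span into one contiguous chunk per doc."""
    chunks = []
    start = 0
    for n_spans in spans_per_doc:
        chunks.append(list(rows[start : start + n_spans]))
        start += n_spans
    if start != len(rows):
        raise ValueError("span counts do not cover the whole batch")
    return chunks


# Made-up batch: three docs that contributed 2, 1 and 3 spans.
batch_rows = ["d0-s0", "d0-s1", "d1-s0", "d2-s0", "d2-s1", "d2-s2"]
per_doc = split_rows_by_doc(batch_rows, [2, 1, 3])
```

The only bookkeeping needed is the number of spans each document contributed, since the chunks never overlap and together cover the whole batch.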
## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}