adjust references to null_annotation_setter to trfdata_setter

svlandeg 2020-08-27 09:43:32 +02:00
parent ec069627fe
commit 559b65f2e0
2 changed files with 34 additions and 31 deletions

@ -25,24 +25,23 @@ work out-of-the-box.
</Infobox>
-This pipeline component lets you use transformer models in your pipeline.
-Supports all models that are available via the
+This pipeline component lets you use transformer models in your pipeline. It
+supports all models that are available via the
[HuggingFace `transformers`](https://huggingface.co/transformers) library.
Usually you will connect subsequent components to the shared transformer using
the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
-The component assigns the output of the transformer to the `Doc`'s extension
-attributes. We also calculate an alignment between the word-piece tokens and the
-spaCy tokenization, so that we can use the last hidden states to set the
-`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
-token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
-package also adds the function registries [`@span_getters`](#span_getters) and
-[`@annotation_setters`](#annotation_setters) with several built-in registered
-functions. For more details, see the
-[usage documentation](/usage/embeddings-transformers).
+We calculate an alignment between the word-piece tokens and the spaCy
+tokenization, so that we can use the last hidden states to store the information
+on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
+spaCy token receives the sum of their values. By default, the information is
+written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
+you can implement a custom [`@annotation_setter`](#annotation_setters) to change
+this behaviour. The package also adds the function registry
+[`@span_getters`](#span_getters) with several built-in registered functions. For
+more details, see the [usage documentation](/usage/embeddings-transformers).
## Config and implementation {#config}
@ -62,9 +61,9 @@ architectures and their arguments and hyperparameters.
> ```
| Setting | Description |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
```python
@ -518,19 +517,23 @@ right context.
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
-Annotation setters are functions that that take a batch of `Doc` objects and a
-[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
-additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
-You can register custom annotation setters using the
-`@registry.annotation_setters` decorator.
+Annotation setters are functions that take a batch of `Doc` objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
+annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
+register custom annotation setters using the `@registry.annotation_setters`
+decorator. The default annotation setter used by the `Transformer` pipeline
+component is `trfdata_setter`, which sets the custom `Doc._.transformer_data`
+attribute.
> #### Example
>
> ```python
-> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
-> def configure_null_annotation_setter() -> Callable:
+> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
+> def configure_trfdata_setter() -> Callable:
>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
->         pass
+>         doc_data = list(trf_data.doc_data)
+>         for doc, data in zip(docs, doc_data):
+>             doc._.trf_data = data
>
>     return setter
> ```
@ -543,8 +546,8 @@ You can register custom annotation setters using the
The following built-in functions are available:
| Name | Description |
-| ---------------------------------------------- | ------------------------------------- |
-| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
+| -------------------------------------- | ------------------------------------------------------------- |
+| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
## Custom attributes {#custom-attributes}

@ -299,7 +299,7 @@ component:
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
-> from spacy_transformers.annotation_setters import null_annotation_setter
+> from spacy_transformers.annotation_setters import configure_trfdata_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
@ -309,7 +309,7 @@ component:
>         get_spans=get_doc_spans,
>         tokenizer_config={"use_fast": True},
>     ),
->     annotation_setter=null_annotation_setter,
+>     annotation_setter=configure_trfdata_setter(),
>     max_batch_items=4096,
> )
> ```
@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true}
@span_getters = "doc_spans.v1"
[components.transformer.annotation_setter]
-@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
+@annotation_setters = "spacy-transformers.trfdata_setter.v1"
```
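
Note on the change: the registered function is a factory that returns the setter, which is why the Python example now passes `configure_trfdata_setter()` with parentheses rather than a bare function. The following is a minimal sketch of that pattern, not the library's implementation. `StubDoc`, `StubBatch`, the `ANNOTATION_SETTERS` dict, the `register` helper and the `example.trfdata_setter.v1` name are all hypothetical stand-ins, used so the snippet runs without spaCy or spacy-transformers installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


# Hypothetical stand-in for spaCy's Doc; `extensions` plays the role of `doc._`.
@dataclass
class StubDoc:
    text: str
    extensions: Dict[str, Any] = field(default_factory=dict)


# Hypothetical stand-in for FullTransformerBatch; only `doc_data` is modelled,
# with one entry per document.
@dataclass
class StubBatch:
    doc_data: List[Any]


# Hypothetical registry: maps a name to a factory, loosely mirroring the
# decorator shown in the diff. This is not the real registry.
ANNOTATION_SETTERS: Dict[str, Callable[[], Callable]] = {}


def register(name: str) -> Callable:
    def decorator(factory: Callable) -> Callable:
        ANNOTATION_SETTERS[name] = factory
        return factory

    return decorator


@register("example.trfdata_setter.v1")
def configure_trfdata_setter() -> Callable[[List[StubDoc], StubBatch], None]:
    # The factory returns the callback; the callback does the actual work.
    def setter(docs: List[StubDoc], trf_data: StubBatch) -> None:
        for doc, data in zip(docs, trf_data.doc_data):
            doc.extensions["trf_data"] = data

    return setter


docs = [StubDoc("first"), StubDoc("second")]
batch = StubBatch(doc_data=["data-0", "data-1"])

# Look up the factory by name, call it to obtain the setter, then apply it.
setter = ANNOTATION_SETTERS["example.trfdata_setter.v1"]()
setter(docs, batch)
```

Passing the factory itself where the setter is expected would mean it gets called with a batch of documents instead of no arguments, so the call is the part to watch when adapting older code.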