adjust references to null_annotation_setter to trfdata_setter

svlandeg 2020-08-27 09:43:32 +02:00
parent ec069627fe
commit 559b65f2e0
2 changed files with 34 additions and 31 deletions


@ -25,24 +25,23 @@ work out-of-the-box.
</Infobox> </Infobox>
This pipeline component lets you use transformer models in your pipeline. This pipeline component lets you use transformer models in your pipeline. It
Supports all models that are available via the supports all models that are available via the
[HuggingFace `transformers`](https://huggingface.co/transformers) library. [HuggingFace `transformers`](https://huggingface.co/transformers) library.
Usually you will connect subsequent components to the shared transformer using Usually you will connect subsequent components to the shared transformer using
the [TransformerListener](/api/architectures#TransformerListener) layer. This the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
The component assigns the output of the transformer to the `Doc`'s extension We calculate an alignment between the word-piece tokens and the spaCy
attributes. We also calculate an alignment between the word-piece tokens and the tokenization, so that we can use the last hidden states to store the information
spaCy tokenization, so that we can use the last hidden states to set the on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy spaCy token receives the sum of their values. By default, the information is
token, the spaCy token receives the sum of their values. To access the values, written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The you can implement a custom [`@annotation_setter`](#annotation_setters) to change
package also adds the function registries [`@span_getters`](#span_getters) and this behaviour. The package also adds the function registry
[`@annotation_setters`](#annotation_setters) with several built-in registered [`@span_getters`](#span_getters) with several built-in registered functions. For
functions. For more details, see the more details, see the [usage documentation](/usage/embeddings-transformers).
[usage documentation](/usage/embeddings-transformers).
## Config and implementation {#config} ## Config and implementation {#config}
@ -61,11 +60,11 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ``` > ```
| Setting | Description | | Setting | Description |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | | `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | | `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | | `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
```python ```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@ -518,19 +517,23 @@ right context.
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
Annotation setters are functions that that take a batch of `Doc` objects and a Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
additional annotations on the `Doc`, e.g. to set custom or built-in attributes. annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
You can register custom annotation setters using the register custom annotation setters using the `@registry.annotation_setters`
`@registry.annotation_setters` decorator. decorator. The default annotation setter used by the `Transformer` pipeline
component is `trfdata_setter`, which sets the custom `Doc._.transformer_data`
attribute.
> #### Example > #### Example
> >
> ```python > ```python
> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") > @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
> def configure_null_annotation_setter() -> Callable: > def configure_trfdata_setter() -> Callable:
> def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
> pass > doc_data = list(trf_data.doc_data)
> for doc, data in zip(docs, doc_data):
> doc._.trf_data = data
> >
> return setter > return setter
> ``` > ```
@ -542,9 +545,9 @@ You can register custom annotation setters using the
The following built-in functions are available: The following built-in functions are available:
| Name | Description | | Name | Description |
| ---------------------------------------------- | ------------------------------------- | | -------------------------------------- | ------------------------------------------------------------- |
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | | `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
## Custom attributes {#custom-attributes} ## Custom attributes {#custom-attributes}


@ -299,7 +299,7 @@ component:
> >
> ```python > ```python
> from spacy_transformers import Transformer, TransformerModel > from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter > from spacy_transformers.annotation_setters import configure_trfdata_setter
> from spacy_transformers.span_getters import get_doc_spans > from spacy_transformers.span_getters import get_doc_spans
> >
> trf = Transformer( > trf = Transformer(
@ -309,7 +309,7 @@ component:
> get_spans=get_doc_spans, > get_spans=get_doc_spans,
> tokenizer_config={"use_fast": True}, > tokenizer_config={"use_fast": True},
> ), > ),
> annotation_setter=null_annotation_setter, > annotation_setter=configure_trfdata_setter(),
> max_batch_items=4096, > max_batch_items=4096,
> ) > )
> ``` > ```
@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true}
@span_getters = "doc_spans.v1" @span_getters = "doc_spans.v1"
[components.transformer.annotation_setter] [components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1" @annotation_setters = "spacy-transformers.trfdata_setter.v1"
``` ```
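
The change from passing `null_annotation_setter` to passing `configure_trfdata_setter()` reflects that the registered function is a factory: calling it returns the actual setter callable. The following is a minimal sketch of that pattern, not the spacy-transformers implementation. `FakeDoc`, `FakeBatch`, `SETTERS`, `register` and the name `example.trfdata_setter.v1` are hypothetical stand-ins so the example runs without spaCy installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


# Hypothetical stand-in for spaCy's Doc; `extensions` plays the role of `doc._`.
@dataclass
class FakeDoc:
    text: str
    extensions: Dict[str, Any] = field(default_factory=dict)


# Hypothetical stand-in for FullTransformerBatch: one data item per doc.
@dataclass
class FakeBatch:
    doc_data: List[Any]


Setter = Callable[[List[FakeDoc], FakeBatch], None]

# Hypothetical stand-in for the `annotation_setters` registry.
SETTERS: Dict[str, Callable[[], Setter]] = {}


def register(name: str) -> Callable[[Callable[[], Setter]], Callable[[], Setter]]:
    """Store a setter factory under a string name, as a registry would."""
    def decorator(factory: Callable[[], Setter]) -> Callable[[], Setter]:
        SETTERS[name] = factory
        return factory
    return decorator


@register("example.trfdata_setter.v1")
def configure_trfdata_setter() -> Setter:
    # The factory returns the function that does the per-batch work.
    def setter(docs: List[FakeDoc], trf_data: FakeBatch) -> None:
        # Pair each doc with its slice of the transformer output.
        for doc, data in zip(docs, list(trf_data.doc_data)):
            doc.extensions["trf_data"] = data

    return setter


docs = [FakeDoc("first"), FakeDoc("second")]
batch = FakeBatch(doc_data=["data-first", "data-second"])

# Look up the factory by name, call it to get the setter, then apply it.
setter = SETTERS["example.trfdata_setter.v1"]()
setter(docs, batch)
```

In the config file, the registry resolves the string name and calls the factory for you; in Python code you call the factory yourself, which is why the diff above passes `configure_trfdata_setter()` rather than the function object.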