revert annotations refactor

svlandeg 2020-08-31 14:40:55 +02:00
parent 13ee742fb4
commit e47ea88aeb
3 changed files with 68 additions and 71 deletions
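
As a rough illustration of the behaviour the documentation changes in this commit describe (this is not code from the commit): the transformer output is always written to the `Doc._.trf_data` attribute, and the `annotation_setter` callback, a no-op by default, is called afterwards to set any additional annotations. The sketch uses plain-Python stand-ins rather than spaCy or spacy-transformers. `FakeDoc`, `FakeBatch`, `set_annotations` and the `extra` dict are hypothetical names introduced here, and `doc.trf_data` stands in for `Doc._.trf_data`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc (not the real class)."""
    text: str
    trf_data: Any = None  # stands in for Doc._.trf_data
    extra: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeBatch:
    """Hypothetical stand-in for FullTransformerBatch: one entry per doc."""
    doc_data: List[Any]


def null_annotation_setter(docs: List[FakeDoc], trf_data: FakeBatch) -> None:
    """Default callback: set no additional annotations."""
    pass


def set_annotations(
    docs: List[FakeDoc],
    batch: FakeBatch,
    annotation_setter: Callable[[List[FakeDoc], FakeBatch], None] = null_annotation_setter,
) -> None:
    # Step 1 (assumed from the docs text): always store the per-doc data.
    for doc, data in zip(docs, batch.doc_data):
        doc.trf_data = data
    # Step 2: call the callback, which may add further annotations.
    annotation_setter(docs, batch)


def custom_annotation_setter(docs: List[FakeDoc], trf_data: FakeBatch) -> None:
    # Illustrative only: record how many values each doc received.
    for doc, data in zip(docs, trf_data.doc_data):
        doc.extra["n_values"] = len(data)


docs = [FakeDoc("some text"), FakeDoc("some other text")]
batch = FakeBatch(doc_data=[[0.1, 0.2], [0.3, 0.4, 0.5]])
set_annotations(docs, batch, custom_annotation_setter)
```

With the default callback only `trf_data` is populated. Passing a custom callback adds annotations on top of it instead of replacing it, which is the distinction between the two sides of the diff below.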

@@ -306,25 +306,24 @@ factories.
> i += 1 > i += 1
> ``` > ```
| Registry name | Description | | Registry name | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. | | `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | | `assets` | Registry for data assets, knowledge bases etc. |
| `assets` | Registry for data assets, knowledge bases etc. | | `batchers` | Registry for training and evaluation [data batchers](#batchers). |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). | | `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | | `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | | `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | | `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | | `loggers` | Registry for functions that log [training results](/usage/training). |
| `loggers` | Registry for functions that log [training results](/usage/training). | | `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | | `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | | `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | | `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | | `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | | `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
### spacy-transformers registry {#registry-transformers} ### spacy-transformers registry {#registry-transformers}
@@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
> ```python > ```python
> import spacy_transformers > import spacy_transformers
> >
> @spacy_transformers.registry.span_getters("my_span_getter.v1") > @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_span_getter() -> Callable: > def configure_custom_annotation_setter():
> def span_getter(docs: List[Doc]) -> List[List[Span]]: > def annotation_setter(docs, trf_data) -> None:
> # Transform each Doc into a List of Span objects > # Set annotations on the docs
> >
> return span_getter > return annotation_setter
> ``` > ```
| Registry name | Description | | Registry name | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | | [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}

@@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
We calculate an alignment between the word-piece tokens and the spaCy The component assigns the output of the transformer to the `Doc`'s extension
tokenization, so that we can use the last hidden states to store the information attributes. We also calculate an alignment between the word-piece tokens and the
on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the spaCy tokenization, so that we can use the last hidden states to set the
spaCy token receives the sum of their values. By default, the information is `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but token, the spaCy token receives the sum of their values. To access the values,
you can implement a custom [`@annotation_setter`](#annotation_setters) to change you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
this behaviour. The package also adds the function registry package also adds the function registries [`@span_getters`](#span_getters) and
[`@span_getters`](#span_getters) with several built-in registered functions. For [`@annotation_setters`](#annotation_setters) with several built-in registered
more details, see the [usage documentation](/usage/embeddings-transformers). functions. For more details, see the
[usage documentation](/usage/embeddings-transformers).
## Config and implementation {#config} ## Config and implementation {#config}
@@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ``` > ```
| Setting | Description | | Setting | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | | `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | | `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | | `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
```python ```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in its model, with gradients use the transformer outputs as features in its model, with gradients
backpropagated to the single shared weights. The activations from the backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
attribute by default, but you can provide a different `annotation_setter` to attribute. You can also provide a callback to set additional annotations. In
customize this behaviour. In your application, you would normally use a shortcut your application, you would normally use a shortcut for this and instantiate the
and instantiate the component using its string name and component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
[`nlp.add_pipe`](/api/language#create_pipe).
| Name | Description | | Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -205,9 +205,8 @@ modifying them.
Assign the extracted features to the Doc objects. By default, the Assign the extracted features to the Doc objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the [`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be [`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter
customized by providing a different `annotation_setter` argument upon callback is then called, if provided.
construction.
> #### Example > #### Example
> >
@@ -520,23 +519,20 @@ right context.
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
Annotation setters are functions that take a batch of `Doc` objects and a Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
annotations on the `Doc`, e.g. to set custom or built-in attributes. You can additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
register custom annotation setters using the `@registry.annotation_setters` You can register custom annotation setters using the
decorator. The default annotation setter used by the `Transformer` pipeline `@registry.annotation_setters` decorator.
component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
> #### Example > #### Example
> >
> ```python > ```python
> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") > @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
> def configure_trfdata_setter() -> Callable: > def configure_null_annotation_setter() -> Callable:
> def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
> doc_data = list(trf_data.doc_data) > pass
> for doc, data in zip(docs, doc_data):
> doc._.trf_data = data
> >
> return setter > return setter
> ``` > ```
| Name | Description | | Name | Description |
@@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
The following built-in functions are available: The following built-in functions are available:
| Name | Description | | Name | Description |
| -------------------------------------- | ------------------------------------------------------------- | | ---------------------------------------------- | ------------------------------------- |
| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. | | `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
## Custom attributes {#custom-attributes} ## Custom attributes {#custom-attributes}

@@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
``` ```
You can also customize how the [`Transformer`](/api/transformer) component sets You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`. annotations onto the [`Doc`](/api/doc), by specifying a custom
This callback will be called with the raw input and output data for the whole `annotation_setter`. This callback will be called with the raw input and output
batch, along with the batch of `Doc` objects, allowing you to implement whatever data for the whole batch, along with the batch of `Doc` objects, allowing you to
you need. The annotation setter is called with a batch of [`Doc`](/api/doc) implement whatever you need. The annotation setter is called with a batch of
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) [`Doc`](/api/doc) objects and a
containing the transformers data for the batch. [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
transformers data for the batch.
```python ```python
def custom_annotation_setter(docs, trf_data): def custom_annotation_setter(docs, trf_data):
@@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace training. To change any of the functions, like the span getter, you can replace
the name of the referenced function e.g. `@span_getters = "sent_spans.v1"` to the name of the referenced function e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the process sentences. You can also register your own functions using the
[`span_getters` registry](/api/top-level#registry). For instance, the following [`span_getters` registry](/api/top-level#registry). For instance, the following
custom function returns [`Span`](/api/span) objects following sentence custom function returns [`Span`](/api/span) objects following sentence
boundaries, unless a sentence exceeds a certain number of tokens, in which case boundaries, unless a sentence exceeds a certain number of tokens, in which case
subsentences of at most `max_length` tokens are returned. subsentences of at most `max_length` tokens are returned.
> #### config.cfg > #### config.cfg