revert annotations refactor

svlandeg 2020-08-31 14:40:55 +02:00
parent 13ee742fb4
commit e47ea88aeb
3 changed files with 68 additions and 71 deletions


@ -306,25 +306,24 @@ factories.
> i += 1
> ```
| Registry name | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `assets` | Registry for data assets, knowledge bases etc. |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `loggers` | Registry for functions that log [training results](/usage/training). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
| Registry name | Description |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `assets` | Registry for data assets, knowledge bases etc. |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `loggers` | Registry for functions that log [training results](/usage/training). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
### spacy-transformers registry {#registry-transformers}
@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.span_getters("my_span_getter.v1")
> def configure_custom_span_getter() -> Callable:
> def span_getter(docs: List[Doc]) -> List[List[Span]]:
> # Transform each Doc into a List of Span objects
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
> def annotation_setter(docs, trf_data) -> None:
> # Set annotations on the docs
>
> return span_getter
> return annotation_setter
> ```
| Registry name | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| Registry name | Description |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}


@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures#Tok2VecListener) sublayer.
We calculate an alignment between the word-piece tokens and the spaCy
tokenization, so that we can use the last hidden states to store the information
on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
spaCy token receives the sum of their values. By default, the information is
written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
you can implement a custom [`@annotation_setter`](#annotation_setters) to change
this behaviour. The package also adds the function registry
[`@span_getters`](#span_getters) with several built-in registered functions. For
more details, see the [usage documentation](/usage/embeddings-transformers).
The component assigns the output of the transformer to the `Doc`'s extension
attributes. We also calculate an alignment between the word-piece tokens and the
spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values,
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
package also adds the function registries [`@span_getters`](#span_getters) and
[`@annotation_setters`](#annotation_setters) with several built-in registered
functions. For more details, see the
[usage documentation](/usage/embeddings-transformers).
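As a rough illustration of the alignment step described above, the following plain-Python sketch sums word-piece vectors per token. It is not the spacy-transformers implementation: the helper name `sum_wordpieces`, the nested-list vectors, and the alignment format (one list of word-piece indices per token) are all assumptions made for this example.

```python
from typing import List


def sum_wordpieces(
    wordpiece_vectors: List[List[float]],
    alignment: List[List[int]],
) -> List[List[float]]:
    """Return one vector per token: the element-wise sum of the vectors
    of all word pieces aligned to that token (hypothetical helper)."""
    width = len(wordpiece_vectors[0]) if wordpiece_vectors else 0
    token_vectors = []
    for piece_indices in alignment:
        total = [0.0] * width
        for i in piece_indices:
            for j, value in enumerate(wordpiece_vectors[i]):
                total[j] += value
        token_vectors.append(total)
    return token_vectors


# Made-up data: three word pieces, two tokens.
# Token 0 aligns to word pieces 0 and 1, token 1 aligns to word piece 2.
wordpieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alignment = [[0, 1], [2]]
token_vectors = sum_wordpieces(wordpieces, alignment)
print(token_vectors)
```

A token with several aligned word pieces ends up with the sum of their vectors, while a token with a single word piece keeps that vector unchanged.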
## Config and implementation {#config}
@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ```
| Setting | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
| Setting | Description |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in its model, with gradients
backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
attribute by default, but you can provide a different `annotation_setter` to
customize this behaviour. In your application, you would normally use a shortcut
and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).
attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
| Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -205,9 +205,8 @@ modifying them.
Assign the extracted features to the Doc objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
customized by providing a different `annotation_setter` argument upon
construction.
[`Doc._.trf_data`](#custom-attributes) attribute. Your `annotation_setter`
callback is then called, if provided.
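The order of operations described here can be sketched with stand-in objects. Everything below is hypothetical: `FakeDoc` replaces spaCy's `Doc`, a plain attribute replaces the `Doc._.trf_data` extension, and a list of per-doc values replaces the `FullTransformerBatch`. It only illustrates that the data is stored first and the optional callback runs afterwards.

```python
from typing import Callable, List, Optional


class FakeDoc:
    """Stand-in for spaCy's Doc (assumption for this sketch only)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.trf_data = None  # plays the role of Doc._.trf_data
        self.n_values = None  # hypothetical additional annotation


def set_annotations(
    docs: List[FakeDoc],
    doc_data: List[list],
    annotation_setter: Optional[Callable[[List[FakeDoc], List[list]], None]] = None,
) -> None:
    # Step 1: always store the per-doc transformer output.
    for doc, data in zip(docs, doc_data):
        doc.trf_data = data
    # Step 2: call the optional callback, which can rely on step 1.
    if annotation_setter is not None:
        annotation_setter(docs, doc_data)


def count_values_setter(docs: List[FakeDoc], doc_data: List[list]) -> None:
    # Reads the data that was stored before the callback was called.
    for doc in docs:
        doc.n_values = len(doc.trf_data)


docs = [FakeDoc("some text"), FakeDoc("some other text")]
set_annotations(docs, [[0.1, 0.2], [0.3, 0.4, 0.5]], count_values_setter)
print([doc.n_values for doc in docs])
```

Because the stored data is available before the callback runs, a custom setter only needs to add whatever extra annotations it is responsible for.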
> #### Example
>
@ -520,23 +519,20 @@ right context.
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
register custom annotation setters using the `@registry.annotation_setters`
decorator. The default annotation setter used by the `Transformer` pipeline
component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
You can register custom annotation setters using the
`@registry.annotation_setters` decorator.
> #### Example
>
> ```python
> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
> def configure_trfdata_setter() -> Callable:
> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
> def configure_null_annotation_setter() -> Callable:
> def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
> doc_data = list(trf_data.doc_data)
> for doc, data in zip(docs, doc_data):
> doc._.trf_data = data
> pass
>
> return setter
> return setter
> ```
| Name | Description |
@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
The following built-in functions are available:
| Name | Description |
| -------------------------------------- | ------------------------------------------------------------- |
| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
| Name | Description |
| ---------------------------------------------- | ------------------------------------- |
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
## Custom attributes {#custom-attributes}


@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
```
You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of `Doc` objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
containing the transformers data for the batch.
annotations onto the [`Doc`](/api/doc), by specifying a custom
`annotation_setter`. This callback will be called with the raw input and output
data for the whole batch, along with the batch of `Doc` objects, allowing you to
implement whatever you need. The annotation setter is called with a batch of
[`Doc`](/api/doc) objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
transformers data for the batch.
```python
def custom_annotation_setter(docs, trf_data):
@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
[`span_getters` registry](/api/top-level#registry). For instance, the following
custom function returns [`Span`](/api/span) objects following sentence
boundaries, unless a sentence exceeds a certain number of tokens, in which case
[`span_getters` registry](/api/top-level#registry). For instance, the following
custom function returns [`Span`](/api/span) objects following sentence
boundaries, unless a sentence exceeds a certain number of tokens, in which case
subsentences of at most `max_length` tokens are returned.
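The splitting logic itself can be sketched without spaCy. In the sketch below, sentences are plain lists of token strings rather than `Span` objects, and the function name `split_long_sentences` is hypothetical. It shows only the chunking rule, not how a span getter is registered.

```python
from typing import List


def split_long_sentences(
    sentences: List[List[str]], max_length: int
) -> List[List[str]]:
    """Keep sentences of up to max_length tokens whole and cut longer
    ones into consecutive chunks of at most max_length tokens."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    spans = []
    for sent in sentences:
        start = 0
        while start < len(sent):
            spans.append(sent[start : start + max_length])
            start += max_length
    return spans


# Made-up input: one sentence of five tokens and one of two tokens.
sentences = [["a", "b", "c", "d", "e"], ["f", "g"]]
spans = split_long_sentences(sentences, max_length=2)
print(spans)
```

Chunks never cross a sentence boundary, so the last chunk of a long sentence can be shorter than `max_length`.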
> #### config.cfg