revert annotations refactor

2025-08-22 13:04:56 +03:00 · 2020-08-31 14:40:55 +02:00 · 2020-08-31 14:40:55 +02:00 · e47ea88aeb
commit e47ea88aeb
parent 13ee742fb4
3 changed files with 68 additions and 71 deletions
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -306,25 +306,24 @@ factories.
 >         i += 1
 > ```

-| Registry name        | Description                                                                                                                                                                                                                                        |
-| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects.                                                                                                                                                                            |
-| `architectures`      | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
-| `assets`             | Registry for data assets, knowledge bases etc.                                                                                                                                                                                                     |
-| `batchers`           | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
-| `callbacks`          | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
-| `displacy_colors`    | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
-| `factories`          | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
-| `initializers`       | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
-| `languages`          | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
-| `layers`             | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
-| `loggers`            | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
-| `lookups`            | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
-| `losses`             | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
-| `optimizers`         | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
-| `readers`            | Registry for training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                                                    |
-| `schedules`          | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
-| `tokenizers`         | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |
+| Registry name     | Description                                                                                                                                                                                                                                        |
+| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
+| `assets`          | Registry for data assets, knowledge bases etc.                                                                                                                                                                                                     |
+| `batchers`        | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
+| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
+| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
+| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
+| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
+| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
+| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
+| `loggers`         | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
+| `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
+| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
+| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
+| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                                                    |
+| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
+| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |

 ### spacy-transformers registry {#registry-transformers}

@@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
 > ```python
 > import spacy_transformers
 >
-> @spacy_transformers.registry.span_getters("my_span_getter.v1")
-> def configure_custom_span_getter() -> Callable:
->     def span_getter(docs: List[Doc]) -> List[List[Span]]:
->        # Transform each Doc into a List of Span objects
+> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
+> def configure_custom_annotation_setter():
+>     def annotation_setter(docs, trf_data) -> None:
+>         ...  # Set annotations on the docs
 >
->     return span_getter
+>     return annotation_setter
 > ```

-| Registry name                                   | Description                                                                                                                                  |
-| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
-| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
+| Registry name                                               | Description                                                                                                                                                                                                                                       |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
+| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

 ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}

--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
 works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
 [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.

-We calculate an alignment between the word-piece tokens and the spaCy
-tokenization, so that we can use the last hidden states to store the information
-on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
-spaCy token receives the sum of their values. By default, the information is
-written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
-you can implement a custom [`@annotation_setter`](#annotation_setters) to change
-this behaviour. The package also adds the function registry
-[`@span_getters`](#span_getters) with several built-in registered functions. For
-more details, see the [usage documentation](/usage/embeddings-transformers).
+The component assigns the output of the transformer to the `Doc`'s extension
+attributes. We also calculate an alignment between the word-piece tokens and the
+spaCy tokenization, so that we can use the last hidden states to set the
+`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
+token, the spaCy token receives the sum of their values. To access the values,
+you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+package also adds the function registries [`@span_getters`](#span_getters) and
+[`@annotation_setters`](#annotation_setters) with several built-in registered
+functions. For more details, see the
+[usage documentation](/usage/embeddings-transformers).

 ## Config and implementation {#config}

@@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```

-| Setting             | Description                                                                                                                                                                                                                               |
-| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                               |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                            |
+| Setting             | Description                                                                                                                                                                                                                                                                                                           |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                                                                                                           |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~          |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                                                                                                        |

 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
 use the transformer outputs as features in its model, with gradients
 backpropagated to the single shared weights. The activations from the
 transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
-attribute by default, but you can provide a different `annotation_setter` to
-customize this behaviour. In your application, you would normally use a shortcut
-and instantiate the component using its string name and
-[`nlp.add_pipe`](/api/language#create_pipe).
+attribute. You can also provide a callback to set additional annotations. In
+your application, you would normally use a shortcut for this and instantiate the
+component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).

 | Name                | Description                                                                                                                                                                                                                                        |
 | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -205,9 +205,8 @@ modifying them.

 Assign the extracted features to the Doc objects. By default, the
 [`TransformerData`](/api/transformer#transformerdata) object is written to the
-[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
-customized by providing a different `annotation_setter` argument upon
-construction.
+[`Doc._.trf_data`](#custom-attributes) attribute. Your `annotation_setter`
+callback is then called, if provided.

 > #### Example
 >
@@ -520,23 +519,20 @@ right context.
 ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

 Annotation setters are functions that take a batch of `Doc` objects and a
-[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
-annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
-register custom annotation setters using the `@registry.annotation_setters`
-decorator. The default annotation setter used by the `Transformer` pipeline
-component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
+additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
+You can register custom annotation setters using the
+`@registry.annotation_setters` decorator.

 > #### Example
 >
 > ```python
-> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
-> def configure_trfdata_setter() -> Callable:
+> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
+> def configure_null_annotation_setter() -> Callable:
 >     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
->         doc_data = list(trf_data.doc_data)
->         for doc, data in zip(docs, doc_data):
->             doc._.trf_data = data
+>         pass
 >
 >     return setter
 > ```

 | Name       | Description                                                   |
@@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.

 The following built-in functions are available:

-| Name                                   | Description                                                   |
-| -------------------------------------- | ------------------------------------------------------------- |
-| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
+| Name                                           | Description                           |
+| ---------------------------------------------- | ------------------------------------- |
+| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |

 ## Custom attributes {#custom-attributes}

--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
 ```

 You can also customize how the [`Transformer`](/api/transformer) component sets
-annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
-This callback will be called with the raw input and output data for the whole
-batch, along with the batch of `Doc` objects, allowing you to implement whatever
-you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
-objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
-containing the transformers data for the batch.
+annotations onto the [`Doc`](/api/doc), by specifying a custom
+`annotation_setter`. This callback will be called with the raw input and output
+data for the whole batch, along with the batch of `Doc` objects, allowing you to
+implement whatever you need. The annotation setter is called with a batch of
+[`Doc`](/api/doc) objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
+transformers data for the batch.

 ```python
 def custom_annotation_setter(docs, trf_data):
@@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-[`span_getters` registry](/api/top-level#registry). For instance, the following 
-custom function returns [`Span`](/api/span) objects following sentence 
-boundaries, unless a sentence succeeds a certain amount of tokens, in which case 
+[`span_getters` registry](/api/top-level#registry). For instance, the following
+custom function returns [`Span`](/api/span) objects following sentence
+boundaries, unless a sentence exceeds a certain number of tokens, in which case
 subsentences of at most `max_length` tokens are returned.

 > #### config.cfg