revert annotations refactor

2025-07-18 20:22:25 +03:00 · 2020-08-31 14:40:55 +02:00 · e47ea88aeb
commit e47ea88aeb
parent 13ee742fb4
3 changed files with 68 additions and 71 deletions
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -306,25 +306,24 @@ factories.
 >         i += 1
 > ```
-| Registry name        | Description                                                                                                                                                                                                                                        |
+| Registry name     | Description                                                                                                                                                                                                                                        |
-| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects.                                                                                                                                                                            |
+| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
-| `architectures`      | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
+| `assets`          | Registry for data assets, knowledge bases etc.                                                                                                                                                                                                     |
-| `assets`             | Registry for data assets, knowledge bases etc.                                                                                                                                                                                                     |
+| `batchers`        | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
-| `batchers`           | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
+| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
-| `callbacks`          | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
+| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
-| `displacy_colors`    | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
+| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
-| `factories`          | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
+| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
-| `initializers`       | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
+| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
-| `languages`          | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
+| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
-| `layers`             | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
+| `loggers`         | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
-| `loggers`            | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
+| `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
-| `lookups`            | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
+| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
-| `losses`             | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
+| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
-| `optimizers`         | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
+| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                                                    |
-| `readers`            | Registry for training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                                                    |
+| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
-| `schedules`          | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
+| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |
-| `tokenizers`         | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |
 ### spacy-transformers registry {#registry-transformers}
@@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
 > ```python
 > import spacy_transformers
 >
-> @spacy_transformers.registry.span_getters("my_span_getter.v1")
+> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
-> def configure_custom_span_getter() -> Callable:
+> def configure_custom_annotation_setter():
->     def span_getter(docs: List[Doc]) -> List[List[Span]]:
+>     def annotation_setter(docs, trf_data) -> None:
->        # Transform each Doc into a List of Span objects
+>        # Set annotations on the docs
 >
->     return span_getter
+>     return annotation_setter
 > ```
-| Registry name                                   | Description                                                                                                                                  |
+| Registry name                                               | Description                                                                                                                                                                                                                                       |
-| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
+| [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
+| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
 ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}
--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
 works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
 [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
-We calculate an alignment between the word-piece tokens and the spaCy
+The component assigns the output of the transformer to the `Doc`'s extension
-tokenization, so that we can use the last hidden states to store the information
+attributes. We also calculate an alignment between the word-piece tokens and the
-on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
+spaCy tokenization, so that we can use the last hidden states to set the
-spaCy token receives the sum of their values. By default, the information is
+`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
-written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
+token, the spaCy token receives the sum of their values. To access the values,
-you can implement a custom [`@annotation_setter`](#annotation_setters) to change
+you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
-this behaviour. The package also adds the function registry
+package also adds the function registries [`@span_getters`](#span_getters) and
-[`@span_getters`](#span_getters) with several built-in registered functions. For
+[`@annotation_setters`](#annotation_setters) with several built-in registered
-more details, see the [usage documentation](/usage/embeddings-transformers).
+functions. For more details, see the
+[usage documentation](/usage/embeddings-transformers).
 ## Config and implementation {#config}
@@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```
-| Setting             | Description                                                                                                                                                                                                                               |
+| Setting             | Description                                                                                                                                                                                                                                                                                                           |
-| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                               |
+| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                                                                                                           |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                            |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                                                                                                        |
 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
 use the transformer outputs as features in its model, with gradients
 backpropagated to the single shared weights. The activations from the
 transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
-attribute by default, but you can provide a different `annotation_setter` to
+attribute. You can also provide a callback to set additional annotations. In
-customize this behaviour. In your application, you would normally use a shortcut
+your application, you would normally use a shortcut for this and instantiate the
-and instantiate the component using its string name and
+component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
-[`nlp.add_pipe`](/api/language#create_pipe).
 | Name                | Description                                                                                                                                                                                                                                        |
 | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -205,9 +205,8 @@ modifying them.
 Assign the extracted features to the Doc objects. By default, the
 [`TransformerData`](/api/transformer#transformerdata) object is written to the
-[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
+[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter
-customized by providing a different `annotation_setter` argument upon
+callback is then called, if provided.
-construction.
 > #### Example
 >
@@ -520,23 +519,20 @@ right context.
 ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
 Annotation setters are functions that take a batch of `Doc` objects and a
-[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
-annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
+additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
-register custom annotation setters using the `@registry.annotation_setters`
+You can register custom annotation setters using the
-decorator. The default annotation setter used by the `Transformer` pipeline
+`@registry.annotation_setters` decorator.
-component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
 > #### Example
 >
 > ```python
-> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
+> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
-> def configure_trfdata_setter() -> Callable:
+> def configure_null_annotation_setter() -> Callable:
 >     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
->         doc_data = list(trf_data.doc_data)
+>         pass
->         for doc, data in zip(docs, doc_data):
->             doc._.trf_data = data
 >
 >     return setter
 > ```
 | Name       | Description                                                   |
@@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
 The following built-in functions are available:
-| Name                                   | Description                                                   |
+| Name                                           | Description                           |
-| -------------------------------------- | ------------------------------------------------------------- |
+| ---------------------------------------------- | ------------------------------------- |
-| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
+| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
 ## Custom attributes {#custom-attributes}
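
Note on the behaviour restored by the `transformer.md` hunks above: as the added lines describe it, the component writes `Doc._.trf_data` itself and only then calls the optional `annotation_setter`, whose default sets nothing further. The following is a minimal sketch of that call order only. It does not import spaCy or spacy-transformers; `FakeDoc`, `FakeBatch`, `set_annotations` and `width_setter` are hypothetical stand-ins introduced for illustration, not names from either library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc; `ext` plays the role of `doc._`."""
    text: str
    ext: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeBatch:
    """Hypothetical stand-in for FullTransformerBatch: one data item per doc."""
    doc_data: List[Any]


def null_annotation_setter(docs: List[FakeDoc], trf_data: FakeBatch) -> None:
    """Mirrors the restored default: set no additional annotations."""


def set_annotations(
    docs: List[FakeDoc],
    trf_data: FakeBatch,
    annotation_setter: Callable[[List[FakeDoc], FakeBatch], None] = null_annotation_setter,
) -> None:
    # Step 1: the component itself always stores the per-doc transformer data.
    for doc, data in zip(docs, trf_data.doc_data):
        doc.ext["trf_data"] = data
    # Step 2: the optional callback runs afterwards and may add more annotations.
    annotation_setter(docs, trf_data)


def width_setter(docs: List[FakeDoc], trf_data: FakeBatch) -> None:
    # Example callback: it can rely on trf_data already being set on each doc.
    for doc in docs:
        doc.ext["trf_width"] = len(doc.ext["trf_data"])


docs = [FakeDoc("some text"), FakeDoc("some other text")]
batch = FakeBatch(doc_data=[[0.1, 0.2], [0.3, 0.4]])
set_annotations(docs, batch, annotation_setter=width_setter)
print(docs[0].ext)
```

With the default callback, only the `trf_data` entry is set, which is the difference from the reverted `trfdata_setter` design, where the callback itself was responsible for storing the data.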
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
 ```
 You can also customize how the [`Transformer`](/api/transformer) component sets
-annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
+annotations onto the [`Doc`](/api/doc), by specifying a custom
-This callback will be called with the raw input and output data for the whole
+`annotation_setter`. This callback will be called with the raw input and output
-batch, along with the batch of `Doc` objects, allowing you to implement whatever
+data for the whole batch, along with the batch of `Doc` objects, allowing you to
-you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
+implement whatever you need. The annotation setter is called with a batch of
-objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
+[`Doc`](/api/doc) objects and a
-containing the transformers data for the batch.
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
+transformers data for the batch.
 ```python
 def custom_annotation_setter(docs, trf_data):
@@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-[`span_getters` registry](/api/top-level#registry). For instance, the following 
+[`span_getters` registry](/api/top-level#registry). For instance, the following
-custom function returns [`Span`](/api/span) objects following sentence 
+custom function returns [`Span`](/api/span) objects following sentence
-boundaries, unless a sentence succeeds a certain amount of tokens, in which case 
+boundaries, unless a sentence succeeds a certain amount of tokens, in which case
 subsentences of at most `max_length` tokens are returned.
 > #### config.cfg