mirror of https://github.com/explosion/spaCy.git
synced 2025-05-29 10:13:19 +03:00
revert annotations refactor
parent 13ee742fb4
commit e47ea88aeb
@@ -307,8 +307,7 @@ factories.
> ```

| Registry name | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `assets` | Registry for data assets, knowledge bases etc. |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
@@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.span_getters("my_span_getter.v1")
> def configure_custom_span_getter() -> Callable:
>     def span_getter(docs: List[Doc]) -> List[List[Span]]:
>         # Transform each Doc into a List of Span objects
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>
>     return span_getter
>     return annotation_setter
> ```

| Registry name | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}

@@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.

We calculate an alignment between the word-piece tokens and the spaCy
tokenization, so that we can use the last hidden states to store the information
on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
spaCy token receives the sum of their values. By default, the information is
written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
you can implement a custom [`@annotation_setter`](#annotation_setters) to change
this behaviour. The package also adds the function registry
[`@span_getters`](#span_getters) with several built-in registered functions. For
more details, see the [usage documentation](/usage/embeddings-transformers).
The component assigns the output of the transformer to the `Doc`'s extension
attributes. We also calculate an alignment between the word-piece tokens and the
spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values,
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
package also adds the function registries [`@span_getters`](#span_getters) and
[`@annotation_setters`](#annotation_setters) with several built-in registered
functions. For more details, see the
[usage documentation](/usage/embeddings-transformers).

## Config and implementation {#config}

@@ -61,9 +62,9 @@ on the transformer architectures and their arguments and hyperparameters.
> ```

| Setting | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |

```python
@@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in its model, with gradients
backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
attribute by default, but you can provide a different `annotation_setter` to
customize this behaviour. In your application, you would normally use a shortcut
and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).
attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).

| Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -205,9 +205,8 @@ modifying them.

Assign the extracted features to the Doc objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
customized by providing a different `annotation_setter` argument upon
construction.
[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter
callback is then called, if provided.

> #### Example
>
@@ -520,21 +519,18 @@ right context.
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
register custom annotation setters using the `@registry.annotation_setters`
decorator. The default annotation setter used by the `Transformer` pipeline
component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
You can register custom annotation setters using the
`@registry.annotation_setters` decorator.

> #### Example
>
> ```python
> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
> def configure_trfdata_setter() -> Callable:
> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
> def configure_null_annotation_setter() -> Callable:
>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
>         doc_data = list(trf_data.doc_data)
>         for doc, data in zip(docs, doc_data):
>             doc._.trf_data = data
>         pass
>
>     return setter
> ```
@@ -547,8 +543,8 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
The following built-in functions are available:

| Name | Description |
| -------------------------------------- | ------------------------------------------------------------- |
| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
| ---------------------------------------------- | ------------------------------------- |
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |

## Custom attributes {#custom-attributes}

@@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
```

You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of `Doc` objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
containing the transformers data for the batch.
annotations onto the [`Doc`](/api/doc), by specifying a custom
`annotation_setter`. This callback will be called with the raw input and output
data for the whole batch, along with the batch of `Doc` objects, allowing you to
implement whatever you need. The annotation setter is called with a batch of
[`Doc`](/api/doc) objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
transformers data for the batch.

```python
def custom_annotation_setter(docs, trf_data):