mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00
revert annotations refactor
This commit is contained in:
parent
13ee742fb4
commit
e47ea88aeb
|
@ -306,25 +306,24 @@ factories.
|
|||
> i += 1
|
||||
> ```
|
||||
|
||||
| Registry name | Description |
|
||||
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. |
|
||||
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||
| `assets` | Registry for data assets, knowledge bases etc. |
|
||||
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
||||
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||
| `loggers` | Registry for functions that log [training results](/usage/training). |
|
||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
||||
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
||||
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
|
||||
| Registry name | Description |
|
||||
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||
| `assets` | Registry for data assets, knowledge bases etc. |
|
||||
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
||||
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||
| `loggers` | Registry for functions that log [training results](/usage/training). |
|
||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
||||
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
||||
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
|
||||
|
||||
### spacy-transformers registry {#registry-transformers}
|
||||
|
||||
|
@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and
|
|||
> ```python
|
||||
> import spacy_transformers
|
||||
>
|
||||
> @spacy_transformers.registry.span_getters("my_span_getter.v1")
|
||||
> def configure_custom_span_getter() -> Callable:
|
||||
> def span_getter(docs: List[Doc]) -> List[List[Span]]:
|
||||
> # Transform each Doc into a List of Span objects
|
||||
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
|
||||
> def configure_custom_annotation_setter():
|
||||
> def annotation_setter(docs, trf_data) -> None:
|
||||
> # Set annotations on the docs
|
||||
>
|
||||
> return span_getter
|
||||
> return annotation_setter
|
||||
> ```
|
||||
|
||||
| Registry name | Description |
|
||||
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
||||
| Registry name | Description |
|
||||
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
||||
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
||||
|
||||
## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}
|
||||
|
||||
|
|
|
@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This
|
|||
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
|
||||
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
|
||||
|
||||
We calculate an alignment between the word-piece tokens and the spaCy
|
||||
tokenization, so that we can use the last hidden states to store the information
|
||||
on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
|
||||
spaCy token receives the sum of their values. By default, the information is
|
||||
written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
|
||||
you can implement a custom [`@annotation_setter`](#annotation_setters) to change
|
||||
this behaviour. The package also adds the function registry
|
||||
[`@span_getters`](#span_getters) with several built-in registered functions. For
|
||||
more details, see the [usage documentation](/usage/embeddings-transformers).
|
||||
The component assigns the output of the transformer to the `Doc`'s extension
|
||||
attributes. We also calculate an alignment between the word-piece tokens and the
|
||||
spaCy tokenization, so that we can use the last hidden states to set the
|
||||
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
|
||||
token, the spaCy token receives the sum of their values. To access the values,
|
||||
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
|
||||
package also adds the function registries [`@span_getters`](#span_getters) and
|
||||
[`@annotation_setters`](#annotation_setters) with several built-in registered
|
||||
functions. For more details, see the
|
||||
[usage documentation](/usage/embeddings-transformers).
|
||||
|
||||
## Config and implementation {#config}
|
||||
|
||||
|
@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
||||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
| Setting | Description |
|
||||
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
||||
|
@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
|
|||
use the transformer outputs as features in its model, with gradients
|
||||
backpropagated to the single shared weights. The activations from the
|
||||
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
|
||||
attribute by default, but you can provide a different `annotation_setter` to
|
||||
customize this behaviour. In your application, you would normally use a shortcut
|
||||
and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#create_pipe).
|
||||
attribute. You can also provide a callback to set additional annotations. In
|
||||
your application, you would normally use a shortcut for this and instantiate the
|
||||
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
|
||||
|
||||
| Name | Description |
|
||||
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
|
@ -205,9 +205,8 @@ modifying them.
|
|||
|
||||
Assign the extracted features to the Doc objects. By default, the
|
||||
[`TransformerData`](/api/transformer#transformerdata) object is written to the
|
||||
[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
|
||||
customized by providing a different `annotation_setter` argument upon
|
||||
construction.
|
||||
[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter
|
||||
callback is then called, if provided.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -520,23 +519,20 @@ right context.
|
|||
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
|
||||
|
||||
Annotation setters are functions that take a batch of `Doc` objects and a
|
||||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
|
||||
annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
|
||||
register custom annotation setters using the `@registry.annotation_setters`
|
||||
decorator. The default annotation setter used by the `Transformer` pipeline
|
||||
component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
|
||||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
|
||||
additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
|
||||
You can register custom annotation setters using the
|
||||
`@registry.annotation_setters` decorator.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
|
||||
> def configure_trfdata_setter() -> Callable:
|
||||
> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
|
||||
> def configure_null_annotation_setter() -> Callable:
|
||||
> def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
|
||||
> doc_data = list(trf_data.doc_data)
|
||||
> for doc, data in zip(docs, doc_data):
|
||||
> doc._.trf_data = data
|
||||
> pass
|
||||
>
|
||||
> return setter
|
||||
> return setter
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
|
@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
|
|||
|
||||
The following built-in functions are available:
|
||||
|
||||
| Name | Description |
|
||||
| -------------------------------------- | ------------------------------------------------------------- |
|
||||
| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
|
||||
| Name | Description |
|
||||
| ---------------------------------------------- | ------------------------------------- |
|
||||
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
|
||||
|
||||
## Custom attributes {#custom-attributes}
|
||||
|
||||
|
|
|
@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]):
|
|||
```
|
||||
|
||||
You can also customize how the [`Transformer`](/api/transformer) component sets
|
||||
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
|
||||
This callback will be called with the raw input and output data for the whole
|
||||
batch, along with the batch of `Doc` objects, allowing you to implement whatever
|
||||
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
|
||||
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
|
||||
containing the transformers data for the batch.
|
||||
annotations onto the [`Doc`](/api/doc), by specifying a custom
|
||||
`annotation_setter`. This callback will be called with the raw input and output
|
||||
data for the whole batch, along with the batch of `Doc` objects, allowing you to
|
||||
implement whatever you need. The annotation setter is called with a batch of
|
||||
[`Doc`](/api/doc) objects and a
|
||||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
|
||||
transformers data for the batch.
|
||||
|
||||
```python
|
||||
def custom_annotation_setter(docs, trf_data):
|
||||
|
@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
|
|||
training. To change any of the functions, like the span getter, you can replace
|
||||
the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
|
||||
process sentences. You can also register your own functions using the
|
||||
[`span_getters` registry](/api/top-level#registry). For instance, the following
|
||||
custom function returns [`Span`](/api/span) objects following sentence
|
||||
boundaries, unless a sentence succeeds a certain amount of tokens, in which case
|
||||
[`span_getters` registry](/api/top-level#registry). For instance, the following
|
||||
custom function returns [`Span`](/api/span) objects following sentence
|
||||
boundaries, unless a sentence succeeds a certain amount of tokens, in which case
|
||||
subsentences of at most `max_length` tokens are returned.
|
||||
|
||||
> #### config.cfg
|
||||
|
|
Loading…
Reference in New Issue
Block a user