various fixes

svlandeg 2020-08-27 19:24:44 +02:00
parent 329e490560
commit 556e975a30
2 changed files with 38 additions and 37 deletions


@@ -49,8 +49,8 @@ The default config is defined by the pipeline component factory and describes
 how the component should be configured. You can override its settings via the
 `config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
 [`config.cfg` for training](/usage/training#config). See the
-[model architectures](/api/architectures) documentation for details on the
-architectures and their arguments and hyperparameters.
+[model architectures](/api/architectures#transformers) documentation for details
+on the transformer architectures and their arguments and hyperparameters.
 
 > #### Example
 >
@@ -60,11 +60,11 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```
 
 | Setting             | Description |
 | ------------------- | ----------- |
 | `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
 | `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
 
 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -97,18 +97,19 @@ Construct a `Transformer` component. One or more subsequent spaCy components can
 use the transformer outputs as features in its model, with gradients
 backpropagated to the single shared weights. The activations from the
 transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
-attribute. You can also provide a callback to set additional annotations. In
-your application, you would normally use a shortcut for this and instantiate the
-component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
+attribute by default, but you can provide a different `annotation_setter` to
+customize this behaviour. In your application, you would normally use a shortcut
+and instantiate the component using its string name and
+[`nlp.add_pipe`](/api/language#create_pipe).
 
 | Name | Description |
 | ---- | ----------- |
 | `vocab` | The shared vocabulary. ~~Vocab~~ |
 | `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
 | _keyword-only_ | |
 | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
 | `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ |
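Since `max_batch_items` caps the size of a *padded* batch, the relevant quantity is the number of sequences times the longest sequence in the group, not the raw token count. A stdlib-only sketch of that budgeting logic (the function name and greedy strategy are illustrative, not spacy-transformers' actual implementation):

```python
def split_by_padded_size(lengths, max_batch_items):
    """Greedily group sequence lengths so that each group's padded size
    (number of sequences * longest sequence in the group) stays within budget.
    Illustrative only; not the library's real batching code."""
    batches, current = [], []
    for length in lengths:
        candidate = current + [length]
        if current and len(candidate) * max(candidate) > max_batch_items:
            batches.append(current)
            current = [length]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

# Four sequences, padded-size budget of 200 "items":
batches = split_by_padded_size([50, 60, 100, 10], 200)
```

Note how the short 10-token sequence still "costs" 100 items once it is padded up to the 100-token sequence sharing its batch.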
 
 ## Transformer.\_\_call\_\_ {#call tag="method"}
@@ -204,8 +205,9 @@ modifying them.
 Assign the extracted features to the Doc objects. By default, the
 [`TransformerData`](/api/transformer#transformerdata) object is written to the
-[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter
-callback is then called, if provided.
+[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
+customized by providing a different `annotation_setter` argument upon
+construction.
 
 > #### Example
 >
@@ -382,9 +384,8 @@ return tensors that refer to a whole padded batch of documents. These tensors
 are wrapped into the
 [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
 `FullTransformerBatch` then splits out the per-document data, which is handled
-by this class. Instances of this class
-are`typically assigned to the [Doc._.trf_data`](/api/transformer#custom-attributes)
-extension attribute.
+by this class. Instances of this class are typically assigned to the
+[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.
 
 | Name | Description |
 | --------- | ----------- |
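The per-document split can be pictured with a stdlib sketch: given how many output rows belong to each `Doc`, slice the flat batch-level output back into one chunk per document (the function name and list-based "tensors" are illustrative stand-ins for the library's actual arrays):

```python
def split_by_doc(batch_rows, rows_per_doc):
    """Slice a flat, batch-level list of output rows back into one
    chunk per Doc. Illustrative only; real data lives in arrays."""
    chunks, start = [], 0
    for n_rows in rows_per_doc:
        chunks.append(batch_rows[start:start + n_rows])
        start += n_rows
    return chunks

# Five rows in the batch, belonging to two docs of 2 and 3 rows:
chunks = split_by_doc([[0.1], [0.2], [0.3], [0.4], [0.5]], [2, 3])
```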
@@ -446,8 +447,9 @@ overlap, and you can also omit sections of the Doc if they are not relevant.
 Span getters can be referenced in the `[components.transformer.model.get_spans]`
 block of the config to customize the sequences processed by the transformer. You
-can also register custom span getters using the `@spacy.registry.span_getters`
-decorator.
+can also register
+[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings)
+using the `@spacy.registry.span_getters` decorator.
 
 > #### Example
 >
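The contract behind a span getter is simple: take a batch of `Doc`s and return one list of spans per `Doc`, where spans may overlap or skip material. A stdlib sketch of a sliding-window getter, using plain strings in place of real `Doc`/`Span` objects (names and parameters here are illustrative, not the library's API):

```python
def window_get_spans(texts, window, stride):
    """Return one list of (possibly overlapping) windows per 'doc'.
    Real spaCy span getters receive Doc objects and return Span objects."""
    all_spans = []
    for text in texts:
        spans = []
        for start in range(0, max(len(text) - window, 0) + 1, stride):
            spans.append(text[start:start + window])
        all_spans.append(spans)
    return all_spans

spans = window_get_spans(["abcdef"], window=4, stride=2)
```

With a stride smaller than the window, adjacent spans overlap, which is exactly the situation the alignment and pooling machinery described above has to handle.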
@@ -522,8 +524,7 @@ Annotation setters are functions that take a batch of `Doc` objects and a
 annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
 register custom annotation setters using the `@registry.annotation_setters`
 decorator. The default annotation setter used by the `Transformer` pipeline
-component is `trfdata_setter`, which sets the custom `Doc._.transformer_data`
-attribute.
+component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
 
 > #### Example
 >
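The annotation setter contract can likewise be sketched with stand-in objects: the callback receives the batch of `Doc`s plus the transformer output and writes whatever it likes onto each `Doc`. The `FakeDoc` class below is purely illustrative; real code registers and sets extension attributes such as `Doc._.trf_data`:

```python
class FakeDoc:
    """Stand-in for spacy.tokens.Doc and its `._` extension namespace."""
    def __init__(self):
        self.underscore = {}

def custom_annotation_setter(docs, per_doc_data):
    # Same shape as the documented callback:
    # (batch of docs, transformer output) -> None, mutating the docs.
    for doc, data in zip(docs, per_doc_data):
        doc.underscore["trf_data"] = data

docs = [FakeDoc(), FakeDoc()]
custom_annotation_setter(docs, [[0.1, 0.2], [0.3]])
```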
@@ -554,6 +555,6 @@ The following built-in functions are available:
 The component sets the following
 [custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
 
 | Name | Description |
-| -------------- | ------------------------------------------------------------------------ |
-| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
+| ---------------- | ------------------------------------------------------------------------ |
+| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |


@@ -429,8 +429,8 @@ The same idea applies to task models that power the **downstream components**.
 Most of spaCy's built-in model creation functions support a `tok2vec` argument,
 which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This
 is where we'll plug in our transformer model, using the
-[TransformerListener](/api/architectures#TransformerListener) layer, which sneakily
-delegates to the `Transformer` pipeline component.
+[TransformerListener](/api/architectures#TransformerListener) layer, which
+sneakily delegates to the `Transformer` pipeline component.
 
 ```ini
 ### config.cfg (excerpt) {highlight="12"}
@@ -452,11 +452,11 @@ grad_factor = 1.0
 @layers = "reduce_mean.v1"
 ```
 
-The [TransformerListener](/api/architectures#TransformerListener) layer expects a
-[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
-`pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer
-determines how the vector for each spaCy token will be computed from the zero or
-more source rows the token is aligned against. Here we use the
+The [TransformerListener](/api/architectures#TransformerListener) layer expects
+a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the
+argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This
+layer determines how the vector for each spaCy token will be computed from the
+zero or more source rows the token is aligned against. Here we use the
 [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
 averages the wordpiece rows. We could instead use
 [`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
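The pooling step can be illustrated with plain lists: each spaCy token aligns to zero or more wordpiece rows, and a `reduce_mean`-style reduction averages those rows, falling back to zeros for unaligned tokens. A stdlib sketch under those assumptions, not Thinc's actual `Ragged`-based implementation:

```python
def mean_pool(rows, alignment):
    """Average the wordpiece rows aligned to each token.
    alignment[i] lists the row indices aligned to token i; a token
    aligned to no rows gets a zero vector. Illustrative only."""
    width = len(rows[0])
    pooled = []
    for row_ids in alignment:
        if not row_ids:
            pooled.append([0.0] * width)  # no aligned wordpiece rows
        else:
            pooled.append([
                sum(rows[i][d] for i in row_ids) / len(row_ids)
                for d in range(width)
            ])
    return pooled

# Token 0 aligns to both wordpiece rows, token 1 to none:
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]], [[0, 1], []])
```

Swapping the averaging branch for a per-dimension `max` would give the `reduce_max` behaviour mentioned above.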