Merge pull request #5996 from svlandeg/feature/docs-trf-examples [ci skip]

custom transformer examples
This commit is contained in:
Ines Montani 2020-08-29 12:37:57 +02:00 committed by GitHub
commit b6ee284376
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -179,7 +179,7 @@ interoperates with [PyTorch](https://pytorch.org) and the
giving you access to thousands of pretrained models for your pipelines. There
are many [great guides](http://jalammar.github.io/illustrated-transformer/) to
transformer models, but for practical purposes, you can simply think of them as
a drop-in replacement that let you achieve **higher accuracy** in exchange for
drop-in replacements that let you achieve **higher accuracy** in exchange for
**higher training and runtime costs**.
### Setup and installation {#transformers-installation}
@ -225,7 +225,7 @@ transformers as subnetworks directly, you can also use them via the
![The processing pipeline with the transformer component](../images/pipeline_transformer.svg)
The `Transformer` component sets the
By default, the `Transformer` component sets the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
which lets you access the transformers outputs at runtime.
@ -249,8 +249,8 @@ for doc in nlp.pipe(["some text", "some other text"]):
tokvecs = doc._.trf_data.tensors[-1]
```
You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
You can customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by changing the `annotation_setter`.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of `Doc` objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
@ -259,13 +259,15 @@ containing the transformers data for the batch.
```python
def custom_annotation_setter(docs, trf_data):
# TODO:
...
doc_data = list(trf_data.doc_data)
for doc, data in zip(docs, doc_data):
doc._.custom_attr = data
nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print() # TODO:
assert isinstance(doc._.custom_attr, TransformerData)
print(doc._.custom_attr.tensors)
```
### Training usage {#transformers-training}
@ -299,7 +301,7 @@ component:
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter
> from spacy_transformers.annotation_setters import configure_trfdata_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
@ -309,7 +311,7 @@ component:
> get_spans=get_doc_spans,
> tokenizer_config={"use_fast": True},
> ),
> annotation_setter=null_annotation_setter,
> annotation_setter=configure_trfdata_setter(),
> max_batch_items=4096,
> )
> ```
@ -329,7 +331,7 @@ tokenizer_config = {"use_fast": true}
@span_getters = "doc_spans.v1"
[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
@annotation_setters = "spacy-transformers.trfdata_setter.v1"
```
@ -343,9 +345,9 @@ in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, `name`,
`tokenizer_config` and `get_spans`.
`get_spans` is a function that takes a batch of `Doc` object and returns lists
`get_spans` is a function that takes a batch of `Doc` objects and returns lists
of potentially overlapping `Span` objects to process by the transformer. Several
[built-in functions](/api/transformer#span-getters) are available for example,
[built-in functions](/api/transformer#span_getters) are available for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
@ -366,13 +368,17 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
`span_getters` registry:
`span_getters` registry. For instance, the following custom function returns
`Span` objects following sentence boundaries, unless a sentence succeeds a
certain amount of tokens, in which case subsentences of at most `max_length`
tokens are returned.
> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
> max_length = 25
> ```
```python
@ -380,12 +386,23 @@ process sentences. You can also register your own functions using the
import spacy_transformers
@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
# TODO: write custom example
def get_sent_spans(docs):
return [list(doc.sents) for doc in docs]
def configure_custom_sent_spans(max_length: int):
def get_custom_sent_spans(docs):
spans = []
for doc in docs:
spans.append([])
for sent in doc.sents:
start = 0
end = max_length
while end <= len(sent):
spans[-1].append(sent[start:end])
start += max_length
end += max_length
if start < len(sent):
spans[-1].append(sent[start:len(sent)])
return spans
return get_sent_spans
return get_custom_sent_spans
```
To resolve the config during training, spaCy needs to know about your custom
@ -412,8 +429,8 @@ The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This
is where we'll plug in our transformer model, using the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
delegates to the `Transformer` pipeline component.
[TransformerListener](/api/architectures#TransformerListener) layer, which
sneakily delegates to the `Transformer` pipeline component.
```ini
### config.cfg (excerpt) {highlight="12"}
@ -428,18 +445,18 @@ maxout_pieces = 3
use_upper = false
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[nlp.pipeline.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```
The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
`pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer
determines how the vector for each spaCy token will be computed from the zero or
more source rows the token is aligned against. Here we use the
The [TransformerListener](/api/architectures#TransformerListener) layer expects
a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the
argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This
layer determines how the vector for each spaCy token will be computed from the
zero or more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
@ -535,8 +552,9 @@ vectors, but combines them via summation with a smaller table of learned
embeddings.
```python
from thinc.api import add, chain, remap_ids, Embed
from thinc.api import add, chain, remap_ids, Embed, FeatureExtractor
from spacy.ml.staticvectors import StaticVectors
from spacy.util import registry
@registry.architectures("my_example.MyEmbedding.v1")
def MyCustomVectors(