---
title: Transformers
teaser: Using transformer models like BERT in spaCy
menu:
- ['Installation', 'install']
- ['Runtime Usage', 'runtime']
- ['Training Usage', 'training']
next: /usage/training
---
## Installation {#install hidden="true"}
spaCy v3.0 lets you use almost **any statistical model** to power your pipeline.
You can use models implemented in a variety of
[frameworks](https://thinc.ai/docs/usage-frameworks), including TensorFlow,
PyTorch and MXNet. To keep things sane, spaCy expects models from these
frameworks to be wrapped with a common interface, using our machine learning
library [Thinc](https://thinc.ai). A transformer model is just a statistical
model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.

To use transformers with spaCy, you need the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
installed. It takes care of all the setup behind the scenes, and makes sure the
transformer pipeline component is available to spaCy.
```bash
$ pip install spacy-transformers
```
<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
## Runtime usage {#runtime}
Transformer models can be used as **drop-in replacements** for other types of
neural networks, so your spaCy pipeline can include them in a way that's
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
[`Transformer`](/api/transformer) pipeline component.
![The processing pipeline with the transformer component](../images/pipeline_transformer.svg)
The `Transformer` component sets the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
which lets you access the transformer's outputs at runtime.
```bash
$ python -m spacy download en_core_trf_lg
```
```python
### Example
import spacy

nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
    # The Transformer component stores its output on the Doc
    tokvecs = doc._.trf_data.tensors[-1]
```
You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc) by customizing the `annotation_setter`.
This callback is called with a batch of [`Doc`](/api/doc) objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
raw input and output data of the transformer for the whole batch, allowing you
to implement whatever you need.
```python
import spacy


def custom_annotation_setter(docs, trf_data):
    # docs: the batch of Doc objects
    # trf_data: the FullTransformerBatch for that batch
    # TODO:
    ...


nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print()  # TODO:
```
## Training usage {#training}
The recommended workflow for training is to use spaCy's
[config system](/usage/training#config), usually via the
[`spacy train`](/api/cli#train) command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).
<Project id="en_core_bert">
The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,
evaluate, package and visualize your model.
</Project>
The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
components and the settings used to construct them, including their model
implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:
> #### Python equivalent
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
>     nlp.vocab,
>     TransformerModel(
>         "bert-base-cased",
>         get_spans=get_doc_spans,
>         tokenizer_config={"use_fast": True},
>     ),
>     annotation_setter=null_annotation_setter,
>     max_batch_items=4096,
> )
> ```
```ini
### config.cfg (excerpt)
[components.transformer]
factory = "transformer"
max_batch_items = 4096

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "doc_spans.v1"

[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
```
The `[components.transformer.model]` block describes the `model` argument passed
to the transformer component. It's a Thinc
[`Model`](https://thinc.ai/docs/api-model) object that will be passed into the
component. Here, it references the function
[spacy-transformers.TransformerModel.v1](/api/architectures#TransformerModel)
registered in the [`architectures` registry](/api/top-level#registry). If a key
in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, those are
`name`, `tokenizer_config` and `get_spans`.
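
The resolution step described above can be illustrated with a toy resolver in plain Python. This is only a sketch of the idea, not how spaCy or Thinc implement it: the `REGISTRIES` dict, the `register` decorator, the `resolve` function and the `toy.*` names are all hypothetical, and the "model" is a plain dict rather than a Thinc `Model`.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the function registries (not the real spaCy API).
REGISTRIES: Dict[str, Dict[str, Callable[..., Any]]] = {
    "architectures": {},
    "span_getters": {},
}


def register(registry: str, name: str) -> Callable:
    """Decorator that stores a function in a registry under a string name."""
    def decorator(func: Callable) -> Callable:
        REGISTRIES[registry][name] = func
        return func
    return decorator


@register("span_getters", "toy_doc_spans.v1")
def configure_doc_spans() -> Callable:
    def get_doc_spans(docs):
        # One "span" per document: the whole document.
        return [[doc] for doc in docs]
    return get_doc_spans


@register("architectures", "toy.TransformerModel.v1")
def make_model(name: str, tokenizer_config: dict, get_spans: Callable) -> dict:
    # A real creation function would build a Thinc Model here.
    return {"name": name, "tokenizer_config": tokenizer_config, "get_spans": get_spans}


def resolve(block: Dict[str, Any]) -> Any:
    """Resolve a block: a key starting with "@" names the function to call,
    all other keys become its arguments. Nested blocks are resolved first."""
    func = None
    args: Dict[str, Any] = {}
    for key, value in block.items():
        if key.startswith("@"):
            func = REGISTRIES[key[1:]][value]
        elif isinstance(value, dict):
            args[key] = resolve(value)
        else:
            args[key] = value
    return func(**args) if func is not None else args


config = {
    "@architectures": "toy.TransformerModel.v1",
    "name": "bert-base-cased",
    "tokenizer_config": {"use_fast": True},
    "get_spans": {"@span_getters": "toy_doc_spans.v1"},
}
model = resolve(config)
```

Note how the nested `get_spans` block is turned into a function before the outer function is called, so the outer function receives a ready-to-use callable.
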
`get_spans` is a function that takes a batch of `Doc` objects and returns lists
of potentially overlapping `Span` objects for the transformer to process. Several
[built-in functions](/api/transformer#span-getters) are available, for example
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
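
To make the shape of a span getter's output more concrete, here is a dependency-free sketch. It assumes documents are plain lists of token strings and represents each span as a `(start, end)` offset pair instead of a real `Span` object. The function names `doc_spans` and `strided_spans`, and the `window` and `stride` parameters, are illustrative only and not the library's built-ins.

```python
from typing import List, Sequence, Tuple

TokenSpan = Tuple[int, int]  # (start, end) token offsets, end exclusive


def doc_spans(docs: Sequence[Sequence[str]]) -> List[List[TokenSpan]]:
    """One span per document, covering all of its tokens."""
    return [[(0, len(doc))] for doc in docs]


def strided_spans(
    docs: Sequence[Sequence[str]], window: int, stride: int
) -> List[List[TokenSpan]]:
    """Fixed-size windows per document; they overlap whenever stride < window."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    batch = []
    for doc in docs:
        spans = []
        start = 0
        while start < len(doc):
            spans.append((start, min(start + window, len(doc))))
            if start + window >= len(doc):
                break  # the last window already reaches the end of the doc
            start += stride
        batch.append(spans)
    return batch


docs = [["a", "b", "c", "d", "e"], ["x", "y"]]
whole = doc_spans(docs)
windows = strided_spans(docs, window=4, stride=2)
```

Both functions return one list of spans per input document, which mirrors the "lists of potentially overlapping spans" described above.
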
<Infobox variant="warning">
Remember that the `config.cfg` used for training should contain **no missing
values** and requires all settings to be defined. You don't want any hidden
defaults creeping in and changing your results! spaCy will tell you if settings
are missing, and you can run [`spacy init config`](/api/cli#init-config) to
automatically fill in all defaults.
</Infobox>
### Customizing the settings {#training-custom-settings}
To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function, e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
`span_getters` registry:
> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
> ```
```python
### code.py
import spacy_transformers


@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    # TODO: write custom example
    def get_sent_spans(docs):
        # One list of sentence spans per Doc in the batch
        return [list(doc.sents) for doc in docs]

    return get_sent_spans
```
To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the `--code` argument that can point to
a Python file. For more details on training with custom code, see the
[training documentation](/usage/training#custom-code).
```bash
$ python -m spacy train ./config.cfg --code ./code.py
```
### Customizing the model implementations {#training-custom-model}
The [`Transformer`](/api/transformer) component expects a Thinc
[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model`
argument. You're not limited to the implementation provided by
`spacy-transformers`. The only requirement is that your registered function
must return an object of type `Model[List[Doc], FullTransformerBatch]`: that is,
a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
transformer data.
> #### Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like
> `Model[List[Doc], List[Floats2d]]`. This so-called generic type describes the
> layer and its input and output type. In this case, it takes a list of `Doc`
> objects as the input and returns a list of 2-dimensional arrays of floats as
> the output. You can read more about defining Thinc
> models [here](https://thinc.ai/docs/usage-models). Also see the
> [type checking](https://thinc.ai/docs/usage-type-checking) documentation for
> how to enable linting in your editor to see live feedback if your inputs and
> outputs don't match.
The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This
is where we'll plug in our transformer model, using the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
delegates to the `Transformer` pipeline component.
```ini
### config.cfg (excerpt) {highlight="12"}
[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```
The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
`pooling`, which needs to be of type `Model[Ragged, Floats2d]`. This layer
determines how the vector for each spaCy token will be computed from the zero or
more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. You could instead use `reduce_last`,
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
function you write yourself.
<!--TODO: reduce_last: undocumented? -->
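
As a rough illustration of what mean pooling does here, the following plain-Python sketch averages the wordpiece rows aligned to each token. It uses nested lists instead of Thinc's `Ragged` and `Floats2d` types, the name `reduce_mean_rows` is hypothetical, and returning a zero vector for a token with no aligned rows is an assumption of this sketch rather than documented behavior.

```python
from typing import List


def reduce_mean_rows(
    rows: List[List[float]], alignment: List[List[int]], width: int
) -> List[List[float]]:
    """Compute one vector per token by averaging its aligned wordpiece rows.

    rows: one vector per wordpiece.
    alignment: for each token, the indices of the rows it is aligned against.
    """
    pooled = []
    for row_ids in alignment:
        if not row_ids:
            # Assumption: a token with zero aligned rows gets a zero vector.
            pooled.append([0.0] * width)
            continue
        summed = [sum(rows[i][d] for i in row_ids) for d in range(width)]
        pooled.append([value / len(row_ids) for value in summed])
    return pooled


wordpiece_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Token 0 maps to row 0, token 1 to rows 1 and 2, token 2 to no rows.
token_alignment = [[0], [1, 2], []]
token_vectors = reduce_mean_rows(wordpiece_rows, token_alignment, width=2)
```

Swapping the average for a maximum or for the last aligned row would give the other pooling strategies mentioned above.
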
You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
**equally weighted**. You can control this with the `grad_factor` setting, which
lets you reweight the gradients from the different listeners. For instance,
setting `grad_factor = 0` would disable gradients from one of the listeners,
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.
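
The effect of `grad_factor` can be sketched in plain Python. The example assumes that the gradients from all listeners are summed after scaling, which is a simplification for illustration. The names `combine_gradients` and `freeze_then_train` are hypothetical, and the schedule is modeled as a simple generator.

```python
from itertools import islice
from typing import Dict, Iterator, List


def combine_gradients(
    listener_grads: Dict[str, List[float]], grad_factors: Dict[str, float]
) -> List[float]:
    """Scale each listener's gradient by its grad_factor, then sum them."""
    width = len(next(iter(listener_grads.values())))
    total = [0.0] * width
    for name, grad in listener_grads.items():
        factor = grad_factors.get(name, 1.0)  # default: equally weighted
        for i, value in enumerate(grad):
            total[i] += factor * value
    return total


def freeze_then_train(freeze_steps: int, factor: float = 1.0) -> Iterator[float]:
    """Schedule that yields 0.0 for the first steps, then a constant factor."""
    step = 0
    while True:
        yield 0.0 if step < freeze_steps else factor
        step += 1


grads = {"ner": [1.0, 2.0], "tagger": [0.5, 0.5]}
# Double the NER gradients and disable the tagger's gradients.
combined = combine_gradients(grads, {"ner": 2.0, "tagger": 0.0})
first_steps = list(islice(freeze_then_train(freeze_steps=2), 4))
```

With a factor of `0.0` during the first steps, the shared parameters receive no updates from that listener, which is the "freeze at the start of training" behavior described above.
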