---
title: Transformers
teaser: Using transformer models like BERT in spaCy
menu:
  - ['Installation', 'install']
  - ['Runtime Usage', 'runtime']
  - ['Training Usage', 'training']
next: /usage/training
---

## Installation {#install hidden="true"}

Transformers are a family of neural network architectures that compute dense,
context-sensitive representations for the tokens in your documents. Downstream
models in your pipeline can then use these representations as input features to
improve their predictions. You can connect multiple components to a single
transformer model, with any or all of those components giving feedback to the
transformer to fine-tune it to your tasks. spaCy's transformer support
interoperates with PyTorch and the
[Huggingface transformers](https://huggingface.co/transformers/) library,
giving you access to thousands of pretrained models for your pipelines. There
are many [great guides](http://jalammar.github.io/illustrated-transformer/) to
transformer models, but for practical purposes, you can simply think of them as
a drop-in replacement that lets you achieve higher accuracy in exchange for
higher training and runtime costs.

## System requirements

We recommend an NVIDIA GPU with at least 10GB of memory in order to work with
transformer models. The exact requirements will depend on the transformer
model you choose and whether you're training the pipeline or simply running it.
Training a transformer-based model without a GPU will be too slow for most
practical purposes. You'll also need to make sure your GPU drivers are up to
date and v9+ of the CUDA runtime is installed.

Once you have CUDA installed, you'll need to install two pip packages, `cupy`
and `spacy-transformers`. [CuPy](https://docs.cupy.dev/en/stable/install.html)
is just like `numpy`, but for GPU. The best way to install it is to choose a
wheel that matches the version of CUDA you're using. You may also need to set
the `CUDA_PATH` environment variable if your CUDA runtime is installed in a
non-standard location. Putting it all together, if you had installed CUDA 10.2
in `/opt/nvidia/cuda`, you would run:

```
export CUDA_PATH="/opt/nvidia/cuda"
pip install cupy-cuda102
pip install spacy-transformers
```
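
The wheel name in the command above follows a simple pattern: `cupy-cuda`
followed by the CUDA version with the dot removed. As a small sketch, the
helper below derives the package name from a version string. The function
`cupy_wheel_name` is hypothetical and not part of spaCy or CuPy, and it assumes
the per-version naming scheme used in this guide. Other CuPy releases may name
their wheels differently, so check the CuPy installation guide before relying
on it.

```python
def cupy_wheel_name(cuda_version: str) -> str:
    """Return the CuPy wheel name for a CUDA version such as "10.2".

    Hypothetical helper: assumes wheels are named "cupy-cuda" plus the
    major and minor version digits, as in "cupy-cuda102".
    """
    major, minor = cuda_version.split(".")[:2]
    return f"cupy-cuda{major}{minor}"


print(cupy_wheel_name("10.2"))
```

For CUDA 10.2 this produces the same package name as the `pip install` line
above.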

Provisioning a new machine will require about 5GB of data to be downloaded in
total: 3GB for the CUDA runtime, 800MB for PyTorch, 400MB for CuPy, 500MB for
the transformer weights, and about 200MB for spaCy and its various requirements.
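
To make the download budget easy to adapt, the figures above can be summed in a
few lines. This is only a sketch: the sizes are the approximate values quoted in
this guide, the dictionary name is made up for illustration, and 1GB is treated
as 1000MB.

```python
# Approximate download sizes in megabytes, taken from the paragraph above.
DOWNLOAD_SIZES_MB = {
    "CUDA runtime": 3000,
    "PyTorch": 800,
    "CuPy": 400,
    "transformer weights": 500,
    "spaCy and requirements": 200,
}

# Add up all components to get the total provisioning download.
total_mb = sum(DOWNLOAD_SIZES_MB.values())
print(f"Total download: about {total_mb / 1000:.1f}GB")
```

The components add up to 4.9GB, which is where the "about 5GB" estimate comes
from. Swapping in a larger transformer changes only the weights entry.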

## Runtime usage {#runtime}

Transformer models can be used as **drop-in replacements** for other types of
neural networks, so your spaCy pipeline can include them in a way that's
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
[`Transformer`](/api/transformer) pipeline component.

![The processing pipeline with the transformer component](../images/pipeline_transformer.svg)

The `Transformer` component sets the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
which lets you access the transformer's outputs at runtime.

```bash
$ python -m spacy download en_core_trf_lg
```

```python
### Example
import spacy
from thinc.api import use_pytorch_for_gpu_memory, require_gpu

# Use the GPU, with memory allocations directed via PyTorch.
# This prevents out-of-memory errors that would otherwise occur from competing
# memory pools.
use_pytorch_for_gpu_memory()
require_gpu(0)

nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]
```

You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
This callback is called with the batch of [`Doc`](/api/doc) objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
transformer's raw input and output data for the whole batch, allowing you to
implement whatever you need.

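```python
import spacy
from spacy.tokens import Doc

# "custom_attr" is a hypothetical attribute name used for illustration
Doc.set_extension("custom_attr", default=None)

def custom_annotation_setter(docs, trf_data):
    # Assumes trf_data.doc_data yields one data object per Doc in the batch
    for doc, data in zip(docs, trf_data.doc_data):
        doc._.custom_attr = data

nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print(doc._.custom_attr)
```
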
## Training usage {#training}

The recommended workflow for training is to use spaCy's
[config system](/usage/training#config), usually via the
[`spacy train`](/api/cli#train) command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).

<Project id="en_core_bert">

The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,
evaluate, package and visualize your model.

</Project>

The `[components]` section in the [`config.cfg`](/api/data-formats#config)
describes the pipeline components and the settings used to construct them,
including their model implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:

> #### Python equivalent
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
>     nlp.vocab,
>     TransformerModel(
>         "bert-base-cased",
>         get_spans=get_doc_spans,
>         tokenizer_config={"use_fast": True},
>     ),
>     annotation_setter=null_annotation_setter,
>     max_batch_items=4096,
> )
> ```

```ini
### config.cfg (excerpt)
[components.transformer]
factory = "transformer"
max_batch_items = 4096

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "doc_spans.v1"

[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
```

The `[components.transformer.model]` block describes the `model` argument passed
to the transformer component. It's a Thinc
[`Model`](https://thinc.ai/docs/api-model) object that will be passed into the
component. Here, it references the function
[spacy-transformers.TransformerModel.v1](/api/architectures#TransformerModel)
registered in the [`architectures` registry](/api/top-level#registry). If a key
in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, those are
`name`, `tokenizer_config` and `get_spans`.
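
To make the `@` rule concrete, here is a minimal, self-contained sketch of how
such a resolution step could work. It is not spaCy's or Thinc's implementation:
the `REGISTRIES` dictionary, the `register` decorator, the `resolve` function
and the `toy...` names are all made up for illustration, and the "model" is just
a dictionary that records its arguments.

```python
from typing import Any, Callable, Dict

# Toy registries: registry name -> function name -> callable (illustrative only).
REGISTRIES: Dict[str, Dict[str, Callable[..., Any]]] = {
    "architectures": {},
    "span_getters": {},
}


def register(registry: str, name: str):
    """Decorator that stores a function under the given registry and name."""
    def decorator(func):
        REGISTRIES[registry][name] = func
        return func
    return decorator


@register("span_getters", "toy_doc_spans.v1")
def configure_toy_doc_spans():
    # Stand-in span getter: treats each whole document as a single span.
    def get_spans(docs):
        return [[doc] for doc in docs]
    return get_spans


@register("architectures", "toy.TransformerModel.v1")
def toy_transformer_model(name, tokenizer_config, get_spans):
    # Stand-in for a model factory: it only records what it was given.
    return {"name": name, "tokenizer_config": tokenizer_config, "get_spans": get_spans}


def resolve(block: Dict[str, Any]) -> Any:
    """Resolve nested blocks first, then call the function named by an @ key."""
    settings = {
        key: resolve(value) if isinstance(value, dict) else value
        for key, value in block.items()
    }
    for key in settings:
        if key.startswith("@"):
            func = REGISTRIES[key[1:]][settings[key]]
            kwargs = {k: v for k, v in settings.items() if k != key}
            return func(**kwargs)
    return settings


config_block = {
    "@architectures": "toy.TransformerModel.v1",
    "name": "bert-base-cased",
    "tokenizer_config": {"use_fast": True},
    "get_spans": {"@span_getters": "toy_doc_spans.v1"},
}
model = resolve(config_block)
print(model["name"])
```

The nested `get_spans` block is resolved before its parent, so the outer
function receives an already-created callable rather than a raw dictionary of
settings.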

`get_spans` is a function that takes a batch of `Doc` objects and returns lists
of potentially overlapping `Span` objects to process by the transformer. Several
[built-in functions](/api/transformer#span-getters) are available – for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
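
As an illustration of what "potentially overlapping" spans can look like, the
sketch below cuts each document into fixed-size windows that advance by a
smaller stride. It uses plain lists of strings as stand-ins for `Doc` and `Span`
objects so that it runs without spaCy. The function name and its `window` and
`stride` parameters are assumptions made for this example, not a description of
the built-in span getters.

```python
from typing import List, Sequence


def get_strided_spans(
    docs: Sequence[Sequence[str]], window: int, stride: int
) -> List[List[List[str]]]:
    """Split each document into windows of `window` tokens, advancing by
    `stride` tokens each time. Windows overlap when stride < window."""
    batch_spans = []
    for doc in docs:
        spans = []
        start = 0
        while start < len(doc):
            spans.append(list(doc[start : start + window]))
            # Stop once the current window reaches the end of the document.
            if start + window >= len(doc):
                break
            start += stride
        batch_spans.append(spans)
    return batch_spans


docs = [["a", "b", "c", "d", "e"], ["x", "y"]]
spans = get_strided_spans(docs, window=3, stride=2)
print(spans)
```

The return value has one list of spans per document, which is the shape the
paragraph above describes. In the first document the token `"c"` appears in two
spans.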

<Infobox variant="warning">

Remember that the `config.cfg` used for training should contain **no missing
values** and requires all settings to be defined. You don't want any hidden
defaults creeping in and changing your results! spaCy will tell you if settings
are missing, and you can run
[`spacy init fill-config`](/api/cli#init-fill-config) to automatically fill in
all defaults.

</Infobox>

### Customizing the settings {#training-custom-settings}

To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
`span_getters` registry:

> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
> ```

```python
### code.py
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    # Return one list of sentence spans per Doc. Iterating over doc.sents
    # requires sentence boundaries to be set earlier in the pipeline.
    def get_sent_spans(docs):
        return [list(doc.sents) for doc in docs]

    return get_sent_spans
```

To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the `--code` argument, which can point
to a Python file. For more details on training with custom code, see the
[training documentation](/usage/training#custom-code).

```bash
$ python -m spacy train ./config.cfg --code ./code.py
```

### Customizing the model implementations {#training-custom-model}

The [`Transformer`](/api/transformer) component expects a Thinc
[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model`
argument. You're not limited to the implementation provided by
`spacy-transformers` – the only requirement is that your registered function
must return an object of type `Model[List[Doc], FullTransformerBatch]`: that is,
a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
transformer data.

> #### Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like
> `Model[List[Doc], List[Floats2d]]`. This so-called generic type describes the
> layer and its input and output type – in this case, it takes a list of `Doc`
> objects as the input and a list of 2-dimensional arrays of floats as the
> output. You can read more about defining Thinc models
> [here](https://thinc.ai/docs/usage-models). Also see the
> [type checking](https://thinc.ai/docs/usage-type-checking) documentation for
> how to enable linting in your editor to see live feedback if your inputs and
> outputs don't match.

The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This
is where we'll plug in our transformer model, using the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
delegates to the `Transformer` pipeline component.

```ini
### config.cfg (excerpt) {highlight="12"}
[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```

The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
`pooling`, which needs to be of type `Model[Ragged, Floats2d]`. This layer
determines how the vector for each spaCy token will be computed from the zero or
more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use `reduce_last`,
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
function you write yourself.
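
The following sketch shows the idea behind mean pooling in plain Python, without
Thinc. Each token has a list of indices pointing at the wordpiece rows it is
aligned against, and its vector is the average of those rows. The function name
and argument layout are invented for this example, and returning a zero vector
for a token with no aligned rows is an assumption of the sketch rather than a
statement about how `reduce_mean` behaves.

```python
from typing import List


def mean_pool_rows(
    rows: List[List[float]], alignment: List[List[int]], width: int
) -> List[List[float]]:
    """Compute one vector per token by averaging its aligned wordpiece rows.

    alignment[i] lists the indices of the rows aligned to token i. A token
    with no aligned rows gets a zero vector (an assumption of this sketch).
    """
    token_vectors = []
    for row_indices in alignment:
        if not row_indices:
            token_vectors.append([0.0] * width)
            continue
        # Sum the aligned rows dimension by dimension, then divide by the count.
        summed = [0.0] * width
        for index in row_indices:
            for dim, value in enumerate(rows[index]):
                summed[dim] += value
        token_vectors.append([value / len(row_indices) for value in summed])
    return token_vectors


wordpiece_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alignment = [[0, 1], [2], []]  # token 0 spans two wordpieces, token 2 has none
token_vectors = mean_pool_rows(wordpiece_rows, alignment, width=2)
print(token_vectors)
```

Swapping the averaging step for a maximum, or for picking the last aligned row,
gives the behavior of the other pooling strategies mentioned above.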

You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
**equally weighted**. You can control this with the `grad_factor` setting, which
lets you reweight the gradients from the different listeners. For instance,
setting `grad_factor = 0` would disable gradients from one of the listeners,
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.
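
The effect of `grad_factor` can be pictured as a weighted sum of the gradients
coming back from each listener. The sketch below is a simplified illustration
using plain lists of floats. It is not how spaCy implements the listeners, and
the function names, the listener names and the `freeze_steps` parameter are
assumptions made for this example.

```python
from typing import Dict, Iterator, List


def combine_listener_gradients(
    gradients: Dict[str, List[float]], grad_factors: Dict[str, float]
) -> List[float]:
    """Sum the listeners' gradients, each scaled by its grad_factor."""
    width = len(next(iter(gradients.values())))
    total = [0.0] * width
    for listener, gradient in gradients.items():
        # Listeners without an explicit factor are weighted equally (1.0).
        factor = grad_factors.get(listener, 1.0)
        for dim, value in enumerate(gradient):
            total[dim] += factor * value
    return total


def warmup_factor(freeze_steps: int, factor: float = 1.0) -> Iterator[float]:
    """Yield 0.0 for the first `freeze_steps` steps, then `factor` forever."""
    step = 0
    while True:
        yield 0.0 if step < freeze_steps else factor
        step += 1


gradients = {"ner": [0.5, -1.0], "tagger": [1.0, 2.0]}
combined = combine_listener_gradients(gradients, {"ner": 2.0, "tagger": 0.0})
schedule = warmup_factor(freeze_steps=2)
first_steps = [next(schedule) for _ in range(4)]
print(combined, first_steps)
```

Here the `ner` gradient is doubled and the `tagger` gradient is switched off.
The schedule yields a factor of zero for the first two steps, which corresponds
to freezing the shared parameters at the start of training.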