12 KiB
title | teaser | menu | next | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Transformers | Using transformer models like BERT in spaCy |
|
/usage/training |
Installation
Transformers are a family of neural network architectures that compute dense, context-sensitive representations for the tokens in your documents. Downstream models in your pipeline can then use these representations as input features to improve their predictions. You can connect multiple components to a single transformer model, with any or all of those components giving feedback to the transformer to fine-tune it to your tasks. spaCy's transformer support interoperates with PyTorch and the Huggingface transformers library, giving you access to thousands of pretrained models for your pipelines. There are many great guides to transformer models, but for practical purposes, you can simply think of them as a drop-in replacement that let you achieve higher accuracy in exchange for higher training and runtime costs.
System requirements
We recommend an NVIDIA GPU with at least 10GB of memory in order to work with transformer models. The exact requirements will depend on the transformer you model you choose and whether you're training the pipeline or simply running it. Training a transformer-based model without a GPU will be too slow for most practical purposes. You'll also need to make sure your GPU drivers are up-to-date and v9+ of the CUDA runtime is installed.
Once you have CUDA installed, you'll need to install two pip packages, cupy
and spacy-transformers
. CuPy
is just like numpy
, but for GPU. The best way to install it is to choose a
wheel that matches the version of CUDA you're using. You may also need to set the
CUDA_PATH
environment variable if your CUDA runtime is installed in
a non-standard location. Putting it all together, if you had installed CUDA 10.2
in /opt/nvidia/cuda
, you would run:
export CUDA_PATH="/opt/nvidia/cuda"
pip install cupy-cuda102
pip install spacy-transformers
Provisioning a new machine will require about 5GB of data to be downloaded in total: 3GB for the CUDA runtime, 800MB for PyTorch, 400MB for CuPy, 500MB for the transformer weights, and about 200MB for spaCy and its various requirements.
Runtime usage
Transformer models can be used as drop-in replacements for other types of
neural networks, so your spaCy pipeline can include them in a way that's
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
Transformer
pipeline component.
The Transformer
component sets the
Doc._.trf_data
extension attribute,
which lets you access the transformers outputs at runtime.
$ python -m spacy download en_core_trf_lg
### Example
import spacy
from thinc.api import use_pytorch_for_gpu_memory, require_gpu
# Use the GPU, with memory allocations directed via PyTorch.
# This prevents out-of-memory errors that would otherwise occur from competing
# memory pools.
use_pytorch_for_gpu_memory()
require_gpu(0)
nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
tokvecs = doc._.trf_data.tensors[-1]
You can also customize how the Transformer
component sets
annotations onto the Doc
, by customizing the annotation_setter
.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of Doc
objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of Doc
objects and a FullTransformerBatch
containing the transformers data for the batch.
def custom_annotation_setter(docs, trf_data):
# TODO:
...
nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print() # TODO:
Training usage
The recommended workflow for training is to use spaCy's
config system, usually via the
spacy train
command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the training quickstart.
The easiest way to get started is to clone a transformers-based project template. Swap in your data, edit the settings and hyperparameters and train, evaluate, package and visualize your model.
The [components]
section in the config.cfg
describes the pipeline components and the settings used to construct them,
including their model implementation. Here's a config snippet for the
Transformer
component, along with matching Python code. In
this case, the [components.transformer]
block describes the transformer
component:
Python equivalent
from spacy_transformers import Transformer, TransformerModel from spacy_transformers.annotation_setters import null_annotation_setter from spacy_transformers.span_getters import get_doc_spans trf = Transformer( nlp.vocab, TransformerModel( "bert-base-cased", get_spans=get_doc_spans, tokenizer_config={"use_fast": True}, ), annotation_setter=null_annotation_setter, max_batch_items=4096, )
### config.cfg (excerpt)
[components.transformer]
factory = "transformer"
max_batch_items = 4096
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "doc_spans.v1"
[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformer.null_annotation_setter.v1"
The [components.transformer.model]
block describes the model
argument passed
to the transformer component. It's a Thinc
Model
object that will be passed into the
component. Here, it references the function
spacy-transformers.TransformerModel.v1
registered in the architectures
registry. If a key
in a block starts with @
, it's resolved to a function and all other
settings are passed to the function as arguments. In this case, name
,
tokenizer_config
and get_spans
.
get_spans
is a function that takes a batch of Doc
object and returns lists
of potentially overlapping Span
objects to process by the transformer. Several
built-in functions are available – for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
Remember that the config.cfg
used for training should contain no missing
values and requires all settings to be defined. You don't want any hidden
defaults creeping in and changing your results! spaCy will tell you if settings
are missing, and you can run
spacy init fill-config
to automatically fill in
all defaults.
Customizing the settings
To change any of the settings, you can edit the config.cfg
and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function – e.g. @span_getters = "sent_spans.v1"
to
process sentences. You can also register your own functions using the
span_getters
registry:
config.cfg
[components.transformer.model.get_spans] @span_getters = "custom_sent_spans"
### code.py
import spacy_transformers
@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
# TODO: write custom example
def get_sent_spans(docs):
return [list(doc.sents) for doc in docs]
return get_sent_spans
To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the --code
argument that can point to
a Python file. For more details on training with custom code, see the
training documentation.
$ python -m spacy train ./config.cfg --code ./code.py
Customizing the model implementations
The Transformer
component expects a Thinc
Model
object to be passed in as its model
argument. You're not limited to the implementation provided by
spacy-transformers
– the only requirement is that your registered function
must return an object of type Model[List[Doc], FullTransformerBatch]
: that is,
a Thinc model that takes a list of Doc
objects, and returns a
FullTransformerBatch
object with the
transformer data.
Model type annotations
In the documentation and code base, you may come across type annotations and descriptions of Thinc model types, like
Model[List[Doc], List[Floats2d]]
. This so-called generic type describes the layer and its input and output type – in this case, it takes a list ofDoc
objects as the input and list of 2-dimensional arrays of floats as the output. You can read more about defining Thinc models here. Also see the type checking for how to enable linting in your editor to see live feedback if your inputs and outputs don't match.
The same idea applies to task models that power the downstream components.
Most of spaCy's built-in model creation functions support a tok2vec
argument,
which should be a Thinc layer of type Model[List[Doc], List[Floats2d]]
. This
is where we'll plug in our transformer model, using the
Tok2VecListener layer, which sneakily
delegates to the Transformer
pipeline component.
### config.cfg (excerpt) {highlight="12"}
[components.ner]
factory = "ner"
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0
[nlp.pipeline.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
The Tok2VecListener layer expects a
pooling layer as the argument
pooling
, which needs to be of type Model[Ragged, Floats2d]
. This layer
determines how the vector for each spaCy token will be computed from the zero or
more source rows the token is aligned against. Here we use the
reduce_mean
layer, which
averages the wordpiece rows. We could instead use reduce_last
,
reduce_max
, or a custom
function you write yourself.
You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
equally weighted. You can control this with the grad_factor
setting, which
lets you reweight the gradients from the different listeners. For instance,
setting grad_factor = 0
would disable gradients from one of the listeners,
while grad_factor = 2.0
would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.