mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 18:06:29 +03:00
Dev docs: listeners (#9061)
* Start Listeners documentation * intro tabel of different architectures * initialization, linking, dim inference * internal comm (WIP) * expand internal comm section * frozen components and replacing listeners * various small fixes * fix content table * fix link
This commit is contained in:
parent
1e9b4b55ee
commit
5af88427a2
220
extra/DEVELOPER_DOCS/Listeners.md
Normal file
220
extra/DEVELOPER_DOCS/Listeners.md
Normal file
|
@ -0,0 +1,220 @@
|
||||||
|
# Listeners
|
||||||
|
|
||||||
|
1. [Overview](#1-overview)
|
||||||
|
2. [Initialization](#2-initialization)
|
||||||
|
- [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
|
||||||
|
- [B. Shape inference](#2b-shape-inference)
|
||||||
|
3. [Internal communication](#3-internal-communication)
|
||||||
|
- [A. During prediction](#3a-during-prediction)
|
||||||
|
- [B. During training](#3b-during-training)
|
||||||
|
- [C. Frozen components](#3c-frozen-components)
|
||||||
|
4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)
|
||||||
|
|
||||||
|
## 1. Overview
|
||||||
|
|
||||||
|
Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
|
||||||
|
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one.
|
||||||
|
Both versions can be used either inline/standalone, which means that they are defined and used
|
||||||
|
by only one specific component (e.g. NER), or
|
||||||
|
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
|
||||||
|
in which case the embedding functionality becomes a separate component that can
|
||||||
|
feed embeddings to multiple components downstream, using a listener-pattern.
|
||||||
|
|
||||||
|
| Type | Usage | Model Architecture |
|
||||||
|
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||||
|
| `Tok2Vec` | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec) |
|
||||||
|
| `Tok2Vec` | listener | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener) |
|
||||||
|
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer) |
|
||||||
|
| `Transformer` | listener | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener) |
|
||||||
|
|
||||||
|
Here we discuss the listener pattern and its implementation in code in more detail.
|
||||||
|
|
||||||
|
## 2. Initialization
|
||||||
|
|
||||||
|
### 2A. Linking listeners to the embedding component
|
||||||
|
|
||||||
|
To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:
|
||||||
|
|
||||||
|
```
|
||||||
|
[components.tok2vec]
|
||||||
|
factory = "tok2vec"
|
||||||
|
|
||||||
|
[components.tok2vec.model]
|
||||||
|
@architectures = "spacy.Tok2Vec.v2"
|
||||||
|
```
|
||||||
|
|
||||||
|
A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
|
||||||
|
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:
|
||||||
|
|
||||||
|
```
|
||||||
|
[components.ner.model.tok2vec]
|
||||||
|
@architectures = "spacy.Tok2VecListener.v1"
|
||||||
|
upstream = "tok2vec"
|
||||||
|
```
|
||||||
|
|
||||||
|
When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
|
||||||
|
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
|
||||||
|
implemented in the method `nlp._link_components()` which loops over each
|
||||||
|
component in the pipeline and calls `find_listeners()` on a component if it's defined.
|
||||||
|
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
|
||||||
|
of this `find_listener()` method will specifically identify sublayers of a model definition that are of type
|
||||||
|
`Tok2VecListener` with a matching upstream name and will then add that listener to the internal `self.listener_map`.
|
||||||
|
|
||||||
|
If it's a Transformer-based pipeline, a
|
||||||
|
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
|
||||||
|
has a similar implementation but its `find_listener()` function will specifically look for `TransformerListener`
|
||||||
|
sublayers of downstream components.
|
||||||
|
|
||||||
|
### 2B. Shape inference
|
||||||
|
|
||||||
|
Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
|
||||||
|
For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config:
|
||||||
|
|
||||||
|
```
|
||||||
|
[components.ner.model.tok2vec]
|
||||||
|
@architectures = "spacy.Tok2VecListener.v1"
|
||||||
|
width = ${components.tok2vec.model.encode.width}
|
||||||
|
```
|
||||||
|
|
||||||
|
A `transformer` component however only knows its `nO` dimension after the HuggingFace transformer
|
||||||
|
is set with the function `model.attrs["set_transformer"]`,
|
||||||
|
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
|
||||||
|
by `set_pytorch_transformer`.
|
||||||
|
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
|
||||||
|
the listener's output dimension correctly.
|
||||||
|
|
||||||
|
This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
|
||||||
|
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
|
||||||
|
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
|
||||||
|
linking twice is not harmful.
|
||||||
|
|
||||||
|
## 3. Internal communication
|
||||||
|
|
||||||
|
The internal communication between a listener and its downstream components is organized by sending and
|
||||||
|
receiving information across the components - either directly or implicitly.
|
||||||
|
The details are different depending on whether the pipeline is currently training, or predicting.
|
||||||
|
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.
|
||||||
|
|
||||||
|
### 3A. During prediction
|
||||||
|
|
||||||
|
When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
|
||||||
|
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
|
||||||
|
Similarly, the `Transformer` component stores the produced embeddings
|
||||||
|
in `doc._.trf_data`. Next, the `forward` pass of a
|
||||||
|
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
|
||||||
|
or a
|
||||||
|
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
|
||||||
|
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
|
||||||
|
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
|
||||||
|
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
|
||||||
|
is slightly risky and at times might hide another bug - so it's a good spot to be aware of.
|
||||||
|
|
||||||
|
### 3B. During training
|
||||||
|
|
||||||
|
During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
|
||||||
|
annotations on the `Doc` (though since 3.1 they can if they are part of the `annotating_components` list in the config).
|
||||||
|
Instead, we rely on a caching mechanism between the original embedding component and its listener.
|
||||||
|
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
|
||||||
|
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
|
||||||
|
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
|
||||||
|
network.
|
||||||
|
|
||||||
|
We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
|
||||||
|
which means that only one such batch needs to be kept in memory for each listener.
|
||||||
|
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
|
||||||
|
runs the backpropagation, and returns the results and the gradients.
|
||||||
|
|
||||||
|
There are two ways in which this mechanism can fail, both are detected by `verify_inputs()`:
|
||||||
|
|
||||||
|
- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
|
||||||
|
training pipeline.
|
||||||
|
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.
|
||||||
|
|
||||||
|
#### Training with multiple listeners
|
||||||
|
|
||||||
|
One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
|
||||||
|
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
|
||||||
|
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listener with `receive()`, they will
|
||||||
|
send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track
|
||||||
|
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
|
||||||
|
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.
|
||||||
|
|
||||||
|
### 3C. Frozen components
|
||||||
|
|
||||||
|
The listener pattern can get particularly tricky in combination with frozen components. To detect components
|
||||||
|
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
|
||||||
|
the listeners and their upstream components and warns in two scenarios.
|
||||||
|
|
||||||
|
#### The Tok2Vec or Transformer is frozen
|
||||||
|
|
||||||
|
If the `Tok2Vec` or `Transformer` was already trained,
|
||||||
|
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
|
||||||
|
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
|
||||||
|
as a tagger or a parser. This used to be impossible before 3.1, but has become supported since then by putting the
|
||||||
|
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
|
||||||
|
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.
|
||||||
|
|
||||||
|
However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
|
||||||
|
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.
|
||||||
|
|
||||||
|
#### The upstream component is frozen
|
||||||
|
|
||||||
|
If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
|
||||||
|
the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
|
||||||
|
how to use the `replace_listeners` functionality to prevent this problem.
|
||||||
|
|
||||||
|
## 4. Replacing listener with standalone
|
||||||
|
|
||||||
|
The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
|
||||||
|
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
|
||||||
|
effectively making the downstream component independent of any other components in the pipeline.
|
||||||
|
It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`.
|
||||||
|
First, it fetches the original `Model` of the original component that creates the embeddings:
|
||||||
|
|
||||||
|
```
|
||||||
|
tok2vec = self.get_pipe(tok2vec_name)
|
||||||
|
tok2vec_model = tok2vec.model
|
||||||
|
```
|
||||||
|
|
||||||
|
Which is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
|
||||||
|
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).
|
||||||
|
|
||||||
|
In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
|
||||||
|
downstream component. However, for the `transformer`, this doesn't work.
|
||||||
|
The reason is that the `TransformerListener` architecture chains the listener with
|
||||||
|
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):
|
||||||
|
|
||||||
|
```
|
||||||
|
model = chain(
|
||||||
|
TransformerListener(upstream_name=upstream)
|
||||||
|
trfs2arrays(pooling, grad_factor),
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained inbetween the model
|
||||||
|
and `trfs2arrays`:
|
||||||
|
|
||||||
|
```
|
||||||
|
model = chain(
|
||||||
|
TransformerModel(name, get_spans, tokenizer_config),
|
||||||
|
split_trf_batch(),
|
||||||
|
trfs2arrays(pooling, grad_factor),
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
So you can't just take the model from the listener, and drop that into the component internally. You need to
|
||||||
|
adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
|
||||||
|
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
|
||||||
|
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
|
||||||
|
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:
|
||||||
|
|
||||||
|
```
|
||||||
|
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
|
||||||
|
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
|
||||||
|
...
|
||||||
|
new_model = tok2vec_model.attrs["replace_listener"](new_model)
|
||||||
|
```
|
||||||
|
|
||||||
|
The new config and model are then properly stored on the `nlp` object.
|
||||||
|
Note that this functionality (running the replacement for a transformer listener) was broken prior to
|
||||||
|
`spacy-transformers` 1.0.5.
|
Loading…
Reference in New Issue
Block a user