Merge pull request #6197 from svlandeg/feature/pipe-docs [ci skip]

This commit is contained in:
Ines Montani 2020-10-05 11:55:40 +02:00 committed by GitHub
commit 0f64556c04
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 315 additions and 23 deletions

View File

@ -226,6 +226,12 @@ the "catastrophic forgetting" problem. This feature is experimental.
Find the loss and gradient of loss for the batch of documents and their
predicted scores.
<Infobox variant="danger">
This method needs to be overwritten with your own custom `get_loss` method.
</Infobox>
> #### Example
>
> ```python

View File

@ -288,7 +288,7 @@ those parts of the network.
To use our custom model including the PyTorch subnetwork, all we need to do is
register the architecture using the
[`architectures` registry](/api/top-level#registry). This will assign the
[`architectures` registry](/api/top-level#registry). This assigns the
architecture a name so spaCy knows how to find it, and allows passing in
arguments like hyperparameters via the [config](/usage/training#config). The
full example then becomes:
@ -373,7 +373,7 @@ gpu_allocator = "pytorch"
Of course it's also possible to define the `Model` from the previous section
entirely in Thinc. The Thinc documentation provides details on the
[various layers](https://thinc.ai/docs/api-layers) and helper functions
available. Combinators can also be used to
available. Combinators can be used to
[overload operators](https://thinc.ai/docs/usage-models#operators) and a common
usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our
simple neural network would then become:
@ -486,28 +486,314 @@ with Model.define_operators({">>": chain}):
## Create new trainable components {#components}
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
</Infobox>
In addition to [swapping out](#swap-architectures) default models in built-in
components, you can also implement an entirely new,
[trainable pipeline component](/usage/processing-pipelines#trainable-components)
from scratch. This can be done by creating a new class inheriting from
[`Pipe`](/api/pipe), and linking it up to your custom model implementation.
<!-- TODO: write trainable component section
- Interaction with `predict`, `get_loss` and `set_annotations`
- Initialization life-cycle with `initialize`, correlation with add_label
Example: relation extraction component (implemented as project template)
Avoid duplication with usage/processing-pipelines#trainable-components ?
-->
### Example: Pipeline component for relation extraction {#component-rel}
<!-- ![Diagram of a pipeline component with its model](../images/layers-architectures.svg)
This section outlines an example use-case of implementing a novel relation
extraction component from scratch. We'll implement a binary relation extraction
method that determines whether or not two entities in a document are related,
and if so, what type of relation. We'll allow multiple types of relations
between two such entities (multi-label setting).
There are two major steps required: first, we need to
[implement a machine learning model](#component-rel-model) specific to this
task, and subsequently we use this model to
[implement a custom pipeline component](#component-rel-pipe).
#### Step 1: Implementing the Model {#component-rel-model}
We need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes a
list of documents as input, and outputs a two-dimensional matrix of predictions:
```python
def update(self, examples):
docs = [ex.predicted for ex in examples]
refs = [ex.reference for ex in examples]
predictions, backprop = self.model.begin_update(docs)
gradient = self.get_loss(predictions, refs)
backprop(gradient)
def __call__(self, doc):
predictions = self.model([doc])
self.set_annotations(predictions)
@registry.architectures.register("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
model = _create_my_model()
return model
```
-->
The first layer in this model will typically be an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This
layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it
transforms each document into a list of tokens, with each token being
represented by its embedding in the vector space.
Next, we need a method that generates pairs of entities that we want to classify
as being related or not. As these candidate pairs are typically formed within
one document, this function takes a `Doc` as input and outputs a `List` of
`Span` tuples. For instance, a very straightforward implementation would be to
just take any two entities from the same document:
```python
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
candidates = []
for ent1 in doc.ents:
for ent2 in doc.ents:
candidates.append((ent1, ent2))
return candidates
```
> ```
> [model]
> @architectures = "rel_model.v1"
>
> [model.tok2vec]
> ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 20
> ```
But we could also refine this further by excluding relations of an entity with
itself, and posing a maximum distance (in number of tokens) between two
entities. We register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.
```python
### {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v2")
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
candidates = []
for ent1 in doc.ents:
for ent2 in doc.ents:
if ent1 != ent2:
if max_length and abs(ent2.start - ent1.start) <= max_length:
candidates.append((ent1, ent2))
return candidates
return get_candidates
```
Finally, we require a method that transforms the candidate entity pairs into a
2D tensor using the specified `Tok2Vec` function. The resulting `Floats2d`
object will then be processed by a final `output_layer` of the network. Putting
all this together, we can define our relation model in a config file as such:
```
[model]
@architectures = "rel_model.v1"
...
[model.tok2vec]
...
[model.get_candidates]
@misc = "rel_cand_generator.v2"
max_length = 20
[model.create_candidate_tensor]
@misc = "rel_cand_tensor.v1"
[model.output_layer]
@architectures = "rel_output_layer.v1"
...
```
<!-- TODO: Link to project for implementation details -->
When creating this model, we store the custom functions as
[attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as
references, so we can access them easily:
```python
tok2vec_layer = model.get_ref("tok2vec")
output_layer = model.get_ref("output_layer")
create_candidate_tensor = model.attrs["create_candidate_tensor"]
get_candidates = model.attrs["get_candidates"]
```
#### Step 2: Implementing the pipeline component {#component-rel-pipe}
To use our new relation extraction model as part of a custom component, we
create a subclass of [`Pipe`](/api/pipe) that holds the model:
```python
from spacy.pipeline import Pipe
class RelationExtractor(Pipe):
def __init__(self, vocab, model, name="rel", labels=[]):
self.model = model
...
def update(self, examples, ...):
...
def predict(self, docs):
...
def set_annotations(self, docs, predictions):
...
```
Before the model can be used, it needs to be
[initialized](/api/pipe#initialize). This function receives either the full
training data set, or a representative sample. This data set can be used to
deduce all relevant labels. Alternatively, a list of labels can be provided, or
a script can call `rel_component.add_label()` directly.
The number of labels defines the output dimensionality of the network, and will
be used to do [shape inference](https://thinc.ai/docs/usage-models#validation)
throughout the layers of the neural network. This is triggered by calling
`model.initialize`.
```python
### {highlight="12,18,22"}
from itertools import islice
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Language = None,
labels: Optional[List[str]] = None,
):
if labels is not None:
for label in labels:
self.add_label(label)
else:
for example in get_examples():
relations = example.reference._.rel
for indices, label_dict in relations.items():
for label in label_dict.keys():
self.add_label(label)
subbatch = list(islice(get_examples(), 10))
doc_sample = [eg.reference for eg in subbatch]
label_sample = self._examples_to_truth(subbatch)
self.model.initialize(X=doc_sample, Y=label_sample)
```
The `initialize` method is triggered whenever this component is part of an `nlp`
pipeline, and [`nlp.initialize()`](/api/language#initialize) is invoked. After
doing so, the pipeline component and its internal model can be trained and used
to make predictions.
During training, the function [`update`](/api/pipe#update) is invoked which
delegates to
[`self.model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and a
[`get_loss`](/api/pipe#get_loss) function that calculate the loss for a batch of
examples, as well as the gradient of loss that will be used to update the
weights of the model layers.
```python
### {highlight="12-14"}
def update(
self,
examples: Iterable[Example],
*,
drop: float = 0.0,
set_annotations: bool = False,
sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
...
docs = [ex.predicted for ex in examples]
predictions, backprop = self.model.begin_update(docs)
loss, gradient = self.get_loss(examples, predictions)
backprop(gradient)
losses[self.name] += loss
...
return losses
```
Thinc provides several [loss functions](https://thinc.ai/docs/api-loss) that can
be used for the implementation of the `get_loss` function.
When the internal model is trained, the component can be used to make novel
predictions. The [`predict`](/api/pipe#predict) function needs to be implemented
for each subclass of `Pipe`. In our case, we can simply delegate to the internal
model's [predict](https://thinc.ai/docs/api-model#predict) function:
```python
def predict(self, docs: Iterable[Doc]) -> Floats2d:
predictions = self.model.predict(docs)
return self.model.ops.asarray(predictions)
```
The final method that needs to be implemented, is
[`set_annotations`](/api/pipe#set_annotations). This function takes the
predictions, and modifies the given `Doc` object in place to store them. For our
relation extraction component, we store the data as a dictionary in a custom
extension attribute `doc._.rel`. As keys, we represent the candidate pair by the
start offsets of each entity, as this defines an entity pair uniquely within one
document.
To interpret the scores predicted by the REL model correctly, we need to refer
to the model's `get_candidates` function that defined which pairs of entities
were relevant candidates, so that the predictions can be linked to those exact
entities:
> #### Example output
>
> ```python
> doc = nlp("Amsterdam is the capital of the Netherlands.")
> print(f"spans: [(e.start, e.text, e.label_) for e in doc.ents]")
> for value, rel_dict in doc._.rel.items():
> print(f"{value}: {rel_dict}")
> ```
> ```
> spans [(0, 'Amsterdam', 'LOC'), (6, 'Netherlands', 'LOC')]
> (0, 6): {'CAPITAL_OF': 0.89, 'LOCATED_IN': 0.75, 'UNRELATED': 0.002}
> (6, 0): {'CAPITAL_OF': 0.01, 'LOCATED_IN': 0.13, 'UNRELATED': 0.017}
> ```
```python
### {highlight="5-6,10"}
def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
c = 0
get_candidates = self.model.attrs["get_candidates"]
for doc in docs:
for (e1, e2) in get_candidates(doc):
offset = (e1.start, e2.start)
if offset not in doc._.rel:
doc._.rel[offset] = {}
for j, label in enumerate(self.labels):
doc._.rel[offset][label] = predictions[c, j]
c += 1
```
Under the hood, when the pipe is applied to a document, it delegates to the
`predict` and `set_annotations` functions:
```python
def __call__(self, Doc doc):
predictions = self.predict([doc])
self.set_annotations([doc], predictions)
return doc
```
Once our `Pipe` subclass is fully implemented, we can
[register](http://localhost:8000/usage/processing-pipelines#custom-components-factories)
the component with the `Language.factory` decorator. This enables the creation
of the component with `nlp.add_pipe`, or via the config.
> ```
>
> [components.relation_extractor]
> factory = "relation_extractor"
> labels = []
>
> [components.relation_extractor.model]
> @architectures = "rel_model.v1"
> ...
> ```
```python
from spacy.language import Language
@Language.factory("relation_extractor")
def make_relation_extractor(nlp, name, model, labels):
return RelationExtractor(nlp.vocab, model, name, labels=labels)
```
<!-- TODO: refer once more to example project -->
<!-- ![Diagram of a pipeline component with its model](../images/layers-architectures.svg) -->

View File

@ -1176,7 +1176,7 @@ plug fully custom machine learning components into your pipeline. You'll need
the following:
1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
can be a model using implemented in
can be a model implemented in
[Thinc](/usage/layers-architectures#thinc), or a
[wrapped model](/usage/layers-architectures#frameworks) implemented in
PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a