slight rewrite to hide some thinc implementation details

svlandeg 2020-10-04 13:26:46 +02:00
parent 08ad349a18
commit 452b8309f9


@@ -373,7 +373,7 @@ gpu_allocator = "pytorch"
Of course it's also possible to define the `Model` from the previous section
entirely in Thinc. The Thinc documentation provides details on the
[various layers](https://thinc.ai/docs/api-layers) and helper functions
available. Combinators can be used to
[overload operators](https://thinc.ai/docs/usage-models#operators) and a common
usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our
simple neural network would then become:
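
The code block that follows in the docs is outside this hunk. As a rough sketch
(the layer sizes and choice of layers here are assumptions, not the exact
snippet), such a Thinc-native network could be written as:

```python
from thinc.api import Model, Relu, Softmax, chain

n_hidden = 32
dropout = 0.2

# Bind `chain` to the `>>` operator so layers can be composed inline.
with Model.define_operators({">>": chain}):
    model = (
        Relu(nO=n_hidden, dropout=dropout)
        >> Relu(nO=n_hidden, dropout=dropout)
        >> Softmax()
    )
```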
@@ -494,13 +494,34 @@ from scratch. This can be done by creating a new class inheriting from
### Example: Pipeline component for relation extraction {#component-rel}

This section outlines an example use-case of implementing a novel relation
extraction component from scratch. We assume we want to implement a binary
relation extraction method that determines whether two entities in a document
are related or not, and if so, with what type of relation. We'll allow multiple
types of relations between two such entities - i.e. it is a multi-label setting.
We'll need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes
a list of documents as input, and outputs a two-dimensional matrix of scores:
```python
@registry.architectures.register("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
    model = _create_my_model()
    return model
```
The first layer in this model will typically be an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
transforms each document into a list of tokens, with each token being
represented by its embedding in the vector space.
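
`_create_my_model` is left unspecified above. Purely to make the declared types
concrete, here is a hypothetical, deliberately simplified filler: it pools each
document into a single vector and scores it, which satisfies
`Model[List[Doc], Floats2d]` but does not yet implement the per-pair scoring
described in the rest of this section. All layer choices here are assumptions,
not the actual implementation.

```python
from typing import List

from spacy.tokens import Doc
from thinc.api import Linear, Model, chain, list2ragged, reduce_mean
from thinc.types import Floats2d


def _create_my_model(
    tok2vec: Model[List[Doc], List[Floats2d]], n_labels: int
) -> Model[List[Doc], Floats2d]:
    # Embed each Doc into per-token vectors, pool them into a single vector per
    # Doc, and map that to one score per relation label. A real relation model
    # would instead produce one row of scores per candidate entity pair.
    model = chain(tok2vec, list2ragged(), reduce_mean(), Linear(nO=n_labels))
    model.set_ref("tok2vec", tok2vec)
    return model
```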
Next, we need a method that will
generate pairs of entities that we want to classify as being related or not.
These candidate pairs are typically formed within one document, which means
we'll have a function that takes a `Doc` as input and outputs a `List` of `Span`
tuples. For instance, a very straightforward implementation
would be to just take any two entities from the same document:
```python
@@ -512,18 +533,24 @@ def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
    return candidates
```
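
The body of `get_candidates` falls outside the hunk above. For readability, a
straightforward version matching the signature and the description ("just take
any two entities from the same document") might look like this; it is a sketch,
not necessarily the exact snippet from the docs:

```python
from typing import List, Tuple

from spacy.tokens import Doc, Span


def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
    # Pair every entity in the document with every entity, including itself.
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates
```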
But we could also refine this further by excluding relations of an entity with
itself, and imposing a maximum distance (in number of tokens) between two
entities. We'll also register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.
> ```
> [model]
> @architectures = "rel_model.v1"
>
> [model.tok2vec]
> ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
> ```
```python
### {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v2")
@@ -539,32 +566,33 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span
    return get_candidates
```
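
The middle of `create_candidate_indices` is likewise hidden by the hunk
boundary. A plausible reconstruction, filtering out self-relations and applying
the `max_length` cutoff described above, could look like this (the exact
filtering logic is an assumption):

```python
from typing import Callable, List, Tuple

from spacy.tokens import Doc, Span
from spacy.util import registry


@registry.misc.register("rel_cand_generator.v2")
def create_candidate_indices(
    max_length: int,
) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                # Skip self-relations and pairs that are too far apart.
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates
```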
Finally, we'll require a method that transforms the candidate pairs of entities
into a 2D tensor using the specified `Tok2Vec` layer, and this `Floats2d` object
will then be processed by a final `output_layer` of the network. Taking all this
together, we can define our relation model like this in the config:
> ```
> [model]
> @architectures = "rel_model.v1"
> nO = null
>
> [model.tok2vec]
> ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
>
> [components.relation_extractor.model.create_candidate_tensor]
> @misc = "rel_cand_tensor.v1"
>
> [components.relation_extractor.model.output_layer]
> @architectures = "rel_output_layer.v1"
> nI = null
> nO = null
> ```
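
The architectures referenced above, `rel_cand_tensor.v1` and
`rel_output_layer.v1`, are not spelled out in this diff. Purely as an
illustration, an output layer registered under such a name could be as simple
as a linear layer followed by a logistic activation, giving one independent
probability per relation label (the choice of layers is an assumption):

```python
from typing import Optional

from spacy.util import registry
from thinc.api import Linear, Logistic, Model, chain


@registry.architectures.register("rel_output_layer.v1")
def create_output_layer(nO: Optional[int] = None, nI: Optional[int] = None) -> Model:
    # One score per relation label, squashed to [0, 1] for the multi-label case.
    return chain(Linear(nO=nO, nI=nI), Logistic())
```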
<!-- Link to project for implementation details -->