small fixes and formatting

svlandeg 2020-11-20 15:55:05 +01:00
parent 124f49feb6
commit 4a3e611abc


@ -502,7 +502,7 @@ with Model.define_operators({">>": chain}):
## Create new trainable components {#components} ## Create new trainable components {#components}
In addition to [swapping out](#swap-architectures) default models in built-in In addition to [swapping out](#swap-architectures) layers in existing
components, you can also implement an entirely new, components, you can also implement an entirely new,
[trainable](/usage/processing-pipelines#trainable-components) pipeline component [trainable](/usage/processing-pipelines#trainable-components) pipeline component
from scratch. This can be done by creating a new class inheriting from from scratch. This can be done by creating a new class inheriting from
@ -523,25 +523,27 @@ overview of the `TrainablePipe` methods used by
This section outlines an example use-case of implementing a **novel relation This section outlines an example use-case of implementing a **novel relation
extraction component** from scratch. We'll implement a binary relation extraction component** from scratch. We'll implement a binary relation
extraction method that determines whether or not **two entities** in a document extraction method that determines whether or not **two entities** in a document
are related, and if so, what type of relation. We'll allow multiple types of are related, and if so, what type of relation connects them. We allow multiple
relations between two such entities (multi-label setting). There are two major types of relations between two such entities (a multi-label setting). There are
steps required: two major steps required:
1. Implement a [machine learning model](#component-rel-model) specific to this 1. Implement a [machine learning model](#component-rel-model) specific to this
task. It will have to extract candidates from a [`Doc`](/api/doc) and predict task. It will have to extract candidate relation instances from a
a relation for the available candidate pairs. [`Doc`](/api/doc) and predict the corresponding scores for each relation
2. Implement a custom [pipeline component](#component-rel-pipe) powered by the label.
machine learning model that sets annotations on the [`Doc`](/api/doc) passing 2. Implement a custom [pipeline component](#component-rel-pipe) - powered by the
through the pipeline. machine learning model from step 1 - that translates the predicted scores
into annotations that are stored on the [`Doc`](/api/doc) objects as they
pass through the `nlp` pipeline.
<Project id="tutorials/rel_component"> <Project id="tutorials/rel_component">
Run this example use-case by using our project template. It includes all the Run this example use-case by using our project template. It includes all the
code to create the ML model and the pipeline component from scratch. code to create the ML model and the pipeline component from scratch.
It contains two config files to train the model: It also contains two config files to train the model:
one to run on CPU with a Tok2Vec layer, and one for the GPU using a transformer. one to run on CPU with a Tok2Vec layer, and one for the GPU using a transformer.
The project applies the relation extraction component to identify biomolecular The project applies the relation extraction component to identify biomolecular
interactions, but you can easily swap in your own dataset for your experiments interactions in a sample dataset, but you can easily swap in your own dataset
in any other domain. for your experiments in any other domain.
</Project> </Project>
#### Step 1: Implementing the Model {#component-rel-model} #### Step 1: Implementing the Model {#component-rel-model}
@ -558,18 +560,17 @@ matrix** (~~Floats2d~~) of predictions:
> for details. > for details.
```python ```python
### Register the model architecture ### The model architecture
@spacy.registry.architectures.register("rel_model.v1") @spacy.registry.architectures.register("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]: def create_relation_model(...) -> Model[List[Doc], Floats2d]:
model = ... # 👈 model will go here model = ... # 👈 model will go here
return model return model
``` ```
We will adopt a **modular approach** to the definition of this relation model, and We will adopt a **modular approach** to the definition of this relation model,
define it as chaining to layers together: the first layer that generates an and define it as chaining two layers together: the first layer that generates an
instance tensor from a given set of documents, and the second layer that instance tensor from a given set of documents, and the second layer that
transforms this tensor into a final tensor holding the predictions: transforms the instance tensor into a final tensor holding the predictions.
> #### config.cfg (excerpt) > #### config.cfg (excerpt)
> >
@ -581,11 +582,11 @@ transforms this tensor into a final tensor holding the predictions:
> # ... > # ...
> >
> [model.classification_layer] > [model.classification_layer]
> ... > # ...
> ``` > ```
```python ```python
### Implement the model architecture ### The model architecture
@spacy.registry.architectures.register("rel_model.v1") @spacy.registry.architectures.register("rel_model.v1")
def create_relation_model( def create_relation_model(
create_instance_tensor: Model[List[Doc], Floats2d], create_instance_tensor: Model[List[Doc], Floats2d],
@ -595,9 +596,8 @@ def create_relation_model(
return model return model
``` ```
The `classification_layer` could be something simple like a Linear layer The `classification_layer` could be something like a Linear layer followed by a
followed by a logistic activation function: logistic activation function:
> #### config.cfg (excerpt) > #### config.cfg (excerpt)
> >
@ -609,7 +609,7 @@ followed by a logistic activation function:
> ``` > ```
```python ```python
### Implement the classification layer ### The classification layer
@spacy.registry.architectures.register("rel_classification_layer.v1") @spacy.registry.architectures.register("rel_classification_layer.v1")
def create_classification_layer( def create_classification_layer(
nO: int = None, nI: int = None nO: int = None, nI: int = None
@ -617,60 +617,16 @@ def create_classification_layer(
return chain(Linear(nO=nO, nI=nI), Logistic()) return chain(Linear(nO=nO, nI=nI), Logistic())
``` ```
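To make the multi-label behaviour of such a layer concrete, here is a minimal plain-Python sketch of what a linear transformation followed by a logistic activation computes for a single instance vector. It does not use Thinc; the function names `logistic` and `classify`, as well as the weights and inputs, are hypothetical and chosen for illustration only.

```python
import math
from typing import List


def logistic(x: float) -> float:
    # Squash a raw score into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-x))


def classify(
    instance: List[float], weights: List[List[float]], bias: List[float]
) -> List[float]:
    # One row of weights (plus one bias) per relation label. Each label is
    # scored independently, so several labels can score high for one instance.
    scores = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, instance)) + b
        scores.append(logistic(z))
    return scores


# Hypothetical toy parameters: 2 input features, 2 relation labels.
weights = [[1.0, -1.0], [0.0, 0.0]]
bias = [0.0, 0.0]
scores = classify([2.0, 2.0], weights, bias)
```

Because the logistic function is applied per label rather than normalizing across labels, the scores do not have to sum to one, which is what a multi-label setting requires.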
The first layer that **creates the instance tensor** can be defined The first layer that **creates the instance tensor** can be defined by
by implementing a implementing a
[custom forward function](https://thinc.ai/docs/usage-models#weights-layers-forward) [custom forward function](https://thinc.ai/docs/usage-models#weights-layers-forward)
with an appropriate backpropagation callback. We also define an with an appropriate backpropagation callback. We also define an
[initialization method](https://thinc.ai/docs/usage-models#weights-layers-init) [initialization method](https://thinc.ai/docs/usage-models#weights-layers-init)
that ensures that the layer is properly set up for training. that ensures that the layer is properly set up for training.
```python We omit some of the implementation details here, and refer to the spaCy project
### Implement the custom forward function that has the full implementation
def instance_forward( [here](https://github.com/explosion/projects/tree/v3/tutorials/rel_component).
model: Model[List[Doc], Floats2d],
docs: List[Doc],
is_train: bool
) -> Tuple[Floats2d, Callable]:
...
tok2vec = model.get_ref("tok2vec")
tokvecs, bp_tokvecs = tok2vec(docs, is_train)
relations = ...
def backprop(d_relations: Floats2d) -> List[Doc]:
d_tokvecs = ...
return bp_tokvecs(d_tokvecs)
return relations, backprop
### Implement the custom initialization method
def instance_init(
model: Model,
X: List[Doc] = None,
Y: Floats2d = None
) -> Model:
tok2vec = model.get_ref("tok2vec")
tok2vec.initialize(X)
return model
### Implement the layer that creates the instance tensor
@spacy.registry.architectures.register("rel_instance_tensor.v1")
def create_tensors(
tok2vec: Model[List[Doc], List[Floats2d]],
pooling: Model[Ragged, Floats2d],
get_instances: Callable[[Doc], List[Tuple[Span, Span]]],
) -> Model[List[Doc], Floats2d]:
return Model(
"instance_tensors",
instance_forward,
layers=[tok2vec, pooling],
refs={"tok2vec": tok2vec, "pooling": pooling},
attrs={"get_instances": get_instances},
init=instance_init,
)
```
> #### config.cfg (excerpt) > #### config.cfg (excerpt)
> >
@ -688,19 +644,69 @@ def create_tensors(
> [model.create_instance_tensor.get_instances] > [model.create_instance_tensor.get_instances]
> ... > ...
> ` > `
> ```
This custom layer uses an ```python
**[embedding layer](/usage/embeddings-transformers)** such as a ### The layer that creates the instance tensor
[`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This @spacy.registry.architectures.register("rel_instance_tensor.v1")
layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it def create_tensors(
tok2vec: Model[List[Doc], List[Floats2d]],
pooling: Model[Ragged, Floats2d],
get_instances: Callable[[Doc], List[Tuple[Span, Span]]],
) -> Model[List[Doc], Floats2d]:
return Model(
"instance_tensors",
instance_forward,
layers=[tok2vec, pooling],
refs={"tok2vec": tok2vec, "pooling": pooling},
attrs={"get_instances": get_instances},
init=instance_init,
)
### The custom forward function
def instance_forward(
model: Model[List[Doc], Floats2d],
docs: List[Doc],
is_train: bool,
) -> Tuple[Floats2d, Callable]:
# ...
tok2vec = model.get_ref("tok2vec")
tokvecs, bp_tokvecs = tok2vec(docs, is_train)
relations = ...
def backprop(d_relations: Floats2d) -> List[Doc]:
d_tokvecs = ...
return bp_tokvecs(d_tokvecs)
return relations, backprop
### The custom initialization method
def instance_init(
model: Model,
X: List[Doc] = None,
Y: Floats2d = None,
) -> Model:
tok2vec = model.get_ref("tok2vec")
tok2vec.initialize(X)
return model
```
This custom layer uses an [embedding layer](/usage/embeddings-transformers) such
as a [`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer).
This layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it
transforms each **document into a list of tokens**, with each token being transforms each **document into a list of tokens**, with each token being
represented by its embedding in the vector space. represented by its embedding in the vector space.
The **`pooling`** layer will be applied to summarize the token vectors into entity The `pooling` layer will be applied to summarize the token vectors into **entity
vectors, as named entities (represented by `Span` objects) can consist of one vectors**, as named entities (represented by ~~Span~~ objects) can consist of
or multiple tokens. For instance, the pooling layer could resort to calculating one or multiple tokens. For instance, the pooling layer could resort to
the average of all token vectors in an entity. Thinc provides several calculating the average of all token vectors in an entity. Thinc provides
[built-in pooling operators](https://thinc.ai/docs/api-layers#reduction-ops) for several
[built-in pooling operators](https://thinc.ai/docs/api-layers#reduction-ops) for
this purpose. this purpose.
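As an illustration of the averaging strategy described above, the following plain-Python sketch pools the token vectors of one entity into a single entity vector. It is not the Thinc implementation: the helper `mean_pool` and the toy vectors are assumptions made for this example, with an entity represented only by its token offsets.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], start: int, end: int) -> List[float]:
    # Average the vectors of the tokens in [start, end), dimension by dimension.
    span = token_vectors[start:end]
    if not span:
        raise ValueError("Cannot pool an empty span")
    width = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(width)]


# Hypothetical document of three tokens with 2-dimensional embeddings.
tokvecs = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# An entity covering the second and third token.
entity_vector = mean_pool(tokvecs, 1, 3)
```

An entity that consists of a single token simply keeps that token's vector, so single-token and multi-token entities end up with vectors of the same width.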
> #### config.cfg (excerpt) > #### config.cfg (excerpt)
@ -712,15 +718,15 @@ this purpose.
> max_length = 100 > max_length = 100
> ``` > ```
Finally, we need a `get_instances` method that **generates pairs of entities** Finally, we need a `get_instances` method that **generates pairs of entities**
that we want to classify as being related or not. As these candidate pairs are typically formed that we want to classify as being related or not. As these candidate pairs are
within one document, this function takes a [`Doc`](/api/doc) as input and typically formed within one document, this function takes a [`Doc`](/api/doc) as
outputs a `List` of `Span` tuples. For instance, this input and outputs a `List` of `Span` tuples. For instance, the following
implementation takes any two entities from the same document, as long as they implementation takes any two entities from the same document, as long as they
are within a **maximum distance** (in number of tokens) of each other: are within a **maximum distance** (in number of tokens) of each other:
```python ```python
### Simple candidate generation ### Candidate generation
@spacy.registry.misc.register("rel_instance_generator.v1") @spacy.registry.misc.register("rel_instance_generator.v1")
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]: def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]: def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
@ -733,10 +739,10 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span
return candidates return candidates
return get_candidates return get_candidates
``` ```
This function is added to the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.
This function is added to the [`@misc` registry](/api/top-level#registry) so we
can refer to it from the config, and easily swap it out for any other candidate
generation function.
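The distance-based filtering can be sketched without spaCy as follows. This is only an approximation of the idea, not the project's code: entities are represented as hypothetical `(start, end)` token offsets instead of `Span` objects, and measuring the distance between the start offsets is an assumption of this example.

```python
from typing import List, Tuple

# Stand-in for a Span: (start token index, end token index).
Entity = Tuple[int, int]


def get_candidates(
    entities: List[Entity], max_length: int
) -> List[Tuple[Entity, Entity]]:
    # Pair every entity with every other entity from the same document,
    # keeping only the pairs whose start offsets are close enough.
    candidates = []
    for ent1 in entities:
        for ent2 in entities:
            if ent1 != ent2 and abs(ent2[0] - ent1[0]) <= max_length:
                candidates.append((ent1, ent2))
    return candidates


# Hypothetical document with three entities; the third one is far away.
entities = [(0, 2), (5, 6), (40, 42)]
pairs = get_candidates(entities, max_length=10)
```

Note that both orderings of a pair are generated here, which suits directed relations; for symmetric relations one ordering per pair would be enough.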
When creating this model, we store the custom functions as When creating this model, we store the custom functions as
[attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as [attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as
@ -851,13 +857,13 @@ def update(
sgd: Optional[Optimizer] = None, sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None, losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]: ) -> Dict[str, float]:
... # ...
docs = [ex.predicted for ex in examples] docs = [ex.predicted for ex in examples]
predictions, backprop = self.model.begin_update(docs) predictions, backprop = self.model.begin_update(docs)
loss, gradient = self.get_loss(examples, predictions) loss, gradient = self.get_loss(examples, predictions)
backprop(gradient) backprop(gradient)
losses[self.name] += loss losses[self.name] += loss
... # ...
return losses return losses
``` ```
@ -933,9 +939,9 @@ def __call__(self, Doc doc):
return doc return doc
``` ```
There is one more optional method to implement: [`score`](/api/pipe#score) There is one more optional method to implement: [`score`](/api/pipe#score)
calculates the performance of your component on a set of examples, and calculates the performance of your component on a set of examples, and returns
returns the results as a dictionary: the results as a dictionary:
```python ```python
### The score method ### The score method
@ -951,8 +957,8 @@ def score(self, examples: Iterable[Example]) -> Dict[str, Any]:
} }
``` ```
This is particularly useful to see the scores on the development corpus This is particularly useful to see the scores on the development corpus when
when training the component with [`spacy train`](/api/cli#training). training the component with [`spacy train`](/api/cli#training).
Once our `TrainablePipe` subclass is fully implemented, we can Once our `TrainablePipe` subclass is fully implemented, we can
[register](/usage/processing-pipelines#custom-components-factories) the [register](/usage/processing-pipelines#custom-components-factories) the
@ -976,7 +982,7 @@ assigns it a name and lets you create the component with
> [components.relation_extractor.model.get_candidates] > [components.relation_extractor.model.get_candidates]
> @misc = "rel_cand_generator.v1" > @misc = "rel_cand_generator.v1"
> max_length = 20 > max_length = 20
> >
> [training.score_weights] > [training.score_weights]
> rel_micro_p = 0.0 > rel_micro_p = 0.0
> rel_micro_r = 0.0 > rel_micro_r = 0.0
@ -992,8 +998,8 @@ def make_relation_extractor(nlp, name, model):
return RelationExtractor(nlp.vocab, model, name) return RelationExtractor(nlp.vocab, model, name)
``` ```
You can extend the decorator to include information such as the type of You can extend the decorator to include information such as the type of
annotations that are required for this component to run, the type of annotations annotations that are required for this component to run, the type of annotations
it produces, and the scores that can be calculated: it produces, and the scores that can be calculated:
```python ```python