slight rewrite to hide some thinc implementation details

svlandeg 2020-10-04 13:26:46 +02:00
parent 08ad349a18
commit 452b8309f9


@@ -373,7 +373,7 @@ gpu_allocator = "pytorch"
Of course it's also possible to define the `Model` from the previous section
entirely in Thinc. The Thinc documentation provides details on the
[various layers](https://thinc.ai/docs/api-layers) and helper functions
available. Combinators can be used to
[overload operators](https://thinc.ai/docs/usage-models#operators) and a common
usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our
simple neural network would then become:
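
The code block that follows in the docs is outside this hunk. As a rough sketch
(the layer sizes and choice of layers here are assumptions, not the exact
snippet), such a Thinc-native network could be written as:

```python
from thinc.api import Model, Relu, Softmax, chain

n_hidden = 32
dropout = 0.2

# Bind `chain` to the `>>` operator so layers can be composed inline.
with Model.define_operators({">>": chain}):
    model = (
        Relu(nO=n_hidden, dropout=dropout)
        >> Relu(nO=n_hidden, dropout=dropout)
        >> Softmax()
    )
```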
@@ -494,13 +494,34 @@ from scratch. This can be done by creating a new class inheriting from
### Example: Pipeline component for relation extraction {#component-rel}

This section outlines an example use-case of implementing a novel relation
extraction component from scratch. We assume we want to implement a binary
relation extraction method that determines whether two entities in a document
are related or not, and if so, with what type of relation. We'll allow multiple
types of relations between two such entities - i.e. it is a multi-label setting.
We'll need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes
a list of documents as input, and outputs a two-dimensional matrix of scores:
```python
@registry.architectures.register("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
    model = _create_my_model()
    return model
```
The first layer in this model will typically be an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
transforms each document into a list of tokens, with each token being
represented by its embedding in the vector space.
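
`_create_my_model` is left unspecified above. Purely to make the declared types
concrete, here is a hypothetical, deliberately simplified filler: it pools each
document into a single vector and scores it, which satisfies
`Model[List[Doc], Floats2d]` but does not yet implement the per-pair scoring
described in the rest of this section. All layer choices here are assumptions,
not the actual implementation.

```python
from typing import List

from spacy.tokens import Doc
from thinc.api import Linear, Model, chain, list2ragged, reduce_mean
from thinc.types import Floats2d


def _create_my_model(
    tok2vec: Model[List[Doc], List[Floats2d]], n_labels: int
) -> Model[List[Doc], Floats2d]:
    # Embed each Doc into per-token vectors, pool them into a single vector per
    # Doc, and map that to one score per relation label. A real relation model
    # would instead produce one row of scores per candidate entity pair.
    model = chain(tok2vec, list2ragged(), reduce_mean(), Linear(nO=n_labels))
    model.set_ref("tok2vec", tok2vec)
    return model
```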
Next, we need a method that will
generate pairs of entities that we want to classify as being related or not.
These candidate pairs are typically formed within one document, which means
we'll have a function that takes a `Doc` as input and outputs a `List` of `Span`
tuples. For instance, a very straightforward implementation
would be to just take any two entities from the same document:
```python
@@ -512,18 +533,24 @@ def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
    return candidates
```
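
The body of `get_candidates` falls outside the hunk above. For readability, a
straightforward version matching the signature and the description ("just take
any two entities from the same document") might look like this; it is a sketch,
not necessarily the exact snippet from the docs:

```python
from typing import List, Tuple

from spacy.tokens import Doc, Span


def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
    # Pair every entity in the document with every entity, including itself.
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates
```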
But we could also refine this further by excluding relations of an entity with
itself, and imposing a maximum distance (in number of tokens) between two
entities. We'll also register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.
> ```
> [model]
> @architectures = "rel_model.v1"
>
> [model.tok2vec]
> ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
> ```
```python
### {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v2")
@@ -539,32 +566,33 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span
    return get_candidates
```
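
The middle of `create_candidate_indices` is likewise hidden by the hunk
boundary. A plausible reconstruction, filtering out self-relations and applying
the `max_length` cutoff described above, could look like this (the exact
filtering logic is an assumption):

```python
from typing import Callable, List, Tuple

from spacy.tokens import Doc, Span
from spacy.util import registry


@registry.misc.register("rel_cand_generator.v2")
def create_candidate_indices(
    max_length: int,
) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                # Skip self-relations and pairs that are too far apart.
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates
```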
Finally, we'll require a method that transforms the candidate pairs of entities
into a 2D tensor using the specified `Tok2Vec` layer, and this `Floats2d` object
will then be processed by a final `output_layer` of the network. Taking all this
together, we can define our relation model like this in the config:
> ```
> [model]
> @architectures = "rel_model.v1"
> nO = null
>
> [model.tok2vec]
> ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
>
> [components.relation_extractor.model.create_candidate_tensor]
> @misc = "rel_cand_tensor.v1"
>
> [components.relation_extractor.model.output_layer]
> @architectures = "rel_output_layer.v1"
> nI = null
> nO = null
> ```
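
The architectures referenced above, `rel_cand_tensor.v1` and
`rel_output_layer.v1`, are not spelled out in this diff. Purely as an
illustration, an output layer registered under such a name could be as simple
as a linear layer followed by a logistic activation, giving one independent
probability per relation label (the choice of layers is an assumption):

```python
from typing import Optional

from spacy.util import registry
from thinc.api import Linear, Logistic, Model, chain


@registry.architectures.register("rel_output_layer.v1")
def create_output_layer(nO: Optional[int] = None, nI: Optional[int] = None) -> Model:
    # One score per relation label, squashed to [0, 1] for the multi-label case.
    return chain(Linear(nO=nO, nI=nI), Logistic())
```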
<!-- Link to project for implementation details -->