REL intro and get_candidates function

2025-08-08 22:24:55 +03:00 · 2020-10-03 23:27:05 +02:00 · 2020-10-03 23:27:05 +02:00 · 2c4b2ee5e9
commit 2c4b2ee5e9
parent df06f7a792
2 changed files with 55 additions and 1 deletions
--- a/website/docs/usage/layers-architectures.md
+++ b/website/docs/usage/layers-architectures.md
@ -486,6 +486,60 @@ with Model.define_operators({">>": chain}):

 ## Create new trainable components {#components}

+In addition to [swapping out](#swap-architectures) default models in built-in
+components, you can also implement an entirely new,
+[trainable pipeline component](usage/processing-pipelines#trainable-components)
+from scratch. This can be done by creating a new class inheriting from [`Pipe`](/api/pipe), 
+and linking it up to your custom model implementation.
+
+### Example: Pipeline component for relation extraction {#component-rel}
+
+This section will run through an example of implementing a novel relation extraction 
+component from scratch. As a first step, we need a method that will generate pairs of
+entities that we want to classify as being related or not. These candidate pairs are 
+typically formed within one document, which means we'll have a function that takes a 
+`Doc` as input and outputs a `List` of `Span` tuples. In this example, we will focus 
+on binary relation extraction, i.e. the tuple will be of length 2.
+
+We register this function in the 'misc' register so we can easily refer to it from the config, 
+and allow swapping it out for any candidate 
+generation function. For instance, a very straightforward implementation would be to just 
+take any two entities from the same document:
+
+```python
+@registry.misc.register("rel_cand_generator.v1")
+def create_candidate_indices() -> Callable[[Doc], List[Tuple[Span, Span]]]:
+    def get_candidate_indices(doc: "Doc"):
+        indices = []
+        for ent1 in doc.ents:
+            for ent2 in doc.ents:
+                indices.append((ent1, ent2))
+        return indices
+    return get_candidate_indices
+```
+
+But we could also refine this further by excluding relations of an entity with itself, 
+and posing a maximum distance (in number of tokens) between two entities:
+
+```python
+### {highlight="1,2,7,8"}
+@registry.misc.register("rel_cand_generator.v2")
+def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
+    def get_candidate_indices(doc: "Doc"):
+        indices = []
+        for ent1 in doc.ents:
+            for ent2 in doc.ents:
+                if ent1 != ent2:
+                    if max_length and abs(ent2.start - ent1.start) <= max_length:
+                        indices.append((ent1, ent2))
+        return indices
+    return get_candidate_indices
+```
+
+
+
+
+
 <Infobox title="This section is still under construction" emoji="🚧" variant="warning">
 </Infobox>

--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@ -1035,7 +1035,7 @@ plug fully custom machine learning components into your pipeline. You'll need
 the following:

 1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
-   can be a model using implemented in
+   can be a model implemented in
   [Thinc](/usage/layers-architectures#thinc), or a
   [wrapped model](/usage/layers-architectures#frameworks) implemented in
   PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a