mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-31 07:57:35 +03:00 
			
		
		
		
	slight rewrite to hide some thinc implementation details
This commit is contained in:
		
							parent
							
								
									08ad349a18
								
							
						
					
					
						commit
						452b8309f9
					
				|  | @ -373,7 +373,7 @@ gpu_allocator = "pytorch" | |||
| Of course it's also possible to define the `Model` from the previous section | ||||
| entirely in Thinc. The Thinc documentation provides details on the | ||||
| [various layers](https://thinc.ai/docs/api-layers) and helper functions | ||||
| available. Combinators can also be used to | ||||
| available. Combinators can be used to | ||||
| [overload operators](https://thinc.ai/docs/usage-models#operators) and a common | ||||
| usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our | ||||
| simple neural network would then become: | ||||
|  | @ -494,13 +494,34 @@ from scratch. This can be done by creating a new class inheriting from | |||
| 
 | ||||
| ### Example: Pipeline component for relation extraction {#component-rel} | ||||
| 
 | ||||
| This section will run through an example of implementing a novel relation | ||||
| extraction component from scratch. As a first step, we need a method that will | ||||
| This section outlines an example use-case of implementing a novel relation | ||||
| extraction component from scratch. We assume we want to implement a binary  | ||||
| relation extraction method that determines whether two entities in a document  | ||||
| are related or not, and if so, with what type of relation. We'll allow multiple  | ||||
| types of relations between two such entities - i.e. it is a multi-label setting. | ||||
| 
 | ||||
| We'll need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes  | ||||
| a list of documents as input, and outputs a two-dimensional matrix of scores: | ||||
| 
 | ||||
| ```python | ||||
| @registry.architectures.register("rel_model.v1") | ||||
| def create_relation_model(...) -> Model[List[Doc], Floats2d]: | ||||
|     model = _create_my_model() | ||||
|     return model | ||||
| ``` | ||||
| 
 | ||||
| The first layer in this model will typically be an | ||||
| [embedding layer](/usage/embeddings-transformers) such as a | ||||
| [`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This | ||||
| layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it | ||||
| transforms each document into a list of tokens, with each token being  | ||||
| represented by its embedding in the vector space. | ||||
| 
 | ||||
| Next, we need a method that will | ||||
| generate pairs of entities that we want to classify as being related or not. | ||||
| These candidate pairs are typically formed within one document, which means | ||||
| we'll have a function that takes a `Doc` as input and outputs a `List` of `Span` | ||||
| tuples. In this example, we will focus on binary relation extraction, i.e. the | ||||
| tuple will be of length 2. For instance, a very straightforward implementation | ||||
| tuples. For instance, a very straightforward implementation | ||||
| would be to just take any two entities from the same document: | ||||
| 
 | ||||
| ```python | ||||
|  | @ -512,18 +533,24 @@ def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]: | |||
|     return candidates | ||||
| ``` | ||||
| 
 | ||||
| But we could also refine this further by excluding relations of an entity with | ||||
| itself, and posing a maximum distance (in number of tokens) between two | ||||
| entities. We'll also register this function in the | ||||
| [`@misc` registry](/api/top-level#registry) so we can refer to it from the | ||||
| config, and easily swap it out for any other candidate generation function. | ||||
| 
 | ||||
| > ``` | ||||
| > [get_candidates] | ||||
| > [model] | ||||
| > @architectures = "rel_model.v1" | ||||
| >  | ||||
| > [model.tok2vec] | ||||
| > ... | ||||
| >  | ||||
| > [model.get_candidates] | ||||
| > @misc = "rel_cand_generator.v2" | ||||
| > max_length = 6 | ||||
| > ``` | ||||
| 
 | ||||
| But we could also refine this further by excluding relations of an entity with | ||||
| itself, and posing a maximum distance (in number of tokens) between two | ||||
| entities. We'll register this function in the | ||||
| [`@misc` registry](/api/top-level#registry) so we can refer to it from the | ||||
| config, and easily swap it out for any other candidate generation function. | ||||
| 
 | ||||
| ```python | ||||
| ### {highlight="1,2,7,8"} | ||||
| @registry.misc.register("rel_cand_generator.v2") | ||||
|  | @ -539,32 +566,33 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span | |||
|     return get_candidates | ||||
| ``` | ||||
| 
 | ||||
| Finally, we'll require a method that transforms the candidate pairs of entities into  | ||||
| a 2D tensor using the specified Tok2Vec function, and this `Floats2d` object will then be  | ||||
| processed by a final `output_layer` of the network. Taking all this together, we can define  | ||||
| our relation model like this in the config: | ||||
| 
 | ||||
| > ``` | ||||
| > [tok2vec] | ||||
| > @architectures = "spacy.HashEmbedCNN.v1" | ||||
| > pretrained_vectors = null | ||||
| > width = 96 | ||||
| > depth = 2 | ||||
| > embed_size = 300 | ||||
| > window_size = 1 | ||||
| > maxout_pieces = 3 | ||||
| > subword_features = true | ||||
| > [model] | ||||
| > @architectures = "rel_model.v1" | ||||
| > nO = null | ||||
| >  | ||||
| > [model.tok2vec] | ||||
| > ... | ||||
| >  | ||||
| > [model.get_candidates] | ||||
| > @misc = "rel_cand_generator.v2" | ||||
| > max_length = 6 | ||||
| >  | ||||
| > [components.relation_extractor.model.create_candidate_tensor] | ||||
| > @misc = "rel_cand_tensor.v1" | ||||
| >  | ||||
| > [components.relation_extractor.model.output_layer] | ||||
| > @architectures = "rel_output_layer.v1" | ||||
| > nI = null | ||||
| > nO = null | ||||
| > ``` | ||||
| 
 | ||||
| Next, we'll assume we have access to an | ||||
| [embedding layer](/usage/embeddings-transformers) such as a | ||||
| [`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This | ||||
| layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it | ||||
| transforms a list of documents into a list of 2D vectors. Further, this | ||||
| `tok2vec` component will be trainable, which means that, following the Thinc | ||||
| paradigm, we'll apply it to some input, and receive the predicted results as | ||||
| well as a callback to perform backpropagation: | ||||
| 
 | ||||
| ```python | ||||
| tok2vec = model.get_ref("tok2vec") | ||||
| tokvecs, bp_tokvecs = tok2vec(docs, is_train=True) | ||||
| ``` | ||||
| 
 | ||||
| <!-- Link to project for implementation details --> | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
|  |  | |||
		Loading…
	
		Reference in New Issue
	
	Block a user