mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-26 13:41:21 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			51 lines
		
	
	
		
			2.8 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			51 lines
		
	
	
		
			2.8 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| //- 💫 DOCS > USAGE > SPACY 101 > TRAINING
 | ||
| 
 | ||
| p
 | ||
|     |  spaCy's models are #[strong statistical] and every "decision" they make –
 | ||
|     |  for example, which part-of-speech tag to assign, or whether a word is a
 | ||
|     |  named entity – is a #[strong prediction]. This prediction is based
 | ||
|     |  on the examples the model has seen during #[strong training]. To train
 | ||
|     |  a model, you first need training data – examples of text, and the
 | ||
|     |  labels you want the model to predict. This could be a part-of-speech tag,
 | ||
|     |  a named entity or any other information.
 | ||
| 
 | ||
| p
 | ||
|     |  The model is then shown the unlabelled text and will make a prediction.
 | ||
|     |  Because we know the correct answer, we can give the model feedback on its
 | ||
|     |  prediction in the form of an #[strong error gradient] of the
 | ||
|     |  #[strong loss function] that calculates the difference between the training
 | ||
|     |  example and the expected output. The greater the difference, the more
 | ||
|     |  significant the gradient and the updates to our model.
 | ||
| 
 | ||
| +aside
 | ||
|     |  #[strong Training data:] Examples and their annotations.#[br]
 | ||
|     |  #[strong Text:] The input text the model should predict a label for.#[br]
 | ||
|     |  #[strong Label:] The label the model should predict.#[br]
 | ||
|     |  #[strong Gradient:] Gradient of the loss function calculating the
 | ||
|     |  difference between input and expected output.
 | ||
| 
 | ||
| +graphic("/assets/img/training.svg")
 | ||
|     include ../../assets/img/training.svg
 | ||
| 
 | ||
| p
 | ||
|     |  When training a model, we don't just want it to memorise our examples –
 | ||
|     |  we want it to come up with theory that can be
 | ||
|     |  #[strong generalised across other examples]. After all, we don't just want
 | ||
|     |  the model to learn that this one instance of "Amazon" right here is a
 | ||
|     |  company – we want it to learn that "Amazon", in contexts #[em like this],
 | ||
|     |  is most likely a company. That's why the training data should always be
 | ||
|     |  representative of the data we want to process. A model trained on
 | ||
|     |  Wikipedia, where sentences in the first person are extremely rare, will
 | ||
|     |  likely perform badly on Twitter. Similarly, a model trained on romantic
 | ||
|     |  novels will likely perform badly on legal text.
 | ||
| 
 | ||
| p
 | ||
|     |  This also means that in order to know how the model is performing,
 | ||
|     |  and whether it's learning the right things, you don't only need
 | ||
|     |  #[strong training data] – you'll also need #[strong evaluation data]. If
 | ||
|     |  you only test the model with the data it was trained on, you'll have no
 | ||
|     |  idea how well it's generalising. If you want to train a model from scratch,
 | ||
|     |  you usually need at least a few hundred examples for both training and
 | ||
|     |  evaluation. To update an existing model, you can already achieve decent
 | ||
|     |  results with very few examples – as long as they're representative.
 |