//- 💫 DOCS > USAGE > TRAINING > OPTIMIZATION TIPS AND ADVICE

p
    |  There are lots of conflicting "recipes" for training deep neural
    |  networks at the moment. The cutting-edge models take a very long time to
    |  train, so most researchers can't run enough experiments to figure out
    |  what's #[em really] going on. For what it's worth, here's a recipe that
    |  seems to work well on a lot of NLP problems:

+list("numbers")
    +item
        |  Initialise with batch size 1, and compound to a maximum determined
        |  by your data size and problem type.

    +item
        |  Use the Adam solver with a fixed learning rate.

    +item
        |  Use averaged parameters.

    +item
        |  Use L2 regularization.

    +item
        |  Clip gradients by L2 norm to 1.

    +item
        |  On small data sizes, start at a high dropout rate, with linear decay.

p
    |  This recipe has been cobbled together experimentally. Here's why the
    |  various elements of the recipe made enough sense to try initially, and
    |  what you might try changing, depending on your problem.
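
p
    |  To make this concrete, here's one way a minimal training loop could
    |  string these points together. The blank #[code 'en'] pipeline, the
    |  #[code textcat] component, its #[code POSITIVE] label and the toy
    |  #[code train_data] are placeholders – swap in your own pipeline and
    |  annotations.

+code("Recipe sketch").
    import random
    import spacy
    from spacy.util import minibatch, compounding, decaying

    # toy data: replace with your own (text, annotations) pairs
    train_data = [('flavour is amazing', {'cats': {'POSITIVE': 1.0}}),
                  ('the plot was boring', {'cats': {'POSITIVE': 0.0}})]

    nlp = spacy.blank('en')                     # placeholder: your pipeline here
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat)
    textcat.add_label('POSITIVE')               # placeholder label

    optimizer = nlp.begin_training()            # Adam with a fixed learning rate
    dropout = decaying(0.6, 0.2, 1e-4)          # high dropout first, then decay
    for epoch in range(10):
        random.shuffle(train_data)
        # compound the batch size from 1 up to a maximum (see the next section)
        for batch in minibatch(train_data, size=compounding(1., 64., 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, drop=next(dropout), sgd=optimizer)
        # save with the averaged parameters (see "Parameter averaging" below)
        with nlp.use_params(optimizer.averages):
            nlp.to_disk('/model-epoch-%d' % epoch)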

+h(3, "tips-batch-size") Compounding batch size

p
    |  The trick of increasing the batch size is starting to become quite
    |  popular (see #[+a("https://arxiv.org/abs/1711.00489") Smith et al., 2017]).
    |  Their recipe is quite different from how spaCy's models are being
    |  trained, but there are some similarities. In training the various spaCy
    |  models, we haven't found much advantage from decaying the learning
    |  rate – but starting with a low batch size has definitely helped. You
    |  should try it out on your data, and see how you go. Here's our current
    |  strategy:

+code("Batch heuristic").
    from spacy.util import minibatch, compounding

    def get_batches(train_data, model_type):
        max_batch_sizes = {'tagger': 32, 'parser': 16, 'ner': 16, 'textcat': 64}
        max_batch_size = max_batch_sizes[model_type]
        # smaller datasets get smaller maximum batch sizes
        if len(train_data) < 1000:
            max_batch_size /= 2
        if len(train_data) < 500:
            max_batch_size /= 2
        # compound the batch size from 1 towards the maximum, by a factor of
        # 1.001 per batch
        batch_size = compounding(1, max_batch_size, 1.001)
        batches = minibatch(train_data, size=batch_size)
        return batches

p
    |  This will set the batch size to start at #[code 1], and increase each
    |  batch until it reaches a maximum size. The tagger, parser and entity
    |  recognizer all take whole sentences as input, so they're learning a lot
    |  of labels in a single example. You therefore need smaller batches for
    |  them. The batch size for the text categorizer should be somewhat larger,
    |  especially if your documents are long.
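
p
    |  As a quick illustration – assuming #[code train_data] is a list of
    |  #[code (text, annotations)] pairs and #[code nlp] and #[code optimizer]
    |  are already set up as above – one pass over the data with this heuristic
    |  could look like:

+code.
    for batch in get_batches(train_data, 'textcat'):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer)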

+h(3, "tips-hyperparams") Learning rate, regularization and gradient clipping

p
    |  By default spaCy uses the Adam solver, with default settings
    |  (learning rate #[code 0.001], #[code beta1=0.9], #[code beta2=0.999]).
    |  Some researchers have said they found these settings terrible on their
    |  problems – but they've always performed very well in training spaCy's
    |  models, in combination with the rest of our recipe. You can change these
    |  settings directly, by modifying the corresponding attributes on the
    |  #[code optimizer] object. You can also set environment variables to
    |  adjust the defaults.

p
    |  There are two other key hyper-parameters of the solver: #[code L2]
    |  #[strong regularization], and #[strong gradient clipping]
    |  (#[code max_grad_norm]). Gradient clipping is a hack that's not
    |  discussed often, but that everybody seems to be using. It's quite
    |  important in helping to ensure the network doesn't diverge, which is a
    |  fancy way of saying "fall over during training". The effect is sort of
    |  similar to setting the learning rate low. It can also compensate for a
    |  large batch size (this is a good example of how the choices of all
    |  these hyper-parameters intersect).
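
p
    |  As a rough sketch, adjusting these values might look like the snippet
    |  below. The attribute names mirror the hyper-parameters discussed above,
    |  but they can differ between versions of spaCy and Thinc, so check the
    |  #[code optimizer] object in your installation before relying on them.
    |  The values shown are illustrative, not recommendations.

+code.
    optimizer = nlp.begin_training()
    # illustrative settings – tune these for your own problem
    optimizer.L2 = 1e-6            # strength of the L2 regularization
    optimizer.max_grad_norm = 1.0  # clip gradients by L2 norm to 1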

+h(3, "tips-dropout") Dropout rate

p
    |  For small datasets, it's useful to set a
    |  #[strong high dropout rate at first], and #[strong decay] it down towards
    |  a more reasonable value. This helps avoid the network immediately
    |  overfitting, while still encouraging it to learn some of the more
    |  interesting things in your data. spaCy comes with a
    |  #[+api("top-level#util.decaying") #[code decaying]] utility function to
    |  facilitate this. You might try setting:

+code.
    from spacy.util import decaying
    # dropout starts at 0.6 and decays towards 0.2 (decay rate 1e-4)
    dropout = decaying(0.6, 0.2, 1e-4)

p
    |  You can then draw values from the iterator with #[code next(dropout)],
    |  which you would pass to the #[code drop] keyword argument of
    |  #[+api("language#update") #[code nlp.update]]. It's pretty much always a
    |  good idea to use at least #[strong some dropout]. All of the models
    |  currently use Bernoulli dropout, for no particularly principled reason –
    |  we just haven't experimented with another scheme like Gaussian dropout
    |  yet.
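
p
    |  For instance, inside your loop over minibatches – with #[code texts],
    |  #[code annotations] and #[code optimizer] coming from your own training
    |  setup – each update draws the next value from the schedule:

+code.
    drop_rate = next(dropout)   # a little lower on every draw
    nlp.update(texts, annotations, drop=drop_rate, sgd=optimizer)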

+h(3, "tips-param-avg") Parameter averaging

p
    |  The last part of our optimization recipe is #[strong parameter averaging],
    |  an old trick introduced by
    |  #[+a("https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf") Freund and Schapire (1999)],
    |  popularised in the NLP community by
    |  #[+a("http://www.aclweb.org/anthology/P04-1015") Collins (2002)],
    |  and explained in more detail by
    |  #[+a("http://leon.bottou.org/projects/sgd") Leon Bottou]. Just about the
    |  only other people who seem to be using this for neural network training
    |  are the SyntaxNet team (one of whom is Michael Collins) – but it really
    |  seems to work great on every problem.

p
    |  The trick is to store the moving average of the weights during training.
    |  We don't optimize this average – we just track it. Then, when we want to
    |  actually use the model, we use the averages, not the most recent values.
    |  In spaCy (and #[+a(gh("thinc")) Thinc]) this is done by using a
    |  context manager, #[+api("language#use_params") #[code use_params]], to
    |  temporarily replace the weights:

+code.
    # the averaged weights are only active inside the "with" block
    with nlp.use_params(optimizer.averages):
        nlp.to_disk('/model')

p
    |  The context manager is handy because you naturally want to evaluate and
    |  save the model at various points during training (e.g. after each epoch).
    |  After evaluating and saving, the context manager will exit and the
    |  weights will be restored, so you resume training from the most recent
    |  values, rather than the average. By evaluating the model after each
    |  epoch, you can remove one hyper-parameter from consideration (the number
    |  of epochs). Having one less magic number to guess is extremely nice – so
    |  having the averaging under a context manager is very convenient.
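
p
    |  In a full training loop, this typically sits at the end of each epoch:
    |  evaluate and save with the averaged weights, then carry on training with
    |  the raw ones. In the sketch below, #[code train_one_epoch] and
    |  #[code n_epochs] are stand-ins for your own batching and update code.

+code.
    for epoch in range(n_epochs):
        train_one_epoch(nlp, optimizer, train_data)   # your batched nlp.update() calls
        with nlp.use_params(optimizer.averages):
            nlp.to_disk('/model-epoch-%d' % epoch)    # save the averaged weights
            # ...evaluate on held-out data here, also with the averaged weights...
        # on exit, the most recent raw weights are restored and training resumes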

+h(3, "tips-transfer-learning") Transfer learning

p
    |  Finally, if you're training from a small data set, it's very useful to
    |  start off with some knowledge already in the model. #[strong Word vectors]
    |  are an easy and reliable way to do that, but depending on the
    |  application, you may also be able to start with useful knowledge from one
    |  of spaCy's #[+a("/models") pre-trained models], such as the parser,
    |  entity recogniser and tagger. If you're adapting a pre-trained model and
    |  you want it to retain accuracy on the tasks it was originally trained
    |  for, you should consider the "catastrophic forgetting" problem.
    |  #[+a("https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting", true) See this blog post]
    |  to read more about the problem and our suggested solution,
    |  pseudo-rehearsal.
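
p
    |  As a starting point, you can load one of the pre-trained pipelines and
    |  reuse its vectors and components while you train your own. The
    |  #[code en_core_web_md] package below is just one example of a model
    |  that ships with word vectors:

+code.
    import spacy

    # start from a pre-trained pipeline instead of a blank model
    nlp = spacy.load('en_core_web_md')
    print(nlp.vocab.vectors.shape)   # the word vectors your components can reuse
    # if you keep training the existing tagger, parser or entity recogniser,
    # mix in examples of their original predictions to limit catastrophic
    # forgetting (see the blog post linked above)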