//- 💫 DOCS > USAGE > TRAINING > OPTIMIZATION TIPS AND ADVICE

p
    | There are lots of conflicting "recipes" for training deep neural
    | networks at the moment. The cutting-edge models take a very long time to
    | train, so most researchers can't run enough experiments to figure out
    | what's #[em really] going on. For what it's worth, here's a recipe that
    | seems to work well on a lot of NLP problems:

+list("numbers")
    +item
        | Initialise with batch size 1, and compound to a maximum determined
        | by your data size and problem type.

    +item
        | Use Adam solver with fixed learning rate.

    +item
        | Use averaged parameters.

    +item
        | Use L2 regularization.

    +item
        | Clip gradients by L2 norm to 1.

    +item
        | On small data sizes, start at a high dropout rate, with linear decay.

p
    | This recipe has been cobbled together experimentally. Here's why the
    | various elements of the recipe made enough sense to try initially, and
    | what you might try changing, depending on your problem.

+h(3, "tips-batch-size") Compounding batch size

p
    | The trick of increasing the batch size is starting to become quite
    | popular (see #[+a("https://arxiv.org/abs/1711.00489") Smith et al., 2017]).
    | Their recipe is quite different from how spaCy's models are being
    | trained, but there are some similarities. In training the various spaCy
    | models, we haven't found much advantage from decaying the learning
    | rate – but starting with a low batch size has definitely helped. You
    | should try it out on your data, and see how you go. Here's our current
    | strategy:

+code("Batch heuristic").
    from spacy.util import minibatch, compounding

    def get_batches(train_data, model_type):
        # Maximum batch size per pipeline component
        max_batch_sizes = {'tagger': 32, 'parser': 16, 'ner': 16, 'textcat': 64}
        max_batch_size = max_batch_sizes[model_type]
        # Use a smaller maximum for smaller datasets
        if len(train_data) < 1000:
            max_batch_size /= 2
        if len(train_data) < 500:
            max_batch_size /= 2
        # Start at batch size 1 and compound up to the maximum
        batch_size = compounding(1, max_batch_size, 1.001)
        batches = minibatch(train_data, size=batch_size)
        return batches

p
    | This will set the batch size to start at #[code 1], and increase each
    | batch until it reaches a maximum size. The tagger, parser and entity
    | recognizer all take whole sentences as input, so they're learning a lot
    | of labels in a single example. You therefore need smaller batches for
    | them. The batch size for the text categorizer should be somewhat larger,
    | especially if your documents are long.

+h(3, "tips-hyperparams") Learning rate, regularization and gradient clipping

p
    | By default spaCy uses the Adam solver, with default settings
    | (learning rate #[code 0.001], #[code beta1=0.9], #[code beta2=0.999]).
    | Some researchers have said they found these settings terrible on their
    | problems – but they've always performed very well in training spaCy's
    | models, in combination with the rest of our recipe. You can change these
    | settings directly, by modifying the corresponding attributes on the
    | #[code optimizer] object. You can also set environment variables, to
    | adjust the defaults.

p
    | There are two other key hyper-parameters of the solver: #[code L2]
    | #[strong regularization], and #[strong gradient clipping]
    | (#[code max_grad_norm]). Gradient clipping is a hack that's not
    | discussed often, but everybody seems to be using it. It's quite
    | important in helping to ensure the network doesn't diverge, which is a
    | fancy way of saying "fall over during training". The effect is sort of
    | similar to setting the learning rate low. It can also compensate for a
    | large batch size (this is a good example of how the choices of all
    | these hyper-parameters intersect).
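p
    | As a minimal sketch of adjusting these values in a training script: the
    | attribute names below (#[code alpha], #[code b1], #[code b2],
    | #[code L2], #[code max_grad_norm]) are those exposed by the Thinc
    | optimizer at the time of writing; if your version differs, inspect the
    | #[code optimizer] object to find the corresponding attributes.

+code("Adjusting the optimizer (sketch)").
    import spacy

    nlp = spacy.blank('en')
    optimizer = nlp.begin_training()

    # Assumes the Thinc Adam optimizer's attribute names; check your version.
    optimizer.alpha = 0.001         # learning rate
    optimizer.b1 = 0.9              # first momentum term
    optimizer.b2 = 0.999            # second momentum term
    optimizer.L2 = 1e-6             # L2 regularization penalty
    optimizer.max_grad_norm = 1.0   # clip gradients by L2 norm to 1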
+h(3, "tips-dropout") Dropout rate p | For small datasets, it's useful to set a | #[strong high dropout rate at first], and #[strong decay] it down towards | a more reasonable value. This helps avoid the network immediately | overfitting, while still encouraging it to learn some of the more | interesting things in your data. spaCy comes with a | #[+api("top-level#util.decaying") #[code decaying]] utility function to | facilitate this. You might try setting: +code. from spacy.util import decaying dropout = decaying(0.6, 0.2, 1e-4) p | You can then draw values from the iterator with #[code next(dropout)], | which you would pass to the #[code drop] keyword argument of | #[+api("language#update") #[code nlp.update]]. It's pretty much always a | good idea to use at least #[strong some dropout]. All of the models | currently use Bernoulli dropout, for no particularly principled reason – | we just haven't experimented with another scheme like Gaussian dropout | yet. +h(3, "tips-param-avg") Parameter averaging p | The last part of our optimization recipe is #[strong parameter averaging], | an old trick introduced by | #[+a("https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf") Freund and Schapire (1999)], | popularised in the NLP community by | #[+a("http://www.aclweb.org/anthology/P04-1015") Collins (2002)], | and explained in more detail by | #[+a("http://leon.bottou.org/projects/sgd") Leon Bottou]. Just about the | only other people who seem to be using this for neural network training | are the SyntaxNet team (one of whom is Michael Collins) – but it really | seems to work great on every problem. p | The trick is to store the moving average of the weights during training. | We don't optimize this average – we just track it. Then when we want to | actually use the model, we use the averages, not the most recent value. | In spaCy (and #[+a(gh("thinc")) Thinc]) this is done by using a | context manager, #[+api("language#use_params") #[code use_params]], to | temporarily replace the weights: +code. with nlp.use_params(optimizer.averages): nlp.to_disk('/model') p | The context manager is handy because you naturally want to evaluate and | save the model at various points during training (e.g. after each epoch). | After evaluating and saving, the context manager will exit and the | weights will be restored, so you resume training from the most recent | value, rather than the average. By evaluating the model after each epoch, | you can remove one hyper-parameter from consideration (the number of | epochs). Having one less magic number to guess is extremely nice – so | having the averaging under a context manager is very convenient. +h(3, "tips-transfer-learning") Transfer learning p | Finally, if you're training from a small data set, it's very useful to | start off with some knowledge already in the model. #[strong Word vectors] | are an easy and reliable way to do that, but depending on the | application, you may also be able to start with useful knowledge from one | of spaCy's #[+a("/models") pre-trained models], such as the parser, | entity recogniser and tagger. If you're adapting a pre-trained model and | you want it to retain accuracy on the tasks it was originally trained | for, you should consider the "catastrophic forgetting" problem. | #[+a("https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting", true) See this blog post] | to read more about the problem and our suggested solution, | pseudo-rehearsal.