How should the learning rate change as the batch size change?
Machine LearningDeep LearningMachine Learning Problem Overview
When I increase/decrease batch size of the mini-batch used in SGD, should I change learning rate? If so, then how?
For reference, I was discussing with someone, and it was said that, when batch size is increased, the learning rate should be decreased by some extent.
My understanding is when I increase batch size, computed average gradient will be less noisy and so I either keep same learning rate or increase it.
Also, if I use an adaptive learning rate optimizer, like Adam or RMSProp, then I guess I can leave learning rate untouched.
Please correct me if I am mistaken and give any insight on this.
Machine Learning Solutions
Solution 1 - Machine Learning
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance in the gradient expectation constant. See page 5 at A. Krizhevsky. One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
However, recent experiments with large mini-batches suggest for a simpler linear scaling rule, i.e multiply your learning rate by k when using mini-batch size of kN. See P.Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour https://arxiv.org/abs/1706.02677
I would say that with using Adam, Adagrad and other adaptive optimizers, learning rate may remain the same if batch size does not change substantially.
Solution 2 - Machine Learning
Learning Rate Scaling for Dummies
I've always found the heuristics which seem to vary somewhere between scale with the square root of the batch size and the batch size to be a bit hand-wavy and fluffy, as is often the case in Deep Learning. Hence I devised my own theoretical framework to answer this question. If you are interested the full pre-print is here (currently under review) https://arxiv.org/abs/2006.09092
Learning Rate is a function of the Largest Eigenvalue
Let me start with two small sub-questions, which answer the main question
- Are there any cases where we can a priori know the optimal learning rate?
Yes, for the convex quadratic, the optimal learning rate is given as 2/(λ+μ), where λ,μ represent the largest and smallest eigenvalues of the Hessian (Hessian = the second derivative of the loss ∇∇L, which is a matrix) respectively.
- How do we expect these eigenvalues (which represent how much the loss changes along a infinitesimal move in the direction of the eigenvectors) to change as a function of batch size?
This is actually a little more tricky to answer (it is what I made the theory for in the first place), but it goes something like this.
Let us imagine that we have all the data and that would give us the full Hessian H. But now instead we only sub-sample this Hessian so we use a batch Hessian B. We can simply re-write B=H+(B-H)=H+E. Where E is now some error or fluctuation matrix.
Under some technical assumptions on the nature of the elements of E, we can assume this fluctations to be a zero mean random matrix and so the Batch Hessian becomes a fixed matrix + a random matrix.
For this model, the change in eigenvalues (which determines how large the learning rate can be) is known. In my paper there is another more fancy model, but the answer is more or less the same.
What actually happens? Experiments and Scaling Rules
I attach a plot of what happens in the case that the largest eigenvalue from the full data matrix is far outside that of the noise matrix (usually the case). As we increase the mini-batch size, the size of the noise matrix decreases and so the largest eigenvalue also decreases in size, hence larger learning rates can be used. This effect is initially proportional and continues to be approximately proportional until a threshold after which no appreciable decrease happens.
How well does this hold in practice? The answer as shown below in my plot on the VGG-16 without batch norm (see paper for batch normalisation and resnets), is very well.
I would hasten to add that for adaptive order methods, the argument is a little different because you have an interplay of the eigenvalues, the estimated eigenvalues and the estimated eigenvectors! so you actually end up getting a square root rule up till a threshold. Quite why nobody is discussing this or has published this result is honestly a little beyond me.
But if you want my practical advice, stick with SGD and just go proportional to the increase in batch size if your batch size is small and then don't increase it beyond a certain point.
Solution 3 - Machine Learning
Apart from the papers mentioned in Dmytro's answer, you can refer to the article of: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018, October). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. The authors give the mathematical and empirical foundation to the idea that the ratio of learning rate to batch size influences the generalization capacity of DNN. They show that this ratio plays a major role in the width of the minima found by SGD. The higher ratio the wider is minima and better generalization.