Learning rate

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.[1]

Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns".
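
As a minimal sketch (not drawn from the cited sources), plain gradient descent on a one-dimensional quadratic loss shows the learning rate acting as the step size; the loss, the variable names and the constants below are purely illustrative:

    # Minimal gradient descent sketch: the learning rate eta scales every step.
    # The quadratic loss and all names here are illustrative assumptions.
    def grad(theta):
        # Gradient of the illustrative loss L(theta) = (theta - 3)**2
        return 2.0 * (theta - 3.0)

    eta = 0.1      # learning rate: how far to move against the gradient
    theta = 0.0    # initial parameter value

    for _ in range(100):
        theta = theta - eta * grad(theta)   # step toward the minimum

    print(theta)   # approaches the minimiser theta = 3

With a much larger eta the iterates overshoot and oscillate around the minimum; with a much smaller eta they converge very slowly, which is the trade-off discussed below.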

In the adaptive control literature, the learning rate is commonly referred to as gain.[3]

To achieve faster convergence, prevent oscillations, and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either in accordance with a learning rate schedule or by using an adaptive learning rate.[4]

The learning rate and its adjustments may also differ per parameter, in which case they form a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method.[5]
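
To make the connection concrete (a sketch in standard notation, not taken from the cited sources): Newton's method updates the parameters as θ ← θ − H⁻¹ ∇L(θ), where H is the Hessian of the loss L, while a per-parameter learning rate replaces H⁻¹ with a diagonal matrix of step sizes, θ ← θ − diag(η₁, …, η_d) ∇L(θ), so each η_i plays the role of an approximate inverse curvature along the i-th parameter direction.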

The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.[6][7]

The initial rate can be left as a system default or can be selected using a range of techniques.

There are many different learning rate schedules, but the most common are time-based, step-based, and exponential.[4]

Decay serves to settle the learning in a good region and avoid oscillations, a situation that may arise when a too-high constant learning rate makes the learning jump back and forth over a minimum; decay is controlled by a hyperparameter.

Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error).

Momentum both speeds up learning (effectively increasing the learning rate) when the error-cost gradient heads in the same direction for a long time and avoids local minima by 'rolling over' small bumps.

Momentum is controlled by a hyperparameter analogous to the ball's mass, which must be chosen manually: too high and the ball will roll over minima we wish to find; too low and it will not fulfil its purpose.

The formula for factoring in momentum is more complex than that for decay but is most often built into deep learning libraries such as Keras.
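
As an illustrative sketch (not the exact formulation used by any particular library), classical momentum keeps a running "velocity" that accumulates past gradients; the coefficient mu below plays the role of the ball's mass described above, and all names and values are assumptions:

    # Illustrative SGD-with-momentum update on a toy one-dimensional loss.
    def grad(theta):
        # Gradient of the illustrative loss L(theta) = (theta - 3)**2
        return 2.0 * (theta - 3.0)

    eta = 0.01       # learning rate
    mu = 0.9         # momentum coefficient (the "mass" hyperparameter)
    velocity = 0.0   # accumulated update direction
    theta = 0.0      # parameter being optimised

    for _ in range(200):
        velocity = mu * velocity - eta * grad(theta)  # accumulate past steps
        theta = theta + velocity                      # move by the velocity

    print(theta)   # settles near the minimiser theta = 3

When successive gradients point the same way the velocity grows, effectively enlarging the step; when they flip sign the velocity damps the oscillation.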

Factoring in the decay, the mathematical formula for a time-based learning rate is:

η_{n+1} = η_n / (1 + d·n)

where η is the learning rate, d is a decay parameter and n is the iteration step.

For a step-based schedule, the learning rate follows:

η_n = η_0 · d^⌊(1 + n)/r⌋

where η_n is the learning rate at iteration n, η_0 is the initial learning rate, d is how much the learning rate should change at each drop (0.5 corresponds to a halving) and r is the drop rate, or how often the rate should be dropped (10 corresponds to a drop every 10 iterations).
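
The schedules above translate directly into code; the following sketch (illustrative helper names, not a library API) implements the time-based and step-based formulas, together with a standard form of the exponential variant mentioned earlier:

    import math

    # Illustrative learning rate schedules; the function names are assumptions.
    def time_based(eta, d, n):
        # eta: learning rate of the previous iteration, d: decay parameter,
        # n: iteration step
        return eta / (1.0 + d * n)

    def step_based(eta0, d, r, n):
        # eta0: initial rate, d: change per drop (0.5 halves the rate),
        # r: how many iterations between drops
        return eta0 * d ** math.floor((1 + n) / r)

    def exponential(eta0, d, n):
        # decreasing exponential with decay parameter d (a common formulation)
        return eta0 * math.exp(-d * n)

For example, step_based(0.1, 0.5, 10, 25) returns 0.1 · 0.5² = 0.025, i.e. the rate has been halved twice after 25 iterations.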

The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on the problem at hand or the model used.

To combat this, there are many different types of adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop, and Adam,[9] which are generally built into deep learning libraries such as Keras.
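
For example, one of these built-in optimizers can be selected in Keras with a single argument (a minimal sketch assuming a TensorFlow-backed Keras installation; the tiny model and loss are illustrative):

    import tensorflow as tf

    # A tiny illustrative model compiled with the Adam adaptive optimizer;
    # Adam derives per-parameter effective step sizes from one base learning rate.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")

Swapping in Adagrad, Adadelta or RMSprop is a matter of choosing a different optimizer class with its own learning rate argument.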