Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable).

Especially in high-dimensional optimization problems, replacing the exact gradient (computed from the entire data set) with a stochastic estimate (computed from a randomly selected subset of the data) reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient.

To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step.
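
A minimal sketch of this sampling scheme in plain NumPy; the loop structure is standard, while the function names (sgd, grad_Q) and the least-squares summands are illustrative assumptions rather than anything prescribed by the text:

```python
import numpy as np

def sgd(w, X, y, grad_Q, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: at each step the full-sum gradient is replaced by
    the gradient of a randomly sampled subset of summand functions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)  # sample summands
            g = grad_Q(w, X[idx], y[idx])                        # gradient of the sampled subset
            w = w - lr * g                                       # gradient step
    return w

# Illustrative summands Q_i(w) = 0.5 * (x_i . w - y_i)^2
def grad_Q(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)
```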

This can perform significantly better than "true" stochastic gradient descent with single-example updates, because the code can make use of vectorization libraries rather than computing each step separately; this was first shown in [6], where it was called "the bunch-mode back-propagation algorithm".

Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, demonstrating for the first time the applicability of stochastic gradient descent to neural networks.

Backpropagation was first described in 1986, with stochastic gradient descent being used to efficiently optimize parameters across neural networks with multiple hidden layers.

Soon after, another improvement was developed: mini-batch gradient descent, where small batches of data are substituted for single samples.

In 1997, the practical performance benefits from vectorization achievable with such small batches were first explored,[13] paving the way for efficient optimization in machine learning.

TensorFlow and PyTorch, by far the most popular machine learning libraries,[20] as of 2023 largely include only Adam-derived optimizers, as well as predecessors of Adam such as RMSprop and classic SGD.
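
For illustration, a short sketch of how these optimizers are typically instantiated from PyTorch's torch.optim module; the model and hyperparameter values are arbitrary placeholders, not recommendations from the text:

```python
import torch

model = torch.nn.Linear(10, 1)  # any model exposing parameters()

# Classic SGD (optionally with momentum), RMSprop, and Adam as shipped in torch.optim
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```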

PyTorch also partially supports Limited-memory BFGS, a line-search method, but only for single-device setups without parameter groups.
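
A hedged sketch of the closure-based interface PyTorch uses for L-BFGS; the quadratic objective is only a placeholder, and the single parameter tensor reflects the single-group restriction mentioned above:

```python
import torch

w = torch.zeros(5, requires_grad=True)
opt = torch.optim.LBFGS([w], lr=1.0, max_iter=20, line_search_fn="strong_wolfe")

def closure():
    # L-BFGS re-evaluates the objective during its line search,
    # so loss computation and backward pass are wrapped in a closure.
    opt.zero_grad()
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()
    return loss

opt.step(closure)
```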

Classical stochastic gradient descent is generally sensitive to the learning rate η.

The problem can be largely solved[29] by considering implicit updates, whereby the stochastic gradient is evaluated at the next iterate rather than the current one: w_new := w_old − η ∇Q_i(w_new), where Q_i is the summand sampled at the current step.

Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models.
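
In the least-squares case the implicit update can be solved explicitly; a minimal sketch (the function name and notation are illustrative):

```python
import numpy as np

def isgd_least_squares(w, x_i, y_i, lr):
    """Implicit SGD step for Q_i(w) = 0.5 * (x_i . w - y_i)^2.
    Solving w_new = w - lr * x_i * (x_i . w_new - y_i) for w_new shrinks
    the residual by a factor 1 / (1 + lr * ||x_i||^2)."""
    residual = x_i @ w - y_i
    return w - lr * residual / (1.0 + lr * (x_i @ x_i)) * x_i
```

Because the effective step is divided by 1 + lr·||x_i||², the update stays stable even for large learning rates, which is the robustness the surrounding text alludes to.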

Further proposals include the momentum method, or heavy ball method, which in the machine-learning context appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning[30] and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations.

The parameter vector, thought of as a particle traveling through parameter space,[30] incurs acceleration from the gradient of the loss ("force").
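
A sketch of the classical heavy-ball momentum update, with the velocity playing the role of the particle's momentum; the function name and coefficient values are illustrative:

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    """Heavy-ball / momentum update: the velocity accumulates past
    gradients with exponential decay, and the parameters follow it."""
    velocity = beta * velocity - lr * grad   # "force" from the loss gradient, damped by beta
    w = w + velocity
    return w, velocity
```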

The momentum method is closely related to underdamped Langevin dynamics, and may be combined with simulated annealing.

AdaGrad still has a base learning rate η, but this is multiplied with the elements of a vector {G_j,j} which is the diagonal of the outer product matrix G = Σ_τ g_τ g_τᵀ (summed over the iterations τ up to the current step), where g_τ is the gradient at iteration τ.

This vector essentially stores a historical sum of gradient squares by dimension and is updated after every iteration.
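
A minimal AdaGrad-style step along these lines, accumulating squared gradients per coordinate; the epsilon term and names are conventional but illustrative:

```python
import numpy as np

def adagrad_step(w, G_diag, grad, lr=0.01, eps=1e-8):
    """AdaGrad-style update: G_diag holds the running per-coordinate
    sum of squared gradients, i.e. the diagonal entries G_j,j."""
    G_diag = G_diag + grad ** 2                  # update historical sum of squares
    w = w - lr * grad / (np.sqrt(G_diag) + eps)  # per-coordinate scaled step
    return w, G_diag
```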

RMSProp (for Root Mean Square Propagation) is a method invented in 2012 by James Martens and Ilya Sutskever, at the time both PhD students in Geoffrey Hinton's group, in which the learning rate is, like in Adagrad, adapted for each of the parameters.
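
A sketch of an RMSProp-style step, which replaces AdaGrad's ever-growing sum with an exponentially decaying average of squared gradients; the decay rate 0.9 is a common but illustrative choice:

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp-style update: v is a decaying running mean of squared
    gradients, used to normalize the step per coordinate."""
    v = decay * v + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```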

Adam[41] (short for Adaptive Moment Estimation) is a 2014 update to the RMSProp optimizer combining it with the main feature of the Momentum method.
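
A sketch of an Adam-style step combining the two ideas: a momentum-like first-moment estimate and an RMSProp-like second-moment estimate, each bias-corrected because it starts at zero; the hyperparameters shown are the commonly quoted defaults, used here purely for illustration:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update: m tracks the mean of gradients (momentum),
    v tracks the mean of squared gradients (RMSProp-like scaling)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias correction; t is the 1-based step count
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```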

One such variant is sign-based optimization: even though sign-based methods go back to Rprop, in 2018 researchers tried to simplify Adam by disregarding the magnitude of the stochastic gradient and taking only its sign into account.
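
A sketch of that sign-based simplification (often called signSGD); only the sign of each gradient coordinate drives the step, so every coordinate moves by the same fixed amount:

```python
import numpy as np

def sign_sgd_step(w, grad, lr=0.01):
    """Sign-based update: drop gradient magnitudes and step by a fixed
    amount in the direction of each coordinate's sign."""
    return w - lr * np.sign(grad)
```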

Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance.
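
A sketch of backtracking under the Armijo condition; the loop length is data-dependent, which is the cost the text refers to, and the constants and helper names (f, grad_f) are conventional but illustrative:

```python
import numpy as np

def backtracking_step(w, f, grad_f, alpha0=1.0, c=1e-4, shrink=0.5, max_tries=50):
    """Shrink the step size until the Armijo (sufficient decrease) condition
    f(w - a*g) <= f(w) - c*a*||g||^2 holds, then take the step."""
    g = grad_f(w)
    fw = f(w)
    alpha = alpha0
    for _ in range(max_tries):               # loop length unknown in advance
        if f(w - alpha * g) <= fw - c * alpha * (g @ g):
            break
        alpha *= shrink
    return w - alpha * g
```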

On the other hand, adaptive SGD does not guarantee the "descent property" – which backtracking line search enjoys – which is that f(w_{n+1}) ≤ f(w_n) for all n.

A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer.

Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others.

In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.

When the loss is expressed through a predictive model (e.g., a deep neural network), the objective's structure can be exploited to estimate second-order information using gradients only.
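
One common way to obtain second-order information from gradients alone, offered here as a generic illustration rather than the specific method of the cited works, is a finite-difference Hessian-vector product:

```python
import numpy as np

def hessian_vector_product(grad_f, w, v, eps=1e-5):
    """Approximate H(w) @ v using only two gradient evaluations:
    Hv ~ (grad_f(w + eps*v) - grad_f(w - eps*v)) / (2*eps)."""
    return (grad_f(w + eps * v) - grad_f(w - eps * v)) / (2.0 * eps)
```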

Here the expectation is taken with respect to the random choice of indices in the stochastic gradient descent scheme.

An approximating stochastic differential equation, whose noise term is an Itô integral with respect to a Brownian motion, is a more precise approximation in the sense that there exists a constant bounding the deviation of the SGD iterates from its solution.

[Figure: Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]
[Figure: A graph visualizing the behavior of a selected set of optimizers, using a 3D perspective projection of a loss function f(x, y).]
[Figure: A graph visualizing the behavior of a selected set of optimizers.]