The method involves starting with a relatively large estimate of the step size for movement along the line search direction, and iteratively shrinking the step size (i.e., "backtracking") until a decrease of the objective function is observed that adequately corresponds to the amount of decrease that is expected, based on the step size and the local gradient of the objective function.
Backtracking line search is typically used for gradient descent (GD), but it can also be used in other contexts.
Given a starting position x and a search direction p, the task of a line search is to determine a step size α > 0 that adequately reduces the objective function f (assumed continuously differentiable), i.e., to find a value of α that reduces f(x + αp) relative to f(x). The goal, then, is just to identify a value of α that provides a reasonable amount of improvement in the objective function, rather than to find the actual minimizing value of α.
Based on a selected control parameter c ∈ (0,1), the Armijo–Goldstein condition tests whether moving from x to x + αp achieves an adequately corresponding decrease in the objective function; it is fulfilled if f(x + αp) ≤ f(x) + αcm, where m = p^T ∇f(x) < 0 is the local slope along the search direction. This condition, when used appropriately as part of a line search, can ensure that the step size is not excessively large. However, it is not sufficient on its own to ensure a nearly optimal step size, since any sufficiently small value of α satisfies it.
Thus, the backtracking line search strategy starts with a relatively large step size and repeatedly shrinks it by a factor τ ∈ (0,1) until the Armijo–Goldstein condition is fulfilled.
The detailed steps are thus, see Armijo (1966), Bertsekas (2016): choose an initial point x_0 and set n = 0; at each iteration, choose a descent direction p_n (for gradient descent, p_n = −∇f(x_n)), determine a step size α_n by the backtracking line search above, update x_{n+1} = x_n + α_n p_n, and increment n until a stopping criterion is met. To assure good behavior, it is necessary that some conditions are satisfied by the search directions p_n and the initial step size α_0.
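As a concrete illustration of the procedure described above, the following minimal Python sketch implements the Armijo–Goldstein test with shrinking factor τ, together with the gradient descent driver that calls it at every iteration. The parameter values (alpha0, tau, c), the number of iterations and the quadratic test function are arbitrary illustrative choices, not part of the method itself.

import numpy as np

def backtracking_line_search(f, grad_fx, x, p, alpha0=1.0, tau=0.5, c=1e-4):
    # Shrink the candidate step size by a factor tau until the
    # Armijo-Goldstein condition f(x + a*p) <= f(x) + a*c*m holds,
    # where m = <p, grad f(x)> < 0 is the local slope along p.
    m = np.dot(p, grad_fx)
    fx = f(x)
    alpha = alpha0
    while f(x + alpha * p) > fx + alpha * c * m:
        alpha *= tau
    return alpha

def gradient_descent(f, grad, x0, n_iter=100):
    # Iterate x_{n+1} = x_n + alpha_n * p_n with p_n = -grad f(x_n),
    # where alpha_n is chosen by backtracking line search.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        p = -g
        alpha = backtracking_line_search(f, g, x, p)
        x = x + alpha * p
    return x

# Illustrative usage on a quadratic with minimum at (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 10.0 * (x[1] + 2.0)])
print(gradient_descent(f, grad, x0=[0.0, 0.0]))  # approaches [1.0, -2.0]

In a practical implementation one would also add a stopping criterion, for example based on the gradient norm; it is omitted here for brevity.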
For gradient descent, where p_n = −∇f(x_n), an interesting question is how large learning rates can be chosen in Armijo's condition (that is, when one has no limit on α_0 as defined in the section "Function minimization using backtracking line search in practice"), since larger learning rates when x_n is close to the limit point (if one exists) can make convergence faster.
An upper bound for learning rates is shown to exist if one wants the constructed sequence x_n to converge to a non-degenerate critical point, see Truong & Nguyen (2020): the learning rates must be bounded from above roughly by ||H|| × ||H^{-1}||^2, where H is the Hessian of the function at the limit point, H^{-1} is its inverse, and ||·|| denotes the operator norm.
An argument against the use of backtracking line search, in particular in large-scale optimisation, is that satisfying Armijo's condition is expensive.
A way to save time, so-called two-way backtracking, is described in Truong & Nguyen (2020). (There, one can also find good/stable implementations of Armijo's condition and its combination with some popular algorithms such as Momentum and NAG, on datasets such as CIFAR10 and CIFAR100.) The idea is that if the sequence x_n converges (as wished when one makes use of an iterative optimisation method), then the sequence of learning rates α_n should vary little when n is large enough; hence, instead of always starting the search for α_n from α_0, one starts it from the learning rate α_{n−1} accepted at the previous iteration and allows it to increase as well as decrease.
One can save time further by a hybrid mixture of two-way backtracking and the basic standard gradient descent algorithm. Roughly speaking, one runs two-way backtracking a few times, then keeps the learning rate obtained unchanged, switching back to backtracking only if the function value increases.
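The two-way variant described above can be sketched in the same Python style as before. This sketch is only a reading of the verbal description given here: the rule for enlarging the learning rate, its cap at alpha0 and the parameter values are assumptions of the sketch, not a transcription of any particular published implementation.

import numpy as np

def two_way_backtracking(f, grad_fx, x, p, alpha_prev, alpha0=1.0, tau=0.5, c=1e-4):
    # Start from the learning rate accepted at the previous iteration and
    # let it grow (up to alpha0) while Armijo's condition still holds,
    # or shrink it as in ordinary backtracking when the condition fails.
    m = np.dot(p, grad_fx)
    fx = f(x)
    armijo = lambda a: f(x + a * p) <= fx + a * c * m
    alpha = alpha_prev
    if armijo(alpha):
        while alpha / tau <= alpha0 and armijo(alpha / tau):
            alpha /= tau
    else:
        while not armijo(alpha):
            alpha *= tau
    return alpha

def gradient_descent_two_way(f, grad, x0, n_iter=100, alpha0=1.0):
    # Gradient descent that carries the accepted learning rate from one
    # iteration to the next as the starting point of the search.
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    for _ in range(n_iter):
        g = grad(x)
        p = -g
        alpha = two_way_backtracking(f, g, x, p, alpha_prev=alpha, alpha0=alpha0)
        x = x + alpha * p
    return x

The hybrid mentioned above would simply stop calling two_way_backtracking after a few iterations and reuse the last accepted learning rate, re-entering the search only when the function value increases.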
Backtracking line search and its modifications also enjoy strong theoretical guarantees, both for convergence to critical points and for avoidance of saddle points. The latter matters because in large-scale (i.e. high-dimensional) optimisation one is likely to see more saddle points than minima; in the setting of deep learning, saddle points are also prevalent, see Dauphin et al. (2014).
For convergence to critical points: if the cost function is real analytic, convergence is guaranteed via the Łojasiewicz inequality; for non-smooth functions satisfying the Łojasiewicz inequality, this convergence guarantee is extended, see Attouch, Bolte & Svaiter (2011).
For a function with at most countably many critical points (such as a Morse function) and compact sublevel sets, with Lipschitz continuous gradient, if one uses standard GD with learning rate <1/L (see the section "Stochastic gradient descent"), then convergence is guaranteed, see for example Chapter 12 in Lange (2013).
Convergence is similarly guaranteed for other modifications of backtracking line search (such as unbounded backtracking gradient descent mentioned in the section "Upper bound for learning rates"), see Truong & Nguyen (2020), and even if the function has uncountably many critical points one can still deduce some non-trivial facts about convergence behaviour.
In the stochastic setting, under the same assumption that the gradient is Lipschitz continuous, suppose one uses a more restrictive version of the diminishing learning rate scheme (see the section "Stochastic gradient descent"), requiring in addition that the sum of the learning rates is infinite and the sum of their squares is finite (for example, α_n = C/n), and that the function is moreover strictly convex. Then convergence is established in the well-known result of Robbins & Monro (1951); see Bertsekas & Tsitsiklis (2006) for generalisations to less restrictive versions of a diminishing learning rate scheme.
For avoidance of saddle points: if the gradient of the cost function is Lipschitz continuous and one uses standard GD with learning rate <1/L, then with a random choice of initial point x_0 (more precisely, outside a set of Lebesgue measure zero), the sequence constructed will not converge to a non-degenerate saddle point (proven in Lee et al. (2016)), and more generally it is also true that the sequence constructed will not converge to a degenerate saddle point (proven in Panageas & Piliouras (2017)).
Under the same assumption that the gradient is Lipschitz continuous, if one uses a diminishing learning rate scheme (see the section "Stochastic gradient descent"), then avoidance of saddle points is established in Panageas, Piliouras & Wang (2019).
If the gradient of the cost function is Lipschitz continuous with Lipschitz constant L, then choosing the learning rate to be a constant of the size 1/L yields a special case of backtracking line search for gradient descent. If, however, the gradient is not globally Lipschitz continuous, this special scheme has no convergence guarantee: for example, for the cost function f(t) = |t|^1.5, for whatever constant learning rate one chooses, with a random initial point the sequence constructed by this special scheme does not converge to the global minimum 0.
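The failure of the constant-learning-rate scheme on such a function, and the contrast with backtracking, can be seen in a short numerical sketch; the learning rate, starting point and iteration counts below are arbitrary illustrative choices.

import numpy as np

f = lambda t: abs(t) ** 1.5
df = lambda t: 1.5 * np.sign(t) * abs(t) ** 0.5  # derivative; not Lipschitz near t = 0

# Constant learning rate: the iterates settle into an oscillation around 0
# (of amplitude roughly 0.5625 * eta**2) and never converge to the minimum.
eta, t = 0.1, 1.0
for _ in range(200):
    t = t - eta * df(t)
print(t)  # stays bounded away from 0, however many iterations are run

# Backtracking line search on the same function drives the iterate towards 0.
def armijo_step(t, alpha0=1.0, tau=0.5, c=1e-4):
    g = df(t)
    m = -g * g  # slope along the descent direction p = -g
    alpha = alpha0
    while f(t - alpha * g) > f(t) + alpha * c * m:
        alpha *= tau
    return alpha

t = 1.0
for _ in range(200):
    t = t - armijo_step(t) * df(t)
print(t)  # many orders of magnitude closer to the global minimum 0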
If one does not insist that the learning rate be bounded by 1/L, this special scheme is much older, having been used at least since Cauchy in 1847; it can be called standard GD (not to be confused with stochastic gradient descent, which is abbreviated herein as SGD).
Even if the cost function has a globally Lipschitz continuous gradient, a good estimate of the Lipschitz constant for the cost functions in deep learning may not be feasible or desirable, given the very high dimensions of deep neural networks.
One alternative is the so-called adaptive standard GD or SGD; some representatives are Adam, Adadelta, RMSProp and so on, see the article on Stochastic gradient descent. In adaptive standard GD or SGD, learning rates are allowed to vary at each iteration step n, but in a different manner from backtracking line search for gradient descent.
In summary, backtracking line search (and its modifications) is a method which is easy to implement, is applicable to very general functions, has very good theoretical guarantees (for both convergence to critical points and avoidance of saddle points) and works well in practice. Several other methods which have good theoretical guarantees, such as diminishing learning rates or standard GD with learning rate <1/L (both of which require the gradient of the objective function to be Lipschitz continuous), turn out to be special cases of backtracking line search or to satisfy Armijo's condition.
Even though a priori one needs the cost function to be continuously differentiable to apply this method, in practice one can apply it successfully also for functions which are continuously differentiable on a dense open subset, such as f(t) = |t| or f(t) = ReLU(t) = max{t, 0}.