The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.
In a nutshell, stochastic approximation algorithms deal with a function of the form $f(\theta) = \operatorname{E}_{\xi}[F(\theta, \xi)]$, which is the expected value of a function $F$ depending on a random variable $\xi$.
Recently, stochastic approximations have found extensive applications in the fields of statistics and machine learning, especially in settings with big data.
These applications range from stochastic optimization methods and algorithms to online forms of the EM algorithm, reinforcement learning via temporal differences, deep learning, and others.[1] Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,[3] presented a methodology for solving a root finding problem, where the function is represented as an expected value.
Assume we wish to locate the unique root $\theta^*$ of an equation $M(\theta) = \alpha$, where $M(\theta)$ cannot be observed directly but only through noisy measurements $N(\theta)$ with $\operatorname{E}[N(\theta) \mid \theta] = M(\theta)$. The algorithm generates iterates of the form $\theta_{n+1} = \theta_n - a_n\,(N(\theta_n) - \alpha)$, where $a_1, a_2, \ldots$ is a sequence of positive step sizes. If $\alpha = 0$ and $M(\theta) = \nabla f(\theta)$ for a differentiable objective $f$, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function $f(\theta)$.
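As a concrete illustration, the following sketch runs the Robbins–Monro recursion on a hypothetical one-dimensional problem: the function $M(\theta) = 2\theta - 1$ (chosen purely for illustration, with root $\theta^* = 0.5$ at $\alpha = 0$) is observed only through measurements corrupted by Gaussian noise, and the step sizes $a_n = 1/n$ follow the classical schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: M(theta) = 2*theta - 1 has root theta* = 0.5 at alpha = 0,
# but we can only observe N(theta) = M(theta) + Gaussian noise.
def noisy_measurement(theta):
    return 2.0 * theta - 1.0 + rng.normal(scale=0.5)

alpha = 0.0
theta = 0.0                  # arbitrary starting point
for n in range(1, 10001):
    a_n = 1.0 / n            # steps satisfying sum a_n = inf, sum a_n^2 < inf
    theta = theta - a_n * (noisy_measurement(theta) - alpha)

print(theta)                 # close to the true root 0.5
```

With this schedule the iterates converge to the root despite the noise, since the decreasing steps average out the measurement errors over time.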
While the Robbins–Monro algorithm is theoretically able to achieve the $O(1/n)$ convergence rate under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation.
This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.[6][8] Chung (1954)[9] and Fabian (1968)[10] showed that the optimal convergence rate $O(1/\sqrt{n})$ can be achieved with step sizes of the form $a_n = a/n$, where the optimal constant $a$ depends on the derivative $M'(\theta^*)$. Lai and Robbins[11][12] designed adaptive procedures to estimate $M'(\theta^*)$ such that $\theta_n$ has minimal asymptotic variance.
However, the application of such optimal methods requires much a priori information, which is hard to obtain in most situations.
To overcome this shortfall, Polyak (1991)[13] and Ruppert (1988)[14] independently developed a new optimal algorithm based on the idea of averaging the trajectories.
Polyak and Juditsky[15] also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps and averaging of the iterates. The algorithm has the following structure: $\theta_{n+1} = \theta_n - a_n\,(N(\theta_n) - \alpha)$, $\bar{\theta}_n = \frac{1}{n} \sum_{i=0}^{n-1} \theta_i$.[15] Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin[16] for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems.
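A minimal sketch of this averaging scheme, reusing the illustrative noisy measurement from the previous example: the recursion is run with a longer step size $a_n = n^{-2/3}$ (any exponent strictly between $1/2$ and $1$ would serve), and the running average of the iterates is returned instead of the last iterate.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_measurement(theta):          # same illustrative N(theta) as above
    return 2.0 * theta - 1.0 + rng.normal(scale=0.5)

theta = 0.0
theta_bar = 0.0
for n in range(1, 10001):
    a_n = n ** (-2.0 / 3.0)            # longer steps: decreases slower than 1/n
    theta = theta - a_n * noisy_measurement(theta)
    theta_bar += (theta - theta_bar) / n   # running average of the iterates

print(theta_bar)                       # averaged iterate, close to 0.5
```

Averaging trades a slower-moving step schedule for reduced asymptotic variance, which is the mechanism behind the time-scale separation discussed below.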
A more general result is given in Chapter 11 of Kushner and Yin[17] by defining an interpolated time $t_n = \sum_{i=0}^{n-1} a_i$, together with an interpolated process and an interpolated normalized process constructed from the iterates.
The success of the averaging idea is due to the time-scale separation of the original sequence $\{\theta_n\}$ and the averaged sequence $\{\bar{\theta}_n\}$, with the time scale of the former being the faster one.
Suppose we want to solve the stochastic optimization problem $g(\theta^*) = \min_{\theta} \operatorname{E}[Q(\theta, X)]$, where $g(\theta) = \operatorname{E}[Q(\theta, X)]$; if $g$ is differentiable and convex, then this problem is equivalent to finding the root $\theta^*$ of $\nabla g(\theta) = 0$.
In some special cases, when either IPA or likelihood ratio methods are applicable, one is able to obtain an unbiased gradient estimator $H(\theta, X)$. If $X$ is viewed as some "fundamental" underlying random process that is generated independently of $\theta$, and under some regularity conditions for derivative-integral interchange operations so that $\operatorname{E}\!\left[\frac{\partial}{\partial\theta} Q(\theta, X)\right] = \nabla g(\theta)$, then $H(\theta, X) = \frac{\partial}{\partial\theta} Q(\theta, X)$ gives an unbiased gradient estimator.
We then define a recursion analogous to Newton's method in the deterministic algorithm: $\theta_{n+1} = \theta_n - \varepsilon_n H(\theta_n, X_{n+1})$, where $\varepsilon_n$ is a sequence of positive step sizes and $X_{n+1}$ is a fresh sample drawn at step $n$. Sufficient conditions ensuring convergence of this recursion are given, for example, in Kushner and Yin.[17]
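The sketch below instantiates this recursion on an assumed toy problem: $Q(\theta, X) = (\theta - X)^2$ with $X \sim N(3, 1)$, so that $H(\theta, X) = 2(\theta - X)$ is an unbiased estimator of $\nabla g(\theta)$ and the minimizer is $\theta^* = \operatorname{E}[X] = 3$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative problem: Q(theta, X) = (theta - X)^2 with X ~ N(3, 1), so
# g(theta) = E[Q(theta, X)] is minimized at theta* = E[X] = 3.
def H(theta, x):
    return 2.0 * (theta - x)     # unbiased estimator of grad g(theta)

theta = 0.0
for n in range(1, 10001):
    x = rng.normal(loc=3.0)      # fresh sample X_{n+1}, independent of theta
    eps_n = 1.0 / n              # step sizes eps_n
    theta = theta - eps_n * H(theta, x)

print(theta)                     # close to the minimizer 3
```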
The Kiefer–Wolfowitz algorithm was introduced in 1952 by Jack Kiefer and Jacob Wolfowitz, and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function.
The structure of the algorithm follows a gradient-like method, with the iterates being generated as $x_{n+1} = x_n + a_n \left( \frac{N(x_n + c_n) - N(x_n - c_n)}{2 c_n} \right)$, where $N(x)$ denotes a noisy observation of the unknown function $M(x)$, the gradient of $M(x)$ is approximated by a finite difference with widths $c_n$, and $\{a_n\}$ specifies a sequence of positive step sizes taken along that direction.
Kiefer and Wolfowitz proved that $x_n$ converges to the maximizer of $M(x)$ almost surely, provided that the sequences satisfy $c_n \to 0$, $\sum_n a_n = \infty$, $\sum_n a_n c_n < \infty$, and $\sum_n a_n^2 c_n^{-2} < \infty$. A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be $a_n = 1/n$ and $c_n = n^{-1/3}$.
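The following sketch implements the Kiefer–Wolfowitz recursion with the recommended sequences $a_n = 1/n$ and $c_n = n^{-1/3}$ on an assumed toy objective $M(x) = -(x - 2)^2$, observed only through noisy evaluations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative regression function M(x) = -(x - 2)^2, maximized at x* = 2,
# observed only through noisy evaluations N(x).
def N(x):
    return -(x - 2.0) ** 2 + rng.normal(scale=0.3)

x = 0.0
for n in range(1, 20001):
    a_n = 1.0 / n                 # step sizes recommended by Kiefer and Wolfowitz
    c_n = n ** (-1.0 / 3.0)       # finite-difference widths
    grad_est = (N(x + c_n) - N(x - c_n)) / (2.0 * c_n)
    x = x + a_n * grad_est        # ascend the finite-difference gradient estimate

print(x)                          # close to the maximizer 2
```

Note that each iteration requires two function evaluations, and the widths $c_n$ must shrink slowly enough that the finite-difference noise, of order $1/c_n$, is still damped by the steps $a_n$.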
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.[21][22] These methods are also applied in control theory, in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step sizes should not converge to zero but should be chosen so as to track the function ([21], 2nd ed., chapter 3). C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.