(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums.
By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are unattainable by methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting.
Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines[1] as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.
A function f is considered to have finite sum structure if it can be decomposed into a summation or average:

f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),

where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theoretical and practical properties arise when the f_i have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants.
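As a concrete illustration (the least-squares objective below is an expository assumption, not taken from the article), linear regression has exactly this structure; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))      # rows a_i are data points
b = rng.standard_normal(n)           # targets

def f(x):
    # Full objective f(x) = (1/n) * sum_i f_i(x)
    return 0.5 * np.mean((A @ x - b) ** 2)

def f_i(x, i):
    # One summand: f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_f_i(x, i):
    # Its gradient, queryable independently of the other terms
    return (A[i] @ x - b[i]) * A[i]

x = rng.standard_normal(d)
# The average of the component values recovers the full objective
assert np.isclose(f(x), np.mean([f_i(x, i) for i in range(n)]))
```

Here every f_i shares the same quadratic form, so the Lipschitz constants of the component gradients are similar whenever the data points have similar norms.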
The finite sum structure should be contrasted with the stochastic approximation setting, which deals with functions of the form

f(x) = \mathbb{E}_{\xi}\left[ f(x, \xi) \right],

which is the expected value of a function depending on a random variable \xi. Any finite sum problem can be optimized using a stochastic approximation algorithm by using f_i with i sampled uniformly at random from \{1, \dots, n\}.
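As a sketch of that reduction (the least-squares components, step size, and iteration count are illustrative assumptions, not from the article), plain stochastic gradient descent treats the finite sum as a stochastic approximation problem by sampling a term uniformly at each step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)  # noisy linear targets

def f(x):
    # f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(d)
for k in range(5000):
    i = rng.integers(n)                    # i ~ Uniform{0, ..., n-1}
    grad_i = (A[i] @ x - b[i]) * A[i]      # unbiased estimate of the full gradient
    x = x - 0.01 * grad_i                  # constant step size (assumed, not tuned)
# With a constant step size, SGD settles in a noise-dominated
# neighborhood of the minimum rather than converging to it exactly.
```

The persistent gradient variance at the minimum is precisely what variance reduction methods eliminate by exploiting the finiteness of the sum.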
Stochastic variance reduced methods without acceleration are able to find a minimum of f to accuracy \epsilon, i.e. f(x) - f(x_*) \leq \epsilon, in a number of steps of the order:

O\left( \left( \frac{L}{\mu} + n \right) \log\left( \frac{1}{\epsilon} \right) \right),

where L is the Lipschitz constant of the gradients and \mu the strong convexity constant. Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only

O\left( \left( \sqrt{\frac{nL}{\mu}} + n \right) \log\left( \frac{1}{\epsilon} \right) \right)

steps to reach f(x) - f(x_*) \leq \epsilon accuracy. Lower complexity bounds[2] for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.
Variance reduction approaches fall within three main categories: table averaging methods, full-gradient snapshot methods, and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.
In the SAGA method,[3] the prototypical table averaging approach, a table of size n is maintained that contains the last gradient witnessed for each f_i term, which we denote g_i. At each step, an index i is sampled, and a new gradient \nabla f_i(x_k) is computed. The iterate x_k is updated with:

x_{k+1} = x_k - \gamma \left[ \nabla f_i(x_k) - g_i + \frac{1}{n} \sum_{j=1}^{n} g_j \right],

after which table entry i is updated with \nabla f_i(x_k). SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method,[4] improving on its flexibility and performance.
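A minimal sketch of the SAGA iteration (the least-squares components, step-size rule, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_f_i(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

L_max = (np.linalg.norm(A, axis=1) ** 2).max()  # largest per-term smoothness
gamma = 1.0 / (3.0 * L_max)                     # a standard SAGA step-size choice

x = np.zeros(d)
table = np.array([grad_f_i(x, i) for i in range(n)])  # last gradient seen per term
table_avg = table.mean(axis=0)

for k in range(10000):
    i = rng.integers(n)
    g_new = grad_f_i(x, i)
    # Variance-reduced, unbiased gradient estimate
    x = x - gamma * (g_new - table[i] + table_avg)
    table_avg += (g_new - table[i]) / n         # keep the running average in sync
    table[i] = g_new
```

The estimate g_new - table[i] + table_avg is unbiased because table_avg is the exact mean of the table entries, and its variance vanishes as the table entries approach the gradients at the minimum.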
The stochastic variance reduced gradient method (SVRG),[5] the prototypical snapshot method, uses a similar update, except that instead of the average of a table it uses a full gradient that is reevaluated at a snapshot point \tilde{x} at regular intervals of m \geq n iterations. The update becomes:

x_{k+1} = x_k - \gamma \left[ \nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x}) \right].

This approach requires two stochastic gradient evaluations per step, one to compute \nabla f_i(x_k) and one to compute \nabla f_i(\tilde{x}), whereas table averaging approaches need only one.
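A hedged sketch of the snapshot scheme (the least-squares components, epoch length m, step size, and epoch count are assumptions, not prescribed by the method):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_f_i(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

L_max = (np.linalg.norm(A, axis=1) ** 2).max()
gamma = 1.0 / (4.0 * L_max)              # step size (assumed)

x = np.zeros(d)
for epoch in range(30):
    snapshot = x.copy()                  # snapshot point
    g_snap = full_grad(snapshot)         # full gradient, recomputed once per epoch
    for _ in range(2 * n):               # m = 2n inner iterations (assumed)
        i = rng.integers(n)
        # Two stochastic gradient evaluations per inner step
        v = grad_f_i(x, i) - grad_f_i(snapshot, i) + g_snap
        x = x - gamma * v
```

Only the snapshot and its full gradient are stored between inner steps, which is the memory advantage over table averaging mentioned above.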
Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite sums where each term has a structure that makes computing its convex conjugate or proximal operator tractable. The standard SDCA method[6] considers finite sums that have additional structure compared to the generic finite sum setting:

f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(\langle x, v_i \rangle) + \frac{\lambda}{2} \| x \|^2,

where each f_i is 1-dimensional and each v_i is a data point associated with i. SDCA solves the dual problem:

\max_{\alpha \in \mathbb{R}^n} \; -\frac{1}{n} \sum_{i=1}^{n} f_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i v_i \right\|^2,

by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate \alpha_i, leaving all other coordinates the same. An approximate primal solution x can be recovered from the \alpha values:

x = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i v_i.

This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter.
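For the squared loss f_i(u) = ½(u − b_i)², the per-coordinate maximization has a closed form, which makes a compact sketch possible (the data, λ, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
V = rng.standard_normal((n, d))      # data points v_i
b = rng.standard_normal(n)           # targets
lam = 0.1                            # regularization strength (assumed)

alpha = np.zeros(n)                  # dual variables
x = np.zeros(d)                      # primal iterate, x = (1/(lam*n)) * sum_i alpha_i v_i
sq_norms = np.einsum('ij,ij->i', V, V)

for k in range(20000):
    i = rng.integers(n)
    # Exact maximization of the dual over the single coordinate alpha_i,
    # in closed form for the squared loss; note there is no step size.
    delta = (b[i] - alpha[i] - V[i] @ x) / (1.0 + sq_norms[i] / (lam * n))
    alpha[i] += delta
    x += delta * V[i] / (lam * n)    # keep the primal iterate in sync
```

At a fixed point, delta vanishes for every i, which is exactly the stationarity condition of the regularized primal problem.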
The earliest approaches make use of proximal operators to accelerate convergence, either approximately or exactly.
Direct acceleration approaches have also been developed.[7] The catalyst framework[8] uses any of the standard methods above as an inner optimizer to approximately solve a proximal operator:

x_k \approx \arg\min_{x} \left\{ f(x) + \frac{\kappa}{2} \| x - y_{k-1} \|^2 \right\},

after which it uses an extrapolation step to determine the next y:

y_k = x_k + \beta_k (x_k - x_{k-1}).

The catalyst method's flexibility and simplicity make it a popular baseline approach. It doesn't achieve the optimal rate of convergence among accelerated methods; it is potentially slower by up to a log factor, depending on the hyper-parameters.
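A schematic of the catalyst outer loop (with plain gradient descent standing in for the variance-reduced inner optimizer, and with κ and β held fixed rather than scheduled; every constant here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b) / n

kappa = 1.0     # proximal smoothing parameter (assumed, not tuned)
beta = 0.5      # extrapolation coefficient (assumed constant; the method schedules it)

x_prev = np.zeros(d)
x = np.zeros(d)
y = np.zeros(d)
for outer in range(50):
    # Inner solver: a few gradient steps on the proximal subproblem
    # h(z) = f(z) + (kappa/2) * ||z - y||^2
    z = x.copy()
    for _ in range(20):
        z = z - 0.1 * (grad(z) + kappa * (z - y))
    x_prev, x = x, z
    y = x + beta * (x - x_prev)      # extrapolation step
```

The inner subproblem is better conditioned than f itself (its strong convexity is boosted by κ), which is why an inexact inner solve suffices.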
The Point-SAGA method[9] replaces the gradient operations in SAGA with proximal operator evaluations, resulting in a simple, direct acceleration method:

z_k = x_k + \gamma \left[ g_j - \frac{1}{n} \sum_{i=1}^{n} g_i \right], \qquad x_{k+1} = \operatorname{prox}_j^{\gamma}(z_k),

with the table update g_j = \frac{1}{\gamma}(z_k - x_{k+1}), where \operatorname{prox}_j^{\gamma} is the proximal operator of the jth term:

\operatorname{prox}_j^{\gamma}(y) = \arg\min_{x} \left\{ f_j(x) + \frac{1}{2\gamma} \| x - y \|^2 \right\}.

Unlike other known accelerated methods, Point-SAGA requires only a single iterate sequence x to be maintained between steps, and it has the advantage of only having a single tunable parameter \gamma.
It obtains the optimal accelerated rate of convergence for strongly convex finite-sum minimization without additional log factors.
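A sketch of the iteration for least-squares terms, whose proximal operator has a closed form (the data, γ, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def prox(j, gamma, y):
    # Closed-form prox of f_j(x) = 0.5 * (a_j^T x - b_j)^2:
    # argmin_x f_j(x) + (1/(2*gamma)) * ||x - y||^2
    r = A[j] @ y - b[j]
    return y - gamma * A[j] * r / (1.0 + gamma * (A[j] @ A[j]))

gamma = 0.1                              # the single tunable parameter (assumed)
x = np.zeros(d)
table = np.zeros((n, d))                 # g_j entries, initialized to zero
table_avg = table.mean(axis=0)

for k in range(20000):
    j = rng.integers(n)
    z = x + gamma * (table[j] - table_avg)
    x_new = prox(j, gamma, z)
    g_new = (z - x_new) / gamma          # equals grad f_j(x_new), by prox optimality
    table_avg += (g_new - table[j]) / n
    table[j] = g_new
    x = x_new
```

The table update (z - x_new) / gamma is a gradient at the new iterate rather than the old one, which is the source of the method's acceleration.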