Proximal gradient (forward backward splitting) methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable.
One such example is $\ell_1$ regularization (also known as Lasso) of the form

$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \langle w, x_i\rangle\bigr)^2 + \lambda\|w\|_1, \qquad \text{where } x_i\in\mathbb{R}^d \text{ and } y_i\in\mathbb{R}.$$

Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.[1][2] Such customized penalties can help to induce certain structure in problem solutions, such as sparsity (in the case of lasso) or group structure (in the case of group lasso).
Proximal gradient methods are applicable in a wide variety of scenarios for solving convex optimization problems of the form

$$\min_{x\in\mathcal{H}} F(x) + R(x),$$

where $F$ is convex and differentiable with Lipschitz continuous gradient, $R$ is a convex, lower semicontinuous function which is possibly nondifferentiable, and $\mathcal{H}$ is a Hilbert space.
Let $R:\mathcal{H}\to\mathbb{R}\cup\{+\infty\}$ be a lower semicontinuous, convex function on a vector space $\mathcal{H}$. We define its proximity operator $\operatorname{prox}_R:\mathcal{H}\to\mathcal{H}$ to be the function

$$\operatorname{prox}_R(x) = \operatorname*{arg\,min}_{u\in\mathcal{H}} \left\{ R(u) + \tfrac{1}{2}\|u-x\|^2 \right\}.$$

The general form of Moreau's decomposition states that for any $x\in\mathcal{H}$ and any $\gamma>0$,

$$x = \operatorname{prox}_{\gamma R}(x) + \gamma\,\operatorname{prox}_{R^*/\gamma}(x/\gamma),$$

where $R^*$ is the Fenchel conjugate of $R$; for $\gamma=1$ this reduces to $x = \operatorname{prox}_{R}(x) + \operatorname{prox}_{R^*}(x)$.
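As a concrete illustration of these two definitions, the following minimal NumPy sketch (not drawn from any particular library; the function names are illustrative) computes the proximity operator of $R(x)=\lambda\|x\|_1$, which reduces to componentwise soft thresholding, and numerically checks the Moreau decomposition for $\gamma=1$, where the proximity operator of the conjugate penalty is projection onto the $\ell_\infty$ ball of radius $\lambda$.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximity operator of R(x) = lam * ||x||_1 (componentwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1_conjugate(x, lam):
    """Proximity operator of the Fenchel conjugate of lam * ||.||_1,
    i.e. Euclidean projection onto the l-infinity ball of radius lam."""
    return np.clip(x, -lam, lam)

# Numerical check of the Moreau decomposition with gamma = 1:
# x = prox_R(x) + prox_{R*}(x).
rng = np.random.default_rng(0)
x = rng.normal(size=5)
lam = 0.3
print(np.allclose(x, prox_l1(x, lam) + prox_l1_conjugate(x, lam)))  # True
```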
Consider the regularized empirical risk minimization problem with square loss and with the $\ell_1$ norm as the regularization penalty:

$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \langle w, x_i\rangle\bigr)^2 + \lambda\|w\|_1,$$

where $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$. The $\ell_1$ regularization problem is sometimes referred to as lasso (least absolute shrinkage and selection operator). To solve the problem we consider our objective function in two parts: a convex, differentiable term $F(w) = \frac{1}{n}\sum_{i=1}^n (y_i - \langle w, x_i\rangle)^2$ and a convex, nondifferentiable term $R(w) = \lambda\|w\|_1$. The proximity operator of $R$ is the componentwise soft-thresholding operator,

$$\bigl(S_{\lambda}(w)\bigr)_i = \operatorname{sign}(w_i)\max\bigl(|w_i|-\lambda,\,0\bigr),$$

and a vector $w$ minimizes the objective $F(w)+R(w)$ if and only if it satisfies the fixed point equation $w = \operatorname{prox}_{\gamma R}\bigl(w - \gamma\nabla F(w)\bigr)$ for any $\gamma>0$.[1][6]
To finally solve the lasso problem we consider the fixed point equation shown earlier. Given that we have computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure: fix some initial $w^0\in\mathbb{R}^d$, and for $k = 0, 1, 2, \ldots$ define

$$w^{k+1} = S_{\gamma\lambda}\!\left(w^k - \frac{2\gamma}{n}\sum_{i=1}^n x_i\bigl(\langle w^k, x_i\rangle - y_i\bigr)\right).$$

Note here the effective trade-off between the empirical error term $F(w)$ and the regularization penalty $R(w)$: the iteration decouples the two convex terms into an explicit gradient descent step on $F$ followed by a soft-thresholding step induced by $R$.
Convergence of this fixed point scheme is well-studied in the literature[1][6] and is guaranteed under appropriate choice of step size $\gamma$ and loss function (such as the square loss taken here).
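For concreteness, the following is a minimal sketch of this fixed point iteration for the square loss above (often called ISTA); the constant step size $\gamma = 1/L$, with $L$ a Lipschitz constant of $\nabla F$, and all function names are illustrative choices rather than part of any standard library.

```python
import numpy as np

def soft_threshold(x, tau):
    """Componentwise soft thresholding, the proximity operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Fixed point (proximal gradient) iteration for the lasso objective
    (1/n) * ||X w - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    # Constant step size gamma = 1/L, where L = 2 ||X||_2^2 / n bounds the
    # Lipschitz constant of the gradient of the square-loss term.
    gamma = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)               # gradient step on F
        w = soft_threshold(w - gamma * grad, gamma * lam)  # prox step on R
    return w

# Example usage on synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
print(lasso_ista(X, y, lam=0.1)[:5])
```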
Accelerated methods were introduced by Nesterov in 1983 which improve the rate of convergence under certain regularity assumptions on $F$.[8] For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term $R$, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.[4][9]
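One well-known accelerated variant of this type is FISTA, which augments the proximal step with a Nesterov-style momentum (extrapolation) term. The sketch below is a minimal illustration under the same square-loss setup as the previous example; the momentum schedule shown is the standard one, but the code is only a sketch rather than a definitive implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_fista(X, y, lam, n_iter=500):
    """Accelerated proximal gradient (FISTA-style) iteration for the lasso."""
    n, d = X.shape
    gamma = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # step size 1/L, as before
    w = np.zeros(d)
    z = w.copy()   # extrapolated ("momentum") point
    t = 1.0        # momentum parameter
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ z - y)
        w_next = soft_threshold(z - gamma * grad, gamma * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)  # Nesterov extrapolation
        w, t = w_next, t_next
    return w
```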
There have been numerous developments within the past decade in convex optimization techniques which have influenced the application of proximal gradient methods in statistical learning theory.
Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.[2][10]

In the fixed point iteration scheme

$$w^{k+1} = \operatorname{prox}_{\gamma R}\!\bigl(w^k - \gamma\nabla F(w^k)\bigr),$$

one can allow a variable step size $\gamma_k$ instead of a constant $\gamma$.
Numerous adaptive step size schemes have been proposed throughout the literature.[1][4][11][12] Applications of these schemes[2][13] suggest that they can offer substantial improvement in the number of iterations required for fixed point convergence.
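A common adaptive choice is a backtracking rule that shrinks the step size until a quadratic upper bound on $F$ holds at the proximal point. The sketch below illustrates one such rule for the lasso problem above; the initial step size and shrinkage factor are arbitrary illustrative values, and the function names are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_backtracking(X, y, lam, n_iter=200, gamma0=1.0, beta=0.5):
    """Proximal gradient for the lasso with a backtracking step size rule."""
    n, d = X.shape
    F = lambda w: np.sum((X @ w - y) ** 2) / n
    grad_F = lambda w: (2.0 / n) * X.T @ (X @ w - y)
    w = np.zeros(d)
    gamma = gamma0
    for _ in range(n_iter):
        g = grad_F(w)
        while True:
            p = soft_threshold(w - gamma * g, gamma * lam)
            diff = p - w
            # Accept the step once the quadratic upper bound on F holds:
            # F(p) <= F(w) + <grad F(w), p - w> + ||p - w||^2 / (2 gamma).
            if F(p) <= F(w) + g @ diff + (diff @ diff) / (2.0 * gamma):
                break
            gamma *= beta  # otherwise shrink the step size and retry
        w = p
    return w
```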
Elastic net regularization offers an alternative to pure $\ell_1$ regularization. Because the lasso penalty $R(w)=\lambda\|w\|_1$ is not strictly convex, solutions of the lasso problem need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as an $\ell_2$ norm regularization penalty. For example, one can consider the problem

$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \langle w, x_i\rangle\bigr)^2 + \lambda\bigl((1-\mu)\|w\|_1 + \mu\|w\|_2^2\bigr), \qquad 0<\mu\le 1.$$

The penalty term $\lambda\bigl((1-\mu)\|w\|_1 + \mu\|w\|_2^2\bigr)$ is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small $\mu>0$, the additional term $\mu\|w\|_2^2$ acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.[2][14]
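The proximity operator of the combined penalty also has a closed form (componentwise soft thresholding followed by a uniform shrinkage), so the same fixed point iteration applies. The following minimal sketch assumes the parameterization $\lambda\bigl((1-\mu)\|w\|_1 + \mu\|w\|_2^2\bigr)$ used above; the function name is illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_elastic_net(x, gamma, lam, mu):
    """Proximity operator of R(w) = lam * ((1 - mu) * ||w||_1 + mu * ||w||_2^2):
    soft thresholding followed by a uniform shrinkage toward zero."""
    return soft_threshold(x, gamma * lam * (1.0 - mu)) / (1.0 + 2.0 * gamma * lam * mu)

# In the fixed point scheme, the prox step w = S_{gamma * lam}(...) is simply
# replaced by w = prox_elastic_net(w - gamma * grad, gamma, lam, mu).
```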
Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in statistical learning theory.
Certain problems in learning often involve data which has additional structure that is known a priori.
In the past several years there have been new developments which incorporate information about group structure to provide methods which are tailored to different applications.
In group lasso, the features are partitioned into disjoint groups $w_1,\ldots,w_G$ and the regularization penalty is the sum of $\ell_2$ norms over the groups, $R(w) = \lambda\sum_{g=1}^{G}\|w_g\|_2$. Where the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding on each group: for the $g$-th group,

$$\operatorname{prox}_{\gamma\lambda\|\cdot\|_2}(w_g) = \left(1 - \frac{\gamma\lambda}{\|w_g\|_2}\right)_{+} w_g,$$

so the entire group is set to zero whenever $\|w_g\|_2 \le \gamma\lambda$. In contrast to the lasso case, the derivation of this proximity operator relies on the Moreau decomposition. Here the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the ball of a dual norm.
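A minimal sketch of this group soft-thresholding operation is given below; it assumes the groups are supplied as disjoint index arrays, and the names are illustrative rather than taken from any particular library.

```python
import numpy as np

def prox_group_lasso(w, groups, tau):
    """Block soft thresholding: proximity operator of tau * sum_g ||w_g||_2,
    where `groups` is a list of disjoint index arrays partitioning w."""
    out = np.array(w, dtype=float)
    for g in groups:
        norm_g = np.linalg.norm(out[g])
        # Shrink the whole group toward zero; it is set exactly to zero
        # whenever its norm falls below the threshold tau.
        scale = max(0.0, 1.0 - tau / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * out[g]
    return out

# Example: a 6-dimensional vector split into three groups of two coordinates.
w = np.array([3.0, -4.0, 0.1, 0.2, 1.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(prox_group_lasso(w, groups, tau=0.5))
```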