Łojasiewicz inequality

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function.

Specifically, let ƒ : U → R be a real analytic function on an open set U in Rⁿ, and let Z be the zero locus of ƒ.

Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K,

dist(x, Z)^α ≤ C |ƒ(x)|.

Here α can be large.
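As a concrete illustration (not from the original text; the polynomial, the compact set, and the exponent below are choices made for this sketch), the inequality can be checked numerically for ƒ(x) = x²(x − 1)², whose zero set is Z = {0, 1}, on K = [−2, 3] with α = 2:

import numpy as np

# Illustrative check of dist(x, Z)^alpha <= C * |f(x)| for f(x) = x^2 (x - 1)^2.
# Here Z = {0, 1}, and alpha = 2 works because f vanishes to order 2 at each zero.
def f(x):
    return x**2 * (x - 1)**2

xs = np.linspace(-2.0, 3.0, 10001)           # the compact set K = [-2, 3]
dist_to_Z = np.minimum(np.abs(xs), np.abs(xs - 1.0))

alpha = 2.0
mask = np.abs(f(xs)) > 0                     # avoid dividing by zero on Z itself
C = np.max(dist_to_Z[mask]**alpha / np.abs(f(xs[mask])))
print(f"smallest admissible C on K is about {C:.3f}")
# The inequality dist(x, Z)^alpha <= C * |f(x)| then holds for all x in K.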

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that, for all x ∈ W,

|f(x) − f(p)|^θ ≤ c ‖∇f(x)‖.

This form is known as the Łojasiewicz gradient inequality.

Polyak inequality

A special case of the Łojasiewicz inequality, due to Polyak, is commonly used to prove linear convergence of gradient descent algorithms.

This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Given a differentiable function f : Rⁿ → R, let f* = inf_x f(x) be the value at which f achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value f* exists. The optimization objective is to find some point x in the optimum set argmin f = {x : f(x) = f*}.

The function f is called μ-PL (where "PL" means "Polyak-Łojasiewicz") iff

(1/2) ‖∇f(x)‖² ≥ μ (f(x) − f*)  for all x.
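For a concrete check of this definition (an illustrative sketch only; the quadratic and its parameters are assumptions made here), a strongly convex quadratic f(x) = (1/2) xᵀAx with A positive definite is μ-PL with μ equal to the smallest eigenvalue of A:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite.
# Its global minimum is f* = 0 at x = 0, and it is mu-PL with mu = lambda_min(A).
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)              # positive definite by construction
mu = np.linalg.eigvalsh(A).min()

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

f_star = 0.0
for _ in range(1000):
    x = rng.standard_normal(5)
    lhs = 0.5 * np.linalg.norm(grad_f(x))**2
    rhs = mu * (f(x) - f_star)
    assert lhs >= rhs - 1e-9         # PL inequality holds at every sampled point
print("PL inequality verified at 1000 random points, mu =", round(mu, 4))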

If f is μ-PL, then ‖x − x_p‖ ≤ √(2(f(x) − f*)/μ), where x_p is the point on the optimum set that is closest to x; this bound follows from the gradient flow argument below.

By definition, every stationary point of a μ-PL function is a global minimum: if ∇f(x) = 0, the PL inequality forces f(x) ≤ f*, hence f(x) = f*.

If f is μ-PL, then following the normalized gradient flow ẋ = −∇f(x)/‖∇f(x)‖, that is, moving against the gradient with constant unit velocity, starting at a point x_0 with f(x_0) > f*, the gradient flow terminates on the zero set of ∇f (equivalently, on the optimum set) after traveling a path length of at most √(2(f(x_0) − f*)/μ).
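The path-length bound can be checked numerically (a rough sketch; the quadratic, the step size, and the stopping tolerance are assumptions made here): discretizing the unit-speed flow with Euler steps, the traveled length stays below √(2(f(x_0) − f*)/μ).

import numpy as np

# Illustrative check of the path-length bound sqrt(2*(f(x0) - f*)/mu) for the
# unit-speed gradient flow of f(x) = 0.5 * x^T A x (f* = 0, mu = lambda_min(A)).
A = np.diag([1.0, 4.0])
mu = 1.0                                    # smallest eigenvalue of A

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x = np.array([1.0, 1.0])
bound = np.sqrt(2.0 * f(x) / mu)

h, path_length = 1e-3, 0.0
for _ in range(100_000):                    # Euler steps of the unit-speed flow
    if f(x) <= 1e-4:                        # stop once essentially on the optimum set
        break
    g = grad_f(x)
    x = x - h * g / np.linalg.norm(g)       # move against the gradient at unit speed
    path_length += h

print(f"path length ~ {path_length:.3f}, bound = {bound:.3f}")  # path length <= bound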

Theorem (linear convergence of gradient descent) — If f is μ-PL and its gradient ∇f is L-Lipschitz, then gradient descent with constant step size 1/L, that is, x_{t+1} = x_t − (1/L)∇f(x_t), satisfies

f(x_{t+1}) − f* ≤ (1 − μ/L)(f(x_t) − f*),

and therefore f(x_t) − f* ≤ (1 − μ/L)^t (f(x_0) − f*): the suboptimality gap decays geometrically.

Proof: By L-smoothness, f(x_{t+1}) ≤ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2)‖x_{t+1} − x_t‖². Plugging in the gradient descent step x_{t+1} − x_t = −(1/L)∇f(x_t), this becomes f(x_{t+1}) ≤ f(x_t) − (1/(2L))‖∇f(x_t)‖². Applying the PL inequality (1/2)‖∇f(x_t)‖² ≥ μ(f(x_t) − f*) and subtracting f* from both sides gives the claimed contraction.

Under the same conditions, gradient descent with optimal step size (which might be found by line-searching) satisfies the same rate, f(x_{t+1}) − f* ≤ (1 − μ/L)(f(x_t) − f*), since an exact line search decreases f at least as much as the step of size 1/L does.

If the PL inequality and the Lipschitz bound on the gradient hold uniformly, then perform gradient descent by x_{t+1} = x_t − (1/L)∇f(x_t); each step contracts the gap f(x_t) − f* by the factor (1 − μ/L). Iterating the process, we have the desired result.
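As a numerical illustration of this rate (a minimal sketch; the objective, dimensions, and seed below are assumptions, not taken from the text), the following compares the observed gap f(x_t) − f* of gradient descent with step size 1/L against the guaranteed envelope (1 − μ/L)^t (f(x_0) − f*):

import numpy as np

rng = np.random.default_rng(1)

# Illustrative mu-PL (but not strongly convex) objective: f(x) = 0.5 * ||M x||^2.
# Here A = M^T M is rank-deficient, f* = 0, L = lambda_max(A), and the PL
# constant mu is the smallest nonzero eigenvalue of A.
M = rng.standard_normal((3, 5))
A = M.T @ M
eigs = np.linalg.eigvalsh(A)
L = eigs[-1]
mu = min(e for e in eigs if e > 1e-9)

def f(x):
    return 0.5 * x @ A @ x

x = rng.standard_normal(5)
gap0 = f(x)
for t in range(1, 51):
    x = x - (1.0 / L) * (A @ x)                # gradient descent with step size 1/L
    envelope = (1 - mu / L) ** t * gap0        # guaranteed bound from the theorem
    assert f(x) <= envelope + 1e-12            # observed gap never exceeds the bound
print("linear convergence verified for 50 steps; final gap:", f(x))

The matrix A is deliberately rank-deficient, so f is μ-PL without being strongly convex; the PL condition alone is enough for the linear rate.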

In stochastic gradient descent, we have a function to minimize f(x), but we cannot sample its gradient directly. Instead, at each step we sample the gradient ∇f_i(x) of a randomly chosen term, where

f(x) = (1/N) Σ_{i=1}^{N} f_i(x)

is the average loss over all training data points. The update is x_{t+1} = x_t − η_t ∇f_{i_t}(x_t), where η_t is the learning rate and i_t is drawn uniformly at random, so that the sampled gradient is an unbiased estimate of the full gradient: E_i[∇f_i(x)] = ∇f(x).
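A minimal sketch of this setup (illustrative only; the least-squares losses, dimensions, and learning rate are assumptions made here, not taken from the text): f is the average of per-example losses, SGD touches a single sampled term per step, and averaging the sampled gradients recovers the full gradient.

import numpy as np

rng = np.random.default_rng(3)

# Average loss over N "training points": f(x) = (1/N) sum_i f_i(x),
# here with illustrative least-squares terms f_i(x) = 0.5 * (a_i . x - b_i)^2.
N, d = 50, 3
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_f_i(x, i):                      # gradient of one randomly sampled term
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                        # gradient of the average loss (not sampled in SGD)
    return A.T @ (A @ x - b) / N

x, eta = np.zeros(d), 0.1
i = rng.integers(N)
x_next = x - eta * grad_f_i(x, i)        # one SGD step uses a single data point
# Averaged over i, the sampled gradient equals the full gradient:
print(np.allclose(np.mean([grad_f_i(x, j) for j in range(N)], axis=0), full_grad(x)))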

By L-smoothness, plugging in the SGD step and taking the expectation over the sampled index (conditioned on x_t), a single step satisfies

E[f(x_{t+1})] ≤ f(x_t) − η_t ‖∇f(x_t)‖² + (L η_t²/2) E[‖∇f_i(x_t)‖²].

We can also write it using the variance of gradient L2 norm:

E[f(x_{t+1})] ≤ f(x_t) − η_t (1 − L η_t/2) ‖∇f(x_t)‖² + (L η_t²/2) E[‖∇f_i(x_t) − ∇f(x_t)‖²].

The second equation is shown similarly by noting that E[‖∇f_i(x)‖²] = ‖∇f(x)‖² + E[‖∇f_i(x) − ∇f(x)‖²].
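The identity used for the second form can be sanity-checked numerically (an illustrative sketch; the data and the least-squares loss are made up):

import numpy as np

rng = np.random.default_rng(2)

# Per-example least-squares losses f_i(x) = 0.5 * (a_i . x - b_i)^2,
# with f(x) = (1/N) * sum_i f_i(x). Check:
#   mean_i ||grad f_i||^2 == ||grad f||^2 + mean_i ||grad f_i - grad f||^2
N, d = 100, 5
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
x = rng.standard_normal(d)

per_example_grads = (A @ x - b)[:, None] * A     # row i is grad f_i(x)
full_grad = per_example_grads.mean(axis=0)       # grad f(x)

second_moment = np.mean(np.sum(per_example_grads**2, axis=1))
variance_term = np.mean(np.sum((per_example_grads - full_grad)**2, axis=1))
lhs, rhs = second_moment, np.sum(full_grad**2) + variance_term
print(abs(lhs - rhs) < 1e-10)                    # True: the decomposition holds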

These bounds can be made easier to use under some further assumptions.

The second moment on the right can be removed by assuming a uniform upper bound E[‖∇f_i(x)‖²] ≤ C² for all x. Combined with the PL inequality (1/2)‖∇f(x_t)‖² ≥ μ(f(x_t) − f*), the first bound becomes

E[f(x_{t+1})] − f* ≤ (1 − 2μη_t)(f(x_t) − f*) + (L η_t² C²)/2.

For a constant learning rate schedule, with η_t = η < 1/(2μ), iterating this bound and summing the geometric series gives

E[f(x_t)] − f* ≤ (1 − 2μη)^t (f(x_0) − f*) + (L η C²)/(4μ),

so the expected gap decays linearly down to a noise floor of order L η C²/(4μ).

In short, because the gradient descent steps are too large relative to the remaining suboptimality, the variance in the stochastic gradient starts to dominate, and x_t starts doing a random walk in the vicinity of the optimum set argmin f.

For a decreasing learning rate schedule with η_t decaying on the order of 1/t, the noise term is damped over time as well, and the same argument gives a sublinear rate E[f(x_t)] − f* = O(1/t).
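The two schedules can be contrasted in a small experiment (a sketch under assumed problem data, step sizes, and iteration counts, none of which come from the text): with a constant learning rate the gap stalls near a noise floor proportional to η, while a roughly 1/t schedule keeps shrinking it.

import numpy as np

rng = np.random.default_rng(4)

# Average least-squares loss f(x) = (1/N) sum_i 0.5*(a_i . x - b_i)^2 (mu-PL).
N, d = 200, 10
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)   # noisy targets
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = 0.5 * np.mean((A @ x_star - b) ** 2)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def run_sgd(schedule, steps=20_000):
    x = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(N)
        x = x - schedule(t) * (A[i] @ x - b[i]) * A[i]   # SGD step on one sample
    return f(x) - f_star

gap_const = run_sgd(lambda t: 0.01)                      # constant eta: stalls near a noise floor
gap_decay = run_sgd(lambda t: 0.01 / (1.0 + t / 200.0))  # ~1/t decay: keeps improving
print(f"final gap, constant eta: {gap_const:.2e}")
print(f"final gap, decaying eta: {gap_decay:.2e}")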