Vapnik–Chervonenkis theory

Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization.

In addition, VC theory and the VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably, these are the most important applications of VC theory, and they are employed in proving generalization.

Several techniques will be introduced below that are widely used in empirical process theory and VC theory.

The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics.

Let $X_1, \ldots, X_n$ be random elements defined on a measurable space $(\mathcal{X}, \mathcal{A})$. For a measure $Q$ on $(\mathcal{X}, \mathcal{A})$ and a measurable function $f : \mathcal{X} \to \mathbf{R}$, write $Qf = \int f \, dQ$, and for a class $\mathcal{F}$ of measurable functions define

$$\|Q\|_{\mathcal{F}} = \sup \{ |Qf| : f \in \mathcal{F} \}.$$

The empirical measure is $\mathbb{P}_n = n^{-1} \sum_{i=1}^{n} \delta_{X_i}$, where $\delta$ denotes the Dirac measure. It induces a map $\mathcal{F} \to \mathbf{R}$ given by

$$f \mapsto \mathbb{P}_n f.$$

Now suppose $P$ is the underlying true distribution of the data, which is unknown. Empirical process theory aims at identifying classes $\mathcal{F}$ for which statements such as the following hold:

- a uniform law of large numbers, $\|\mathbb{P}_n - P\|_{\mathcal{F}} \to 0$;
- a uniform central limit theorem, $\mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}$ in $\ell^{\infty}(\mathcal{F})$, where $\mathbb{G}$ is a tight Gaussian process.

In the former case $\mathcal{F}$ is called a Glivenko–Cantelli class, and in the latter case it is called a Donsker (or $P$-Donsker) class. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem.

These statements are true for a single $f$ by standard LLN and CLT arguments under regularity conditions; the difficulty in empirical processes comes in because joint statements are being made for all $f \in \mathcal{F}$. Intuitively, then, the set $\mathcal{F}$ cannot be too large, and it turns out that the geometry of $\mathcal{F}$ plays a very important role.
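For a concrete illustration (a minimal sketch of my own, not taken from the source), take $\mathcal{F}$ to be the class of half-line indicators $\{ 1_{(-\infty, t]} : t \in \mathbf{R} \}$; then $\|\mathbb{P}_n - P\|_{\mathcal{F}}$ is the Kolmogorov–Smirnov statistic, and the classical Glivenko–Cantelli theorem says it converges to 0. The Python snippet below, assuming standard normal data, simply estimates this supremum for growing $n$.

```python
# A minimal numerical sketch (not from the source): for the class
# F = {1{(-inf, t]} : t in R} of half-line indicators, ||P_n - P||_F is the
# Kolmogorov-Smirnov statistic sup_t |F_n(t) - F(t)|, which should shrink
# as n grows because the class is Glivenko-Cantelli.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_deviation(n):
    """sup_t |F_n(t) - F(t)| for an i.i.d. N(0, 1) sample of size n."""
    x = np.sort(rng.standard_normal(n))
    ecdf_right = np.arange(1, n + 1) / n   # F_n(x_(i))
    ecdf_left = np.arange(0, n) / n        # F_n just below x_(i)
    cdf = norm.cdf(x)                      # true F(x_(i))
    return max(np.max(np.abs(ecdf_right - cdf)),
               np.max(np.abs(ecdf_left - cdf)))

for n in [100, 1000, 10000]:
    print(n, sup_deviation(n))             # decreases roughly like 1/sqrt(n)
```

The printed suprema shrink roughly like $1/\sqrt{n}$, consistent with this class also being Donsker.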

One way of measuring how big the function set $\mathcal{F}$ is, is to use so-called covering numbers. The covering number

$$N(\varepsilon, \mathcal{F}, \|\cdot\|)$$

is the minimal number of balls $\{ g : \|g - f\| < \varepsilon \}$ needed to cover the set $\mathcal{F}$ (here it is assumed that there is an underlying norm on $\mathcal{F}$). The entropy is the logarithm of the covering number.

Two sufficient conditions are provided below, under which it can be proved that the set $\mathcal{F}$ is Glivenko–Cantelli or Donsker.

A class $\mathcal{F}$ is $P$-Glivenko–Cantelli if it is $P$-measurable with envelope $F$ (a measurable function with $|f| \le F$ for every $f \in \mathcal{F}$) such that $P^{*}F < \infty$ and satisfies

$$\sup_{Q} N\big(\varepsilon \|F\|_{Q}, \mathcal{F}, L_1(Q)\big) < \infty \quad \text{for every } \varepsilon > 0,$$

where the supremum ranges over finitely discrete probability measures $Q$. The next condition is a version of the celebrated Dudley's theorem. If $\mathcal{F}$ is a class of functions such that

$$\int_{0}^{\infty} \sup_{Q} \sqrt{\log N\big(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q)\big)} \, d\varepsilon < \infty,$$

then $\mathcal{F}$ is $P$-Donsker for every probability measure $P$ with $P^{*}F^{2} < \infty$. In the last integral, the notation means

$$\|f\|_{Q,2} = \left( \int |f|^{2} \, dQ \right)^{1/2}.$$

The majority of the arguments for bounding the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine-learning proofs on bounding empirical loss functions (including the proof of the VC inequality discussed below), it is presented here.
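To make the covering-number notion concrete, here is a small Python sketch of my own (the sample, the grid of interval indicators, and the greedy procedure are illustrative choices, not from the source): it computes an upper bound on $N(\varepsilon, \mathcal{F}, L_1(Q_n))$ by greedily selecting ball centers, with $Q_n$ an empirical measure.

```python
# An illustrative sketch (my own construction): upper-bound the covering
# number N(eps, F, L1(Q_n)) by a greedy cover.  Here Q_n is the empirical
# measure of 200 uniform points and F is a finite grid of interval
# indicators 1{[a, b]} -- both are arbitrary choices for the demo.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)            # sample points carrying Q_n

grid = np.linspace(0, 1, 21)
funcs = np.array([(x >= a) & (x <= b)
                  for i, a in enumerate(grid)
                  for b in grid[i:]], dtype=float)

def greedy_cover_size(funcs, eps):
    """Greedy upper bound on N(eps, F, L1(Q_n))."""
    remaining = np.arange(len(funcs))
    count = 0
    while remaining.size > 0:
        center = funcs[remaining[0]]
        dists = np.mean(np.abs(funcs[remaining] - center), axis=1)  # L1(Q_n) distance
        remaining = remaining[dists >= eps]   # drop everything covered by this ball
        count += 1
    return count

for eps in [0.4, 0.2, 0.1, 0.05]:
    print(eps, greedy_cover_size(funcs, eps))   # grows slowly as eps shrinks
```

The count grows only polynomially in $1/\varepsilon$; for VC-type classes this kind of slow growth is what makes the entropy conditions above hold.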

Lemma (Symmetrization). For every nondecreasing, convex $\Phi : \mathbf{R} \to \mathbf{R}$ and class of measurable functions $\mathcal{F}$,

$$\mathrm{E}\,\Phi\big( \|\mathbb{P}_n - P\|_{\mathcal{F}} \big) \le \mathrm{E}\,\Phi\big( 2 \|\mathbb{P}_n^{0}\|_{\mathcal{F}} \big),$$

where $\mathbb{P}_n^{0} f = n^{-1} \sum_{i=1}^{n} \varepsilon_i f(X_i)$ is the symmetrized (Rademacher) empirical process, and $\varepsilon_1, \ldots, \varepsilon_n$ are independent random signs taking the values $\pm 1$ with probability $1/2$ each, independently of the data.

The proof of the Symmetrization lemma relies on introducing independent copies of the original variables $X_i$ (sometimes referred to as a ghost sample) and replacing the inner expectation of the LHS by these copies.

After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation.

In more detail, let $Y_1, \ldots, Y_n$ be the ghost sample, i.e. independent copies of $X_1, \ldots, X_n$. For fixed values of $X_1, \ldots, X_n$,

$$\|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^{n} \big[ f(X_i) - \mathrm{E} f(Y_i) \big] \right| \le \mathrm{E}_{Y} \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^{n} \big[ f(X_i) - f(Y_i) \big] \right|,$$

and applying $\Phi$, using Jensen's inequality, and then taking the expectation with respect to $X$ gives

$$\mathrm{E}\,\Phi\big( \|\mathbb{P}_n - P\|_{\mathcal{F}} \big) \le \mathrm{E}_{X} \mathrm{E}_{Y}\, \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} \big[ f(X_i) - f(Y_i) \big] \right\|_{\mathcal{F}} \right).$$

Note that adding a minus sign in front of a term $f(X_i) - f(Y_i)$ doesn't change the RHS, because it is a symmetric function of $X_i$ and $Y_i$. Therefore, the RHS remains the same under "sign perturbation":

$$\mathrm{E}\,\Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} e_i \big[ f(X_i) - f(Y_i) \big] \right\|_{\mathcal{F}} \right)$$

for any $(e_1, \ldots, e_n) \in \{-1, +1\}^{n}$. Averaging over independent Rademacher signs $\varepsilon_i$, then splitting the sum by the triangle inequality and using convexity of $\Phi$, yields the bound of the lemma.
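The inequality can also be checked numerically. Below is a Monte Carlo sketch of my own (with $\Phi$ taken to be the identity, which is nondecreasing and convex, and half-line indicators as the class $\mathcal{F}$); the sample size, data distribution, and number of repetitions are arbitrary choices.

```python
# A Monte Carlo sketch (my own, with Phi(x) = x, which is nondecreasing and
# convex): check E ||P_n - P||_F <= 2 E ||P_n^0||_F for the class of
# half-line indicators, using standard normal data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 500, 200

def emp_dev(x):
    """||P_n - P||_F = sup_t |F_n(t) - F(t)|."""
    xs = np.sort(x)
    cdf = norm.cdf(xs)
    up = np.arange(1, n + 1) / n
    lo = np.arange(0, n) / n
    return max(np.max(np.abs(up - cdf)), np.max(np.abs(lo - cdf)))

def rademacher_dev(x):
    """||P_n^0||_F = sup_t |n^{-1} sum_i eps_i 1{x_i <= t}|."""
    eps = rng.choice([-1.0, 1.0], size=n)
    partial = np.cumsum(eps[np.argsort(x)]) / n   # value at t = x_(k)
    return np.max(np.abs(np.concatenate(([0.0], partial))))

lhs = np.mean([emp_dev(rng.standard_normal(n)) for _ in range(reps)])
rhs = 2 * np.mean([rademacher_dev(rng.standard_normal(n)) for _ in range(reps)])
print(lhs, rhs)   # the first value should not exceed the second (up to MC error)
```

Up to Monte Carlo error, the first printed number stays below the second, as the lemma predicts.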

It turns out that there is a fascinating connection between certain combinatorial properties of the set $\mathcal{F}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets, or shortly VC sets.

Consider a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$. $\mathcal{C}$ is said to pick out a certain subset $W$ of the finite set $S = \{x_1, \ldots, x_n\} \subset \mathcal{X}$ if $W = S \cap C$ for some $C \in \mathcal{C}$. $\mathcal{C}$ is said to shatter $S$ if it picks out each of its $2^{n}$ subsets. The VC-index (similar to VC dimension + 1 for an appropriately chosen classifier set) $V(\mathcal{C})$ of $\mathcal{C}$ is the smallest $n$ for which no set of size $n$ is shattered by $\mathcal{C}$.

Sauer's lemma then states that the number $\Delta_n(\mathcal{C}, x_1, \ldots, x_n)$ of subsets picked out by a VC class $\mathcal{C}$ satisfies

$$\max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n) \le \sum_{j=0}^{V(\mathcal{C}) - 1} \binom{n}{j} \le \left( \frac{n e}{V(\mathcal{C}) - 1} \right)^{V(\mathcal{C}) - 1},$$

which is a polynomial number, $O(n^{V(\mathcal{C}) - 1})$, of subsets rather than an exponential number. Intuitively, a finite VC-index means that $\mathcal{C}$ has a comparatively simple structure.
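As a sanity check (my own example, not from the source), take $\mathcal{C}$ to be the closed intervals $[a, b]$ on the real line, for which $V(\mathcal{C}) = 3$ (VC dimension 2). The number of subsets picked out of $n$ distinct points can be counted directly and compared with the Sauer bound:

```python
# A sanity check of Sauer's lemma (my own example): for the class C of
# closed intervals [a, b] on the line, V(C) = 3.  Compare the number of
# subsets of n points picked out by C with the Sauer bound.
from math import comb

def picked_out_by_intervals(n):
    """Subsets of n distinct points cut out by intervals: the empty set
    plus every contiguous run of points."""
    return 1 + n * (n + 1) // 2

def sauer_bound(n, vc_index):
    """sum_{j=0}^{V-1} C(n, j)."""
    return sum(comb(n, j) for j in range(vc_index))

for n in [1, 2, 5, 10, 50]:
    print(n, picked_out_by_intervals(n), sauer_bound(n, 3))
```

For intervals the two counts coincide, i.e. the Sauer bound $1 + n + \binom{n}{2} = 1 + n(n+1)/2$ is attained with equality.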

A similar bound can be shown (with a different constant, same rate) for the so-called VC subgraph classes. For a function $f : \mathcal{X} \to \mathbf{R}$, the subgraph is the subset of $\mathcal{X} \times \mathbf{R}$ given by $\{(x, t) : t < f(x)\}$. A collection $\mathcal{F}$ is called a VC subgraph class if all subgraphs form a VC class of sets.

Consider the set of indicator functions $\mathcal{I}_{\mathcal{C}} = \{ 1_{C} : C \in \mathcal{C} \}$ of a VC class of sets $\mathcal{C}$. It can then be shown that, quite remarkably, for any probability measure $Q$ and any $r \ge 1$,

$$N\big(\varepsilon, \mathcal{I}_{\mathcal{C}}, L_r(Q)\big) \le K\, V(\mathcal{C})\, (4e)^{V(\mathcal{C})}\, \varepsilon^{-r(V(\mathcal{C}) - 1)}, \qquad 0 < \varepsilon < 1,$$

for a universal constant $K$. The important consequence of this fact is that the entropy $\log N$ grows only like a constant multiple of $\log(1/\varepsilon)$, uniformly in $Q$, which is just enough so that the entropy integral is going to converge, and therefore the class $\mathcal{I}_{\mathcal{C}}$ is going to be $P$-Donsker.

Finally, consider an example of a VC subgraph class: any finite-dimensional vector space $\mathcal{F}$ of measurable functions $f : \mathcal{X} \to \mathbf{R}$ is a VC subgraph class of index at most $\dim(\mathcal{F}) + 2$. To see this, take $n = \dim(\mathcal{F}) + 2$ points $(x_1, t_1), \ldots, (x_n, t_n)$ in $\mathcal{X} \times \mathbf{R}$. The vectors $\big(f(x_1) - t_1, \ldots, f(x_n) - t_n\big)$, $f \in \mathcal{F}$, lie in a subspace of $\mathbf{R}^{n}$ of dimension at most $n - 1$, so there is a vector $a \neq 0$ orthogonal to this subspace, i.e.

$$\sum_{i : a_i > 0} a_i \big( f(x_i) - t_i \big) = \sum_{i : a_i < 0} (-a_i) \big( f(x_i) - t_i \big) \quad \text{for all } f \in \mathcal{F}.$$

The set $S = \{(x_i, t_i) : a_i > 0\}$ cannot be picked out by the subgraphs: if there were some $f$ whose subgraph picks out exactly $S$, i.e. $t_i < f(x_i)$ precisely when $a_i > 0$, that would imply that the LHS is strictly positive but the RHS is non-positive.

There are generalizations of the notion of a VC subgraph class, e.g., the notion of pseudo-dimension.[4]

A similar setting is now considered, one that is more common in machine learning. Let $\mathcal{X}$ be a feature space and $\mathcal{Y} = \{0, 1\}$. A function $f : \mathcal{X} \to \mathcal{Y}$ is called a classifier, and $\mathcal{F}$ denotes a set of classifiers. Define the shattering coefficient (also known as the growth function)

$$S(\mathcal{F}, n) = \max_{x_1, \ldots, x_n} \big| \{ (f(x_1), \ldots, f(x_n)) : f \in \mathcal{F} \} \big|.$$

Note that there is a one-to-one mapping between each classifier $f \in \mathcal{F}$ and the set on which it equals 1; let $\mathcal{C}$ be the collection of sets obtained from this mapping. Therefore, in terms of the previous section, the shattering coefficient is precisely $\max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n)$. This equivalence together with Sauer's lemma implies that $S(\mathcal{F}, n)$ is polynomial in $n$, for sufficiently large $n$, provided that the collection $\mathcal{C}$ has a finite VC-index.
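For a brute-force illustration (my own sketch, not from the source), take $\mathcal{F}$ to be the one-dimensional threshold classifiers $f_t(x) = \mathbb{I}(x \ge t)$. Their shattering coefficient is $S(\mathcal{F}, n) = n + 1$, i.e. it grows polynomially, as Sauer's lemma predicts (the corresponding VC-index is 2):

```python
# A brute-force sketch (my own): the shattering coefficient S(F, n) of
# one-dimensional threshold classifiers f_t(x) = 1{x >= t} equals n + 1,
# i.e. it grows polynomially, matching Sauer's lemma for VC-index 2.
import numpy as np

def shattering_coefficient(x):
    """Count distinct label patterns (f_t(x_1), ..., f_t(x_n)) over all t."""
    thresholds = np.concatenate((np.sort(x), [np.inf]))
    patterns = {tuple((x >= t).astype(int)) for t in thresholds}
    return len(patterns)

rng = np.random.default_rng(3)
for n in [1, 3, 5, 10, 25]:
    x = rng.uniform(size=n)
    print(n, shattering_coefficient(x))   # prints n + 1 for distinct points
```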

Let $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be an observed dataset. Assume that the data is generated by an unknown probability distribution $P_{XY}$. Define the expected 0/1 loss of a classifier $f$ as $R(f) = P\big(f(X) \neq Y\big)$. Since $P_{XY}$ is unknown, $R(f)$ cannot be computed directly, but the empirical risk

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big( f(X_i) \neq Y_i \big)$$

can certainly be evaluated. Then one has the following Theorem:

Theorem (VC inequality). For binary classification and the 0/1 loss function we have the following generalization bounds:

$$P\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \varepsilon \right) \le 8\, S(\mathcal{F}, n)\, e^{-n \varepsilon^{2} / 32},$$

$$\mathrm{E}\left[ \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| \right] \le 2 \sqrt{ \frac{ \log S(\mathcal{F}, n) + \log 2 }{ n } }.$$

In words, the VC inequality is saying that as the sample size increases, provided that $\mathcal{F}$ has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expected 0/1 risk. Note that both RHS of the two inequalities will converge to 0, provided that $S(\mathcal{F}, n)$ grows polynomially in $n$.
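To see the bounds in action, here is a small numerical sketch of my own (not from the source): it plugs the shattering coefficient $S(\mathcal{F}, n) = n + 1$ of one-dimensional threshold classifiers into the two right-hand sides above, for a fixed $\varepsilon$ and growing $n$.

```python
# A numerical sketch (my own): plug the shattering coefficient
# S(F, n) = n + 1 of 1-D threshold classifiers into the right-hand sides of
# the two bounds above and watch both shrink as n grows.
import numpy as np

def prob_bound(n, eps, shatter):
    """8 S(F, n) exp(-n eps^2 / 32)."""
    return 8 * shatter * np.exp(-n * eps**2 / 32)

def expectation_bound(n, shatter):
    """2 sqrt((log S(F, n) + log 2) / n)."""
    return 2 * np.sqrt((np.log(shatter) + np.log(2)) / n)

eps = 0.1
for n in [10**3, 10**4, 10**5, 10**6]:
    s = n + 1
    print(n, prob_bound(n, eps, s), expectation_bound(n, s))
```

Both quantities tend to 0 as $n \to \infty$, illustrating the convergence statement above.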

The connection between this framework and the empirical process framework is evident. Here one is dealing with a modified empirical process

$$\sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big|,$$

but not surprisingly the ideas are the same. The proof of (the first part of) the VC inequality relies on symmetrization, followed by an argument conditional on the data using concentration inequalities (in particular, Hoeffding's inequality).