In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of sets with respect to a probability distribution.
The concept can also be extended to real-valued functions.
Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:

$\operatorname{Rad}(A) := \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{a \in A} \sum_{i=1}^m \sigma_i a_i \right]$

where $\sigma_1, \sigma_2, \dots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \dots, m$, and $a = (a_1, \dots, a_m)$.

Let $S = \{z_1, z_2, \dots, z_m\} \subset Z$ be a sample of points and consider a function class $\mathcal{F}$ of real-valued functions over $Z$. Then, the empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$\operatorname{Rad}_S(\mathcal{F}) := \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(z_i) \right]$

This can also be written using the previous definition:[2]: 326

$\operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)$

where $\mathcal{F} \circ S$ denotes function composition, i.e. $\mathcal{F} \circ S := \{ (f(z_1), \dots, f(z_m)) \mid f \in \mathcal{F} \}$. Given a probability distribution $P$ over $Z$, the Rademacher complexity of $\mathcal{F}$ with respect to $P$ for sample size $m$ is $\operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m}[\operatorname{Rad}_S(\mathcal{F})]$, where the expectation is taken over an i.i.d. sample $S = (z_1, \dots, z_m)$ drawn according to $P$.
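For intuition, the empirical Rademacher complexity of a finite set of value vectors can be estimated by straightforward Monte Carlo simulation. The following Python sketch is only an illustration (the function name and the example vectors are hypothetical, not from any library): it averages the supremum of $\langle \sigma, a \rangle$ over many random sign vectors $\sigma$. For a function class $\mathcal{F}$ and a sample $S$, the same estimator would be applied to the matrix of values $\mathcal{F} \circ S$.

    import numpy as np

    def empirical_rademacher(A, n_draws=10000, seed=0):
        """Monte Carlo estimate of Rad(A) = (1/m) E_sigma[ sup_{a in A} <sigma, a> ]
        for a finite set A given as an (N, m) array of vectors."""
        rng = np.random.default_rng(seed)
        A = np.asarray(A, dtype=float)
        n_vectors, m = A.shape
        sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))  # random sign vectors
        suprema = np.max(sigmas @ A.T, axis=1)               # supremum over the finite set, per draw
        return suprema.mean() / m

    # Example: value vectors of three hypothetical classifiers evaluated on m = 5 points
    A = [[ 1,  1, -1,  1, -1],
         [-1,  1,  1, -1,  1],
         [ 1, -1,  1,  1,  1]]
    print(empirical_rademacher(A))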
The Rademacher complexity is typically applied to a function class of models that are used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings.
When the function class is rich enough, it contains functions that can appropriately adapt to each arrangement of labels, simulated by the random draw of $\sigma_i$ under the expectation, so that the quantity inside the expectation is maximised.
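This intuition can be made concrete with a small simulation: the class of all $\pm 1$ labelings of $m$ points matches every random sign vector exactly, while a single fixed labeling cannot adapt at all. The sketch below is a hypothetical illustration of this contrast, not part of any standard library.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    m = 8

    # A_rich: every possible +-1 labeling of m points; it can match any sigma exactly.
    A_rich = np.array(list(product([-1.0, 1.0], repeat=m)))
    # A_single: one fixed labeling; it cannot adapt to sigma at all.
    A_single = A_rich[:1]

    def rad(A, n_draws=5000):
        sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
        return np.mean(np.max(sigmas @ A.T, axis=1)) / m

    print(rad(A_rich))    # close to 1: the rich class fits every random labeling
    print(rad(A_single))  # close to 0: a single hypothesis cannot adapt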
The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes.
Intuitively, a function-class with smaller Rademacher complexity is easier to learn.
In machine learning, it is desired to have a training set that represents the true distribution of the sample data $S$. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and denote by $F$ the corresponding set of error functions, i.e., for every hypothesis $h \in H$ there is a function $f_h \in F$ that maps each training sample (features, label) to the error of the classifier $h$ (note in this case hypothesis and classifier are used interchangeably). Define:

$L_P(h) := \mathbb{E}_{z \sim P}[f_h(z)]$, the expected error of the hypothesis $h$ on the true distribution $P$;

$L_S(h) := \frac{1}{m} \sum_{i=1}^m f_h(z_i)$, the estimated error of the hypothesis $h$ on the sample $S = (z_1, \dots, z_m)$.

The representativeness of the sample $S$, with respect to $P$ and $H$, is defined as:

$\operatorname{Rep}_P(H, S) := \sup_{h \in H} \left( L_P(h) - L_S(h) \right)$

Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low.
Note however that the concept of representativeness is relative and hence cannot be compared across distinct samples.
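As a concrete, purely illustrative example, when $H$ is a finite class of threshold classifiers on $[0,1]$ and the true distribution is known, the representativeness of a sample can be computed directly from its definition. All names and the chosen distribution in the sketch below are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy distribution: x uniform on [0, 1], noiseless label y = 1[x > 0.3].
    thresholds = np.linspace(0, 1, 101)        # finite hypothesis class H
    xs = rng.uniform(0, 1, size=50)            # training sample S
    ys = (xs > 0.3).astype(int)

    def true_error(t):
        # L_P(h_t) for h_t(x) = 1[x > t]: the probability mass between t and 0.3.
        return abs(t - 0.3)

    def empirical_error(t):
        # L_S(h_t): fraction of sample points misclassified by h_t.
        return np.mean((xs > t).astype(int) != ys)

    rep = max(true_error(t) - empirical_error(t) for t in thresholds)
    print("representativeness Rep_P(H, S):", rep)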
The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:[2]: 326

$\mathbb{E}_{S \sim P^m}[\operatorname{Rep}_P(H, S)] \leq 2 \cdot \operatorname{Rad}_{P,m}(F)$

When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization.
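The bound $\mathbb{E}_S[\operatorname{Rep}_P(H,S)] \leq 2 \operatorname{Rad}_{P,m}(F)$ can be checked by simulation for a small finite class. The sketch below is a hypothetical setup continuing the threshold-classifier example above; it estimates both sides from many samples.

    import numpy as np

    rng = np.random.default_rng(2)
    thresholds = np.linspace(0, 1, 21)        # finite hypothesis class H
    m, n_samples, n_sigma = 20, 200, 200

    reps, rads = [], []
    for _ in range(n_samples):
        xs = rng.uniform(0, 1, size=m)
        ys = (xs > 0.3).astype(int)
        # 0-1 loss vectors f_h(z_i), one row per threshold h: shape (|H|, m)
        F = ((xs[None, :] > thresholds[:, None]).astype(int) != ys[None, :]).astype(float)
        emp = F.mean(axis=1)                                  # L_S(h) for every h
        reps.append(np.max(np.abs(thresholds - 0.3) - emp))   # Rep_P(H, S), using L_P(h_t) = |t - 0.3|
        sigmas = rng.choice([-1.0, 1.0], size=(n_sigma, m))
        rads.append(np.mean(np.max(sigmas @ F.T, axis=1)) / m)  # empirical Rademacher complexity of F

    print("E[Rep] ~", np.mean(reps), "   2 * Rad ~", 2 * np.mean(rads))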
The following rules can be used to upper-bound the Rademacher complexity of a set $A \subseteq \mathbb{R}^m$.

(Contraction lemma) If all vectors in $A$ are operated by a Lipschitz function, then $\operatorname{Rad}(A)$ is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in $A$ are operated by a contraction mapping, then Rad(A) strictly decreases.
(Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:

$\operatorname{Rad}(A) \leq \max_{a \in A} \|a - \bar{a}\|_2 \cdot \frac{\sqrt{2 \log N}}{m}$

In particular, if $A$ is a set of binary vectors, the norm is at most $\sqrt{m}$, so $\operatorname{Rad}(A) \leq \sqrt{2 \log N / m}$.
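Massart's bound is easy to verify numerically against a Monte Carlo estimate of $\operatorname{Rad}(A)$ for a random finite set; the set and sample sizes below are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    N, m = 50, 100
    A = rng.normal(size=(N, m))                  # a finite set of N vectors in R^m
    a_bar = A.mean(axis=0)

    sigmas = rng.choice([-1.0, 1.0], size=(5000, m))
    rad_estimate = np.mean(np.max(sigmas @ A.T, axis=1)) / m

    # Massart: Rad(A) <= max_a ||a - a_bar||_2 * sqrt(2 log N) / m
    massart_bound = np.max(np.linalg.norm(A - a_bar, axis=1)) * np.sqrt(2 * np.log(N)) / m
    print(rad_estimate, "<=", massart_bound)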
Let $H$ be a family of sets whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as $\operatorname{Growth}(H, m) \leq (em/d)^d$ for all $m > d+1$. This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \leq (em/d)^d$, and the set family $H \cap h$ can be considered as a set of binary vectors over $\{0,1\}^m$. Substituting this in Massart's lemma gives:

$\operatorname{Rad}(H \cap h) \leq \sqrt{\frac{2 d \log(em/d)}{m}}$

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[4]) one can show, for example, that there exists a constant $C$ such that any class of $\{0,1\}$-indicator functions with Vapnik–Chervonenkis dimension $d$ has Rademacher complexity upper-bounded by $C \sqrt{\frac{d}{m}}$.
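To get a feel for the two rates, the short sketch below evaluates the Massart/VC bound $\sqrt{2d\log(em/d)/m}$ against the sharper dependence $\sqrt{d/m}$ for a few sample sizes. The choice $d = 10$ and the omission of the unspecified constant $C$ are assumptions made purely for illustration.

    import numpy as np

    d = 10
    for m in (100, 1000, 10000):
        vc_massart = np.sqrt(2 * d * np.log(np.e * m / d) / m)   # growth function + Massart
        dudley_rate = np.sqrt(d / m)                             # Dudley/Haussler rate, up to a constant C
        print(m, vc_massart, dudley_rate)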
The following bounds are related to linear operations on $S$, a constant set of $m$ vectors $x_1, \dots, x_m$ in $\mathbb{R}^n$.

Define $A_2 = \{ (\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle) \mid \|w\|_2 \leq 1 \}$, the set of dot-products of the vectors in $S$ with vectors in the Euclidean unit ball. Then:

$\operatorname{Rad}(A_2) \leq \frac{\max_i \|x_i\|_2}{\sqrt{m}}$

Define $A_1 = \{ (\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle) \mid \|w\|_1 \leq 1 \}$, the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

$\operatorname{Rad}(A_1) \leq \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2 \log(2n)}{m}}$
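For these linear classes the supremum over the unit ball has a closed form ($\|\sum_i \sigma_i x_i\|_2$ for the Euclidean ball and $\|\sum_i \sigma_i x_i\|_\infty$ for the 1-norm ball), so both complexities can be estimated by Monte Carlo and compared with the bounds. The data in the sketch below are random and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 200, 20
    X = rng.normal(size=(m, n))                    # the fixed vectors x_1, ..., x_m in R^n
    sigmas = rng.choice([-1.0, 1.0], size=(5000, m))
    sums = sigmas @ X                              # sum_i sigma_i x_i, one row per draw

    rad_A2 = np.mean(np.linalg.norm(sums, axis=1)) / m          # sup over ||w||_2 <= 1
    bound_A2 = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(m)

    rad_A1 = np.mean(np.max(np.abs(sums), axis=1)) / m          # sup over ||w||_1 <= 1
    bound_A1 = np.max(np.abs(X)) * np.sqrt(2 * np.log(2 * n) / m)

    print(rad_A2, "<=", bound_A2)
    print(rad_A1, "<=", bound_A1)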
The following bound relates the Rademacher complexity of a set $A$ to its external covering number, i.e., the number of balls of a given radius $r$ whose union contains $A$. The bound is attributed to Dudley. Suppose $A \subset \mathbb{R}^m$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:

$\operatorname{Rad}(A) \leq \frac{c \cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m} \sum_{i=1}^{M} 2^{-i} \sqrt{\log N^{ext}_{c \cdot 2^{-i}}(A)}$

where $N^{ext}_r(A)$ denotes the external covering number of $A$ at radius $r$.
Gaussian complexity is a similar complexity with similar physical meanings, and can be obtained from the Rademacher complexity by using the random variables $g_i$ instead of $\sigma_i$, where $g_i$ are i.i.d. Gaussian random variables with zero-mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$.
Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors.
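Numerically, the two complexities are obtained from the same formula by swapping the noise distribution, as in the following illustrative sketch (the set and the sizes are arbitrary assumptions).

    import numpy as np

    rng = np.random.default_rng(5)
    N, m = 30, 50
    A = rng.normal(size=(N, m))                   # a finite set of vectors in R^m

    def complexity(noise, n_draws=5000):
        # (1/m) E[ sup_{a in A} <xi, a> ], with xi either Rademacher signs or standard Gaussians.
        draws = noise(size=(n_draws, m))
        return np.mean(np.max(draws @ A.T, axis=1)) / m

    print("Rademacher:", complexity(lambda size: rng.choice([-1.0, 1.0], size=size)))
    print("Gaussian:  ", complexity(lambda size: rng.normal(size=size)))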
As an example, consider the Rademacher and Gaussian complexities of the $\ell_1$ ball $A = \{a \in \mathbb{R}^d : \|a\|_1 \leq 1\}$. Its (unnormalized) Rademacher complexity equals exactly 1, since $\sup_{a \in A} \sum_i \sigma_i a_i = \|\sigma\|_\infty = 1$, whereas its (unnormalized) Gaussian complexity is of the order $\sqrt{\log d}$ (which can be shown by applying known properties of suprema of a set of subgaussian random variables).
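For the $\ell_1$ ball the supremum is available in closed form ($\|\sigma\|_\infty$ and $\|g\|_\infty$, respectively), so both quantities can be simulated directly; the dimension below is an arbitrary illustrative choice.

    import numpy as np

    rng = np.random.default_rng(6)
    d, n_draws = 1000, 2000

    # For A = {a : ||a||_1 <= 1}, sup_{a in A} <xi, a> = ||xi||_inf.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, d))
    g = rng.normal(size=(n_draws, d))

    print("unnormalized Rademacher complexity:", np.mean(np.max(np.abs(sigma), axis=1)))  # exactly 1
    print("unnormalized Gaussian complexity:  ", np.mean(np.max(np.abs(g), axis=1)))      # ~ sqrt(2 log d)
    print("sqrt(2 log d) =", np.sqrt(2 * np.log(d)))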