Sample complexity

The sample complexity of a machine learning algorithm represents the number of training samples that it needs in order to successfully learn a target function.

More precisely, the sample complexity is the number of training samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.

There are two variants of sample complexity: the weak variant fixes a particular input-output distribution, while the strong variant takes the worst-case sample complexity over all input-output distributions. The No Free Lunch Theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples.

However, if we are only interested in a particular class of target functions (e.g., only linear functions), then the sample complexity is finite, and it depends linearly on the VC dimension of the class of target functions.

Formally, let $X$ denote the input space, $Y$ the output space, and $Z = X \times Y$ their product, and fix a hypothesis space $\mathcal{H}$ of functions $h \colon X \to Y$. A learning algorithm over $\mathcal{H}$ is a computable map from $Z^{*}$, the set of finite sequences of labeled examples, to $\mathcal{H}$. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from $X$ to $Y$.

Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization.
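To make this concrete, here is a minimal sketch in Python (the synthetic data and the helper name erm_least_squares are illustrative assumptions, not taken from the article): empirical risk minimization for the square loss over linear hypotheses has a closed form via the normal equations, and Tikhonov (ridge) regularization simply adds a penalty $\lambda \lVert w \rVert^2$ to the empirical risk.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample S_n = ((x_1, y_1), ..., (x_n, y_n)) drawn i.i.d.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def erm_least_squares(X, y, lam=0.0):
    """Empirical risk minimization for the square loss over linear hypotheses.

    lam = 0 gives plain ERM; lam > 0 adds Tikhonov (ridge) regularization,
    i.e. minimizes (1/n) * sum_i (w.x_i - y_i)^2 + lam * ||w||^2.
    """
    d = X.shape[1]
    # Normal equations of the regularized least-squares objective.
    A = X.T @ X / len(y) + lam * np.eye(d)
    b = X.T @ y / len(y)
    return np.linalg.solve(A, b)

w_erm = erm_least_squares(X, y)            # plain empirical risk minimization
w_reg = erm_least_squares(X, y, lam=0.1)   # with Tikhonov regularization

Setting lam to zero recovers plain empirical risk minimization, while a positive lam shrinks the estimated weights toward zero.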

Fix a loss function $\mathcal{L} \colon Y \times Y \to \mathbb{R}_{\geq 0}$, for example the square loss $\mathcal{L}(y, y') = (y - y')^2$, and let $\rho$ be a probability distribution on $Z = X \times Y$. The expected risk of a hypothesis $h \in \mathcal{H}$ is

$\mathcal{E}(h) := \mathbb{E}_{\rho}[\mathcal{L}(h(x), y)],$

and the optimal risk over the hypothesis space is $\mathcal{E}^{*}_{\mathcal{H}} := \inf_{h \in \mathcal{H}} \mathcal{E}(h)$. Given a learning algorithm $\mathcal{A}$, write $h_n = \mathcal{A}(S_n)$, where $S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \rho^n$ is a sequence of vectors which are all drawn independently from $\rho$. The algorithm $\mathcal{A}$ is called consistent if $\mathcal{E}(h_n)$ converges in probability to $\mathcal{E}^{*}_{\mathcal{H}}$, that is, if for all $\epsilon, \delta > 0$ there exists a positive integer $N$ such that for all sample sizes $n \geq N$

$\Pr_{\rho^n}\!\left[\mathcal{E}(h_n) - \mathcal{E}^{*}_{\mathcal{H}} \geq \epsilon\right] < \delta.$

The sample complexity of $\mathcal{A}$ is the minimum $N$ for which this holds, written $N(\rho, \epsilon, \delta)$ to emphasize its dependence on $\rho$, $\epsilon$, and $\delta$. If $\mathcal{A}$ is not consistent, one sets $N(\rho, \epsilon, \delta) = \infty$; if some algorithm has finite $N(\rho, \epsilon, \delta)$, the hypothesis space $\mathcal{H}$ is said to be learnable.

In other words, the sample complexity $N(\rho, \epsilon, \delta)$ defines the rate of consistency of the algorithm: given a desired accuracy $\epsilon$ and confidence $\delta$, one needs to sample $N(\rho, \epsilon, \delta)$ data points to guarantee that the risk of the output function is within $\epsilon$ of the best possible, with probability at least $1 - \delta$.[2]
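The definition can also be probed numerically. The following sketch uses an illustrative setup of my own choosing (a one-dimensional threshold class, a uniform input distribution, and noiseless labels, so that $\mathcal{E}^{*}_{\mathcal{H}} = 0$) and estimates by Monte Carlo the smallest sample size on a coarse grid for which the failure probability $\Pr[\mathcal{E}(h_n) - \mathcal{E}^{*}_{\mathcal{H}} \geq \epsilon]$ drops below $\delta$, an empirical stand-in for $N(\rho, \epsilon, \delta)$.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: X = [0, 1] with x ~ Uniform(0, 1), labels y = 1[x >= 0.3],
# hypothesis space H = {threshold classifiers h_t(x) = 1[x >= t]}.
# Here the optimal risk is 0 and the risk of h_t is |t - 0.3|.
TRUE_T, EPS, DELTA, TRIALS = 0.3, 0.05, 0.05, 2000

def erm_threshold(x, y):
    """ERM for thresholds: the smallest positively-labeled point (or 1.0 if none)."""
    pos = x[y == 1]
    return pos.min() if len(pos) else 1.0

def failure_prob(n):
    """Monte Carlo estimate of Pr[ E(h_n) - E*_H >= EPS ] at sample size n."""
    fails = 0
    for _ in range(TRIALS):
        x = rng.uniform(size=n)
        y = (x >= TRUE_T).astype(int)
        risk = abs(erm_threshold(x, y) - TRUE_T)   # excess risk of the ERM hypothesis
        fails += risk >= EPS
    return fails / TRIALS

# Smallest n on a coarse grid with failure probability below DELTA:
# an empirical stand-in for the sample complexity N(rho, EPS, DELTA).
for n in range(10, 201, 10):
    if failure_prob(n) < DELTA:
        print("estimated sample complexity ~", n)
        break

For this particular $\rho$ the failure probability is exactly $(1 - \epsilon)^n$, so the threshold should be crossed near $\ln\delta / \ln(1 - \epsilon) \approx 58$, and the grid search typically stops at n = 60.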

In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether $N(\rho, \epsilon, \delta)$ is bounded by a polynomial in $1/\epsilon$ and $1/\delta$. If $N(\rho, \epsilon, \delta)$ is polynomial for some learning algorithm, then one says that the hypothesis space $\mathcal{H}$ is PAC-learnable; this is a stronger notion than learnability.

One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error.

More formally, one asks whether there exists a learning algorithm $\mathcal{A}$ such that, for all $\epsilon, \delta > 0$, there is a positive integer $N$ such that for all $n \geq N$

$\sup_{\rho} \Pr_{\rho^n}\!\left[\mathcal{E}(h_n) - \mathcal{E}^{*}_{\mathcal{H}} \geq \epsilon\right] < \delta,$

where $h_n = \mathcal{A}(S_n)$ as above and the supremum is taken over all distributions $\rho$ on $X \times Y$.

The No Free Lunch Theorem says that without restrictions on the hypothesis space $\mathcal{H}$, this is not the case: there always exist "bad" distributions for which the sample complexity is arbitrarily large.[1]

Thus, in order to make statements about the rate of convergence of the quantity

$\sup_{\rho} \Pr_{\rho^n}\!\left[\mathcal{E}(h_n) - \mathcal{E}^{*}_{\mathcal{H}} \geq \epsilon\right],$

one must either constrain the space of probability distributions $\rho$, e.g. via a parametric approach, or constrain the space of hypotheses $\mathcal{H}$, as in distribution-free approaches.

The latter approach leads to concepts such as VC dimension and Rademacher complexity, which control the complexity of the space $\mathcal{H}$.
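As an illustration of how such a complexity measure is computed on data, the sketch below (an illustrative example with made-up data; the helper name empirical_rademacher_linear is my own) estimates the empirical Rademacher complexity of the norm-bounded linear class $\{x \mapsto \langle w, x \rangle : \lVert w \rVert \leq B\}$ on a fixed sample, using the fact that for this class the supremum over hypotheses has the closed form $(B/n)\,\lVert \sum_i \sigma_i x_i \rVert$.

import numpy as np

rng = np.random.default_rng(2)

# Fixed sample of n points in R^d (made-up data).
n, d, B = 100, 5, 1.0
X = rng.normal(size=(n, d))

def empirical_rademacher_linear(X, B, n_draws=5000):
    """Estimate E_sigma[ sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i> ].

    For the norm-bounded linear class the supremum is attained at
    w = B * s / ||s|| with s = sum_i sigma_i x_i, giving (B/n) * ||s||.
    """
    n = len(X)
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        s = sigma @ X                             # sum_i sigma_i x_i
        vals.append(B * np.linalg.norm(s) / n)
    return np.mean(vals)

print("empirical Rademacher complexity ~", empirical_rademacher_linear(X, B))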

A smaller hypothesis space introduces more bias into the inference process, meaning that $\mathcal{E}^{*}_{\mathcal{H}}$ may be greater than the best possible risk in a larger space.

However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions.

This trade-off leads to the concept of regularization.[2]

It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space $\mathcal{H}$ of binary functions (functions taking values in $\{0, 1\}$): (1) $\mathcal{H}$ is PAC-learnable; (2) the VC dimension of $\mathcal{H}$ is finite; (3) $\mathcal{H}$ is a uniform Glivenko–Cantelli class. This gives a way to prove that certain hypothesis spaces are PAC-learnable, and, by extension, learnable.

As an example of a PAC-learnable hypothesis space, let $X = \mathbb{R}^d$ and let $\mathcal{H}$ be the class of affine functions on $X$, that is, functions of the form $x \mapsto \langle w, x \rangle + b$ for some $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.

This is the linear classification with offset learning problem.

Now, four coplanar points in a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. The VC dimension of $\mathcal{H}$ is in fact $d + 1$, which is finite, so by the characterization above $\mathcal{H}$ is PAC-learnable, and, by extension, learnable.
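The shattering claim can be verified mechanically. In the sketch below (illustrative code, not from the article), a labeling $y_i \in \{-1, +1\}$ of a point set is realizable by an affine classifier $\operatorname{sign}(\langle w, x \rangle + b)$ exactly when the linear program asking for $y_i(\langle w, x_i \rangle + b) \geq 1$ for all $i$ is feasible (any strictly separating pair $(w, b)$ can be rescaled to meet the margin of 1); running it over all 16 labelings of the square's corners should report exactly the two diagonal labelings as unrealizable.

import itertools
import numpy as np
from scipy.optimize import linprog

# Four coplanar points forming a square.
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

def affinely_realizable(points, labels):
    """Feasibility LP: does there exist (w, b) with labels[i]*(<w, x_i> + b) >= 1?"""
    y = np.asarray(labels, dtype=float)
    # Each constraint -y_i * ([x_i, 1] . [w, b]) <= -1 encodes y_i*(<w, x_i> + b) >= 1.
    A_ub = -y[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

# Try all 2^4 labelings; only the two "diagonal" (XOR) labelings should fail,
# showing the four square corners are not shattered by affine functions.
for labels in itertools.product([-1, 1], repeat=4):
    if not affinely_realizable(square, labels):
        print("not realizable:", labels)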

Suppose $\mathcal{H}$ is a class of binary functions (functions taking values in $\{0, 1\}$). Then $\mathcal{H}$ is $(\epsilon, \delta)$-PAC-learnable with a sample of size

$N = O\!\left(\frac{VC(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon}\right),$

where $VC(\mathcal{H})$ is the VC dimension of $\mathcal{H}$, and any $(\epsilon, \delta)$-PAC-learning algorithm for $\mathcal{H}$ must use $\Omega\!\left(\frac{VC(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon}\right)$ samples. Thus, the sample complexity is a linear function of the VC dimension of the hypothesis space.
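For a rough sense of scale (a back-of-the-envelope calculation of my own, ignoring the unspecified constant hidden in the $O(\cdot)$): for the affine classifiers on $\mathbb{R}^2$ discussed above, whose VC dimension is 3, with accuracy $\epsilon = 0.1$ and confidence $\delta = 0.05$, the upper bound suggests a sample size on the order of

$N \approx \frac{VC(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon} = \frac{3 + \ln 20}{0.1} \approx \frac{3 + 3.0}{0.1} = 60$

samples, up to that constant factor.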

Suppose instead that $\mathcal{H}$ is a class of real-valued functions with range in $[0, T]$. Then analogous sample-complexity bounds hold, with Pollard's pseudo-dimension of $\mathcal{H}$ playing the role of the VC dimension.

In addition to the supervised learning setting, sample complexity is relevant to semi-supervised learning problems including active learning,[7] where the algorithm can ask for labels to specifically chosen inputs in order to reduce the cost of obtaining many labels.
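A minimal sketch of the active-learning idea follows (pool-based uncertainty sampling on made-up data; the variable names and query budget are arbitrary choices, and scikit-learn's LogisticRegression stands in for an arbitrary base learner): starting from a small labeled seed, the learner repeatedly fits a model and queries the label of the unlabeled point it is least certain about, i.e. the one closest to its current decision boundary.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Pool-based setup: a pool of unlabeled points whose true labels (normally
# unknown, and costly to query) come from a linear rule.
pool = rng.normal(size=(200, 2))
true_labels = (pool @ np.array([1.0, -1.0]) > 0).astype(int)

labeled = [0, 1]                               # small seed of queried indices
while len(set(true_labels[labeled])) < 2:      # make sure the seed has both classes
    labeled.append(int(rng.integers(len(pool))))

for _ in range(20):                            # query budget of 20 labels
    clf = LogisticRegression().fit(pool[labeled], true_labels[labeled])
    unlabeled = np.setdiff1d(np.arange(len(pool)), labeled)
    # Uncertainty sampling: query the point closest to the decision boundary.
    margins = np.abs(clf.decision_function(pool[unlabeled]))
    labeled.append(int(unlabeled[np.argmin(margins)]))

print("accuracy on the full pool:", clf.score(pool, true_labels))

Compared with labeling the same number of randomly chosen points, this tends to concentrate the queried labels near the decision boundary, which is the sense in which active learning can reduce the number of labels needed.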

The concept of sample complexity also shows up in reinforcement learning, online learning, and unsupervised algorithms, e.g. for dictionary learning.[9]

In robotics, a high sample complexity means that many calculations are needed for running a Monte Carlo tree search.[10] It is equivalent to a model-free brute force search in the state space.

In contrast, a high-efficiency algorithm has a low sample complexity.