Universal approximation theorem

Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed in 1989 that multilayer feed-forward networks with as few as one hidden layer are universal approximators.[3]

Hornik also showed in 1991[4] that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators.

Moshe Leshno et al. in 1993[5] and later Allan Pinkus in 1999[6] showed that the universal approximation property is equivalent to having a nonpolynomial activation function (with a polynomial activation of degree d, a single-hidden-layer network can only represent polynomials of degree at most d, so it cannot be universal).

The arbitrary depth case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,[7] Dmitry Yarotsky,[8] Zhou Lu et al. in 2017,[9] and Boris Hanin and Mark Sellke in 2018,[10] who focused on neural networks with the ReLU activation function.

In 2020, Patrick Kidger and Terry Lyons[11] extended those results to neural networks with general activation functions such as tanh, GeLU, or Swish.

One special case of arbitrary depth is that in which each component of the composition is drawn from a finite set of mappings.

This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.

In 2018, Guliyev and Ismailov[14] constructed a smooth sigmoidal activation function that provides the universal approximation property for two-hidden-layer feedforward neural networks with fewer units in the hidden layers.

In 2018, they also constructed[15] single-hidden-layer networks with bounded width that are still universal approximators for univariate functions.

In 2022, Shen et al.[16] obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.

The question of minimal possible width for universality was first studied in 2021: Park et al. obtained the minimum width required for the universal approximation of Lp functions using feed-forward neural networks with ReLU as the activation function.[17]

Similar results that can be directly applied to residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using control-theoretic arguments.[18][19]

In 2023, Cai obtained the optimal minimum width bound for the universal approximation.

Robert Hecht-Nielsen showed that a three-layer neural network can approximate any continuous multivariate function.

For input dimension d_x and output dimension d_y, the minimum width required for the universal approximation of Lp functions is exactly max{d_x + 1, d_y} (for a ReLU network).
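
As a quick check of this formula, the following Python sketch (the helper name min_relu_width is ours, purely for illustration) evaluates the bound max{d_x + 1, d_y} for a few input/output dimensions.

```python
def min_relu_width(d_x: int, d_y: int) -> int:
    """Minimum hidden-layer width max{d_x + 1, d_y} needed for deep ReLU
    networks to be universal approximators of Lp functions (result quoted above)."""
    return max(d_x + 1, d_y)

# A few sample dimensions: scalar functions of 1, 2, and 3 variables,
# and a map from R^3 to R^5.
for d_x, d_y in [(1, 1), (2, 1), (3, 1), (3, 5)]:
    print(f"d_x = {d_x}, d_y = {d_y}  ->  minimum width {min_relu_width(d_x, d_y)}")
```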

There are also a variety of results between non-Euclidean spaces[31] and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture,[32][33] radial basis functions,[34] or neural networks with specific properties.

A spate of papers in the 1980s and 1990s, from George Cybenko, Kurt Hornik, and others, established several universal approximation theorems for arbitrary width and bounded depth.

In particular, this shows that a perceptron network with a single infinitely wide hidden layer can approximate arbitrary functions.

With more steps of the staircase, the overshoots smooth out and we get an arbitrarily good approximation of the ramp function.[41]
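
A minimal NumPy sketch of this staircase idea, assuming a standard logistic sigmoid (the step placement and steepness below are illustrative choices, not constants taken from any of the cited proofs): a single hidden layer of steep sigmoids, each contributing a step of height 1/n, forms a staircase that approximates the ramp y = x on [0, 1], and the maximum error shrinks as the number of hidden units grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def staircase_ramp(x, n_units, steepness):
    """One hidden layer of n_units steep sigmoids, each adding a step of
    height 1/n_units, forming a staircase approximation of ramp(x) = x on [0, 1]."""
    centers = (np.arange(n_units) + 0.5) / n_units        # where each step rises
    return sum(sigmoid(steepness * (x - c)) / n_units for c in centers)

x = np.linspace(0.0, 1.0, 2001)
for n in (5, 20, 80):
    approx = staircase_ramp(x, n_units=n, steepness=25.0 * n)
    print(f"{n:3d} hidden units -> max |error| = {np.max(np.abs(approx - x)):.3f}")
```

The printed maximum error drops roughly like 1/(2n), mirroring the informal argument: each extra step halves nothing magical, it simply makes the staircase treads shorter.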

The original proofs, such as the one by Cybenko, use methods from functional analysis, including the Hahn–Banach and Riesz–Markov–Kakutani representation theorems.

The "dual" versions of the theorem consider networks of bounded width and arbitrary depth.

A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.

It was also shown that if the width is less than or equal to n (the input dimension), this general ability to approximate any Lebesgue-integrable function is lost.

In the same paper[9] it was shown that ReLU networks with width n + 1 were sufficient to approximate any continuous function of n-dimensional input variables.
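
To make the setting concrete, here is a small PyTorch sketch (our illustration, not the construction used in the paper) of such a narrow network: every hidden layer has width n + 1 = 3 for an n = 2-dimensional input, and the depth is free to grow. Gradient descent on a sample problem shows the architecture in use, though the theorem itself is an existence statement and does not guarantee that training finds a good approximation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, depth = 2, 8                      # input dimension and number of hidden layers
width = n + 1                        # the critical width from the result above

# Deep, narrow ReLU network: every hidden layer has width n + 1.
layers = [nn.Linear(n, width), nn.ReLU()]
for _ in range(depth - 1):
    layers += [nn.Linear(width, width), nn.ReLU()]
layers.append(nn.Linear(width, 1))
net = nn.Sequential(*layers)

# A continuous target on [-1, 1]^2 to fit.
x = torch.rand(4096, n) * 2 - 1
y = torch.sin(3 * x[:, :1]) * torch.cos(2 * x[:, 1:])

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(3000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
print(f"training MSE after fitting: {loss.item():.4f}")
```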

Universal approximation theorem (L1 distance, ReLU activation, arbitrary depth, minimal width). For any Bochner–Lebesgue p-integrable function f : R^n → R^m and any ε > 0, there exists a fully connected ReLU network of width exactly max{n + 1, m} whose Lp distance from f is at most ε; moreover, there exist p-integrable functions for which no ReLU network of smaller width achieves this.[43]

Remark: If the activation is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is[20] max{d_x, d_y, 2}.

Universal approximation theorem (Uniform non-affine activation, arbitrary depth, constrained width).

Their remarkable result revealed that such networks can be universal approximators, and that two hidden layers are enough to achieve this property.[13]

Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter.

The developed algorithm allows one to compute the activation functions at any point of the real axis instantly.