A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks.
[1][2][3][4][5][6][7][8] The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.
Bayesian neural networks are a type of neural network whose parameters and predictions are both probabilistic.
[9][10] While standard neural networks often assign high confidence even to incorrect predictions,[11] Bayesian neural networks can more accurately evaluate how likely their predictions are to be correct.
As the layers of a Bayesian neural network are made increasingly wide (see figure), the sequence of networks converges in distribution to a NNGP.
This large width limit is of practical interest, since the networks often improve as layers get wider.
[12][4][13] In addition, the NNGP may provide a closed-form way to evaluate these networks.
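This convergence can be illustrated numerically. The following sketch is added here for illustration and is not taken from the cited references; the tanh nonlinearity, the widths, and the variance values are arbitrary choices. It samples the scalar output of many randomly initialized one-hidden-layer networks at a fixed input and reports the excess kurtosis, which approaches 0 (the Gaussian value) as the hidden layer widens.

```python
# Illustrative sketch (not from the cited references): outputs of random
# fully connected networks at a fixed input look increasingly Gaussian
# as the hidden layer is widened. Widths and variances are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # a fixed 3-dimensional input
sigma_w, sigma_b = 1.5, 0.1       # weight and bias standard deviations

def sample_outputs(width, n_samples=10000):
    outs = np.empty(n_samples)
    for s in range(n_samples):
        W1 = rng.normal(0, sigma_w / np.sqrt(len(x)), size=(width, len(x)))
        b1 = rng.normal(0, sigma_b, size=width)
        W2 = rng.normal(0, sigma_w / np.sqrt(width), size=width)
        b2 = rng.normal(0, sigma_b)
        outs[s] = W2 @ np.tanh(W1 @ x + b1) + b2
    return outs

for width in (1, 10, 1000):
    z = sample_outputs(width)
    # excess kurtosis of a Gaussian is 0; it shrinks as the layer widens
    print(width, stats.kurtosis(z))
```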
The NNGP also appears in several other contexts: it describes the distribution over predictions made by wide non-Bayesian artificial neural networks after random initialization of their parameters but before training; it appears as a term in neural tangent kernel prediction equations; and it is used in deep information propagation to characterize whether hyperparameters and architectures will be trainable.[14] The NNGP is also related to other large-width limits of neural networks.
The first correspondence result was established in the 1995 PhD thesis of Radford M. Neal,[15] then supervised by Geoffrey Hinton at the University of Toronto.
Neal cites David J. C. MacKay, who worked in Bayesian learning, as inspiration.
[8] In fact, this NNGP correspondence holds for almost any architecture: Generally, if an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e., a tensor program), then it has an infinite-width GP.
[8] This in particular includes all feedforward and recurrent neural networks composed of multilayer perceptrons, recurrent layers (e.g., LSTMs, GRUs), (nD or graph) convolution, pooling, skip connections, attention, batch normalization, and/or layer normalization.
Figure caption: As neural networks are made infinitely wide, the distribution over the functions they compute converges to a Gaussian process for many architectures. Black dots show function values computed by the network on a set of inputs for random draws of the parameters from the prior, and red lines show iso-probability contours of the induced joint distribution over network outputs; a companion panel shows the corresponding distribution in parameter space, with black dots marking parameter samples.
This section expands on the correspondence between infinitely wide neural networks and Gaussian processes for the specific case of a fully connected architecture.
It provides a proof sketch outlining why the correspondence holds, and introduces the specific functional form of the NNGP for fully connected networks.
The proof sketch closely follows the approach by Novak and coauthors.
[4] Consider a fully connected artificial neural network with inputs $x$, parameters $\theta$ consisting of weights $W^l$ and biases $b^l$ for each layer $l$, pre-activations (pre-nonlinearity) $z^l$, activations (post-nonlinearity) $y^l$, pointwise nonlinearity $\phi(\cdot)$, and layer widths $n^l$. For simplicity, the width of the readout vector $z^L$ is taken to be 1. The parameters of this network have a prior distribution $p(\theta)$, consisting of an isotropic Gaussian for each weight and bias, with the variance of the weights scaled inversely with layer width. This network is illustrated in the figure to the right, and described by the following set of equations:

$$
y^0(x) = x, \qquad y^l(x) = \phi\!\left(z^{l-1}(x)\right) \text{ for } l > 0, \qquad
z_i^l(x) = b_i^l + \sum_{j=1}^{n^l} W_{ij}^l \, y_j^l(x),
$$
$$
W_{ij}^l \sim \mathcal{N}\!\left(0, \tfrac{\sigma_w^2}{n^l}\right), \qquad b_i^l \sim \mathcal{N}\!\left(0, \sigma_b^2\right).
$$

We first observe that the pre-activations $z^l$ are described by a Gaussian process conditioned on the preceding activations $y^l$. This result holds even at finite width. Each pre-activation $z_i^l$ is a weighted sum of Gaussian random variables, corresponding to the weights $W_{ij}^l$ and biases $b_i^l$, where the coefficients of those Gaussian variables are the preceding activations $y_j^l$. Because it is a weighted sum of zero-mean Gaussians, $z_i^l$ is itself a zero-mean Gaussian conditioned on the coefficients $y_j^l$. Since the $z^l(x)$ are jointly Gaussian for any set of inputs $x$, they are described by a Gaussian process conditioned on the preceding activations $y^l$.
The covariance or kernel of this Gaussian process depends on the weight and bias variances $\sigma_w^2$ and $\sigma_b^2$, as well as the second moment matrix $K^l$ of the preceding activations $y^l$, evaluated over all pairs of inputs $x$ and $x'$:

$$
z_i^l(x) \mid y^l \;\sim\; \mathcal{GP}\!\left(0,\; \sigma_w^2 K^l + \sigma_b^2\right),
\qquad
K^l(x, x') = \frac{1}{n^l} \sum_{i=1}^{n^l} y_i^l(x)\, y_i^l(x').
$$

The weight variance $\sigma_w^2$ rescales the contribution of $K^l$ to the covariance, while the bias is shared across all inputs, so $\sigma_b^2$ makes the $z_i^l$ for different datapoints more similar and the covariance matrix closer to a constant matrix.
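This conditional covariance can be checked directly. The following is a minimal numerical sketch (with arbitrarily chosen width and variances, not from the cited references): fixing the preceding activations at two inputs and resampling only one layer's weights and biases reproduces the covariance $\sigma_w^2 K^l + \sigma_b^2$.

```python
# Sketch: empirical check that z^l conditioned on fixed activations y^l is
# Gaussian with covariance sigma_w^2 K^l + sigma_b^2. Width and variances
# below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n_l = 50                                   # width of the preceding layer
sigma_w2, sigma_b2 = 2.0, 0.5              # weight and bias variances

# fix the preceding activations y^l(x), y^l(x') for two inputs x, x'
y_x  = rng.normal(size=n_l)
y_xp = rng.normal(size=n_l)
K = np.array([[y_x @ y_x,  y_x @ y_xp],
              [y_xp @ y_x, y_xp @ y_xp]]) / n_l   # second moment matrix K^l

# resample the weights and bias of one unit many times
samples = []
for _ in range(100_000):
    w = rng.normal(0, np.sqrt(sigma_w2 / n_l), size=n_l)
    b = rng.normal(0, np.sqrt(sigma_b2))
    samples.append([w @ y_x + b, w @ y_xp + b])
samples = np.array(samples)

print(np.cov(samples.T))               # empirical covariance of (z(x), z(x'))
print(sigma_w2 * K + sigma_b2)         # predicted covariance sigma_w^2 K^l + sigma_b^2
```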
Since the pre-activations $z^l$ depend on $y^l$ only through its second moment matrix $K^l$, $z^l$ can equivalently be described as a Gaussian process conditioned on $K^l$ rather than on $y^l$:

$$
z_i^l(x) \mid K^l \;\sim\; \mathcal{GP}\!\left(0,\; \sigma_w^2 K^l + \sigma_b^2\right).
$$

Since $y^l = \phi(z^{l-1})$ for $l > 0$, the second moment matrix $K^l$ can be written in terms of $z^{l-1}$, so the sum defining $K^l$ is an average over $n^l$ samples from a Gaussian process whose kernel is a function of $K^{l-1}$:

$$
K^l(x, x') = \frac{1}{n^l} \sum_{i=1}^{n^l} \phi\!\left(z_i^{l-1}(x)\right) \phi\!\left(z_i^{l-1}(x')\right),
\qquad
z_i^{l-1} \mid K^{l-1} \;\sim\; \mathcal{GP}\!\left(0,\; \sigma_w^2 K^{l-1} + \sigma_b^2\right).
$$

As the layer width $n^l \to \infty$, this average over $n^l$ samples can be replaced by an expectation over the Gaussian process:

$$
K^l(x, x') = \int dz\, dz'\; \phi(z)\,\phi(z')\;
\mathcal{N}\!\left(z, z';\; 0,\; \sigma_w^2 K^{l-1} + \sigma_b^2\right),
$$

where the bivariate Gaussian is over the pair of pre-activations $\left(z^{l-1}(x),\, z^{l-1}(x')\right)$. So, in the infinite width limit, the second moment matrix $K^l$ for each pair of inputs $x$ and $x'$ can be expressed as a 2d integral over a Gaussian whose covariance is determined by $K^{l-1}$. There are a number of situations where this integral has been solved analytically, such as when $\phi$ is a ReLU,[17] ELU, GELU,[18] or error function[1] nonlinearity.
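As a worked special case (stated here in the notation above; this is the arc-cosine kernel form associated with the ReLU reference), for $\phi(t) = \max(0, t)$ the integral evaluates to

$$
K^l(x,x') = \frac{\sqrt{\Sigma(x,x)\,\Sigma(x',x')}}{2\pi}\left(\sin\theta + (\pi - \theta)\cos\theta\right),
\qquad
\theta = \arccos\frac{\Sigma(x,x')}{\sqrt{\Sigma(x,x)\,\Sigma(x',x')}},
$$

where $\Sigma = \sigma_w^2 K^{l-1} + \sigma_b^2$ is the covariance of the pre-activations in layer $l-1$.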
Even when it can't be solved analytically, since it is a 2d integral it can generally be efficiently computed numerically.
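For a single pair of inputs, a brute-force numerical evaluation is straightforward. The sketch below (an illustration with an arbitrarily chosen covariance; Gauss-Hermite quadrature is a common faster alternative) estimates the integral by Monte Carlo and checks it against the ReLU closed form above.

```python
# Sketch: numerically evaluating the 2d Gaussian integral E[phi(u) phi(v)]
# for one pair of inputs by simple Monte Carlo, and comparing with the
# ReLU (arc-cosine) closed form. The covariance values are arbitrary.
import numpy as np

def expected_phi_product(cov, phi, n_samples=1_000_000, seed=0):
    """E[phi(u) phi(v)] for (u, v) ~ N(0, cov), with cov a 2x2 covariance."""
    rng = np.random.default_rng(seed)
    u, v = rng.multivariate_normal(np.zeros(2), cov, size=n_samples).T
    return np.mean(phi(u) * phi(v))

relu = lambda t: np.maximum(t, 0.0)
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])       # sigma_w^2 K^{l-1} + sigma_b^2 for the pair (x, x')
mc = expected_phi_product(cov, relu)

# closed form for ReLU (arc-cosine kernel) for comparison
theta = np.arccos(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))
exact = np.sqrt(cov[0, 0] * cov[1, 1]) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
print(mc, exact)
```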
This 2d integral is deterministic, so $K^l$ is deterministic in the infinite width limit. For shorthand, we define a functional $F$, which corresponds to computing this 2d integral for all pairs of inputs, and which maps $K^{l-1}$ into $K^l$:

$$
K^l = F\!\left(K^{l-1}\right).
$$
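A compact sketch of this recursion for a ReLU network follows (it assumes the arc-cosine closed form given above; the depth and variances are arbitrary choices, and the variable names are illustrative rather than taken from any library).

```python
# Sketch: the map F takes the full second moment matrix K^{l-1} of one
# layer's activations to K^l; composing it L times gives K^L.
import numpy as np

SIGMA_W2, SIGMA_B2 = 2.0, 0.1          # weight and bias variances (arbitrary)

def F(K):
    """Map K^{l-1} -> K^l for a ReLU nonlinearity (arc-cosine integral)."""
    S = SIGMA_W2 * K + SIGMA_B2        # covariance of the pre-activations z^{l-1}
    d = np.sqrt(np.diag(S))
    corr = np.clip(S / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(corr)
    return np.outer(d, d) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 inputs with n^0 = 3
K = X @ X.T / X.shape[1]                           # input second moment matrix K^0
for _ in range(3):                                 # L = 3 hidden layers
    K = F(K)                                       # K^L = F(F(F(K^0)))
nngp_covariance = SIGMA_W2 * K + SIGMA_B2          # covariance of the outputs z^L
print(nngp_covariance)
```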
By combining this expression with the further observations that the input layer second moment matrix $K^0(x, x') = \frac{1}{n^0}\sum_i x_i x'_i$ is deterministic in terms of the input $x$, and that $z^L \mid K^L$ is a Gaussian process, the output of the neural network can be expressed as a Gaussian process in terms of its input:

$$
z_i^L \mid x \;\sim\; \mathcal{GP}\!\left(0,\; \sigma_w^2 K^L + \sigma_b^2\right),
\qquad
K^L = F \circ F \circ \cdots \circ F\!\left(K^0\right),
$$

with $F$ applied once per layer.

Neural Tangents is a free and open-source Python library used for computing and doing inference with the NNGP and neural tangent kernel corresponding to various common ANN architectures.
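A minimal usage sketch based on the library's stax API is shown below; the specific layers, widths, and arguments are illustrative choices, and the Neural Tangents documentation should be consulted for the current interface.

```python
# Sketch: computing the infinite-width NNGP (and NTK) kernels for a small
# fully connected ReLU architecture with Neural Tangents. Widths passed to
# Dense affect only finite-width sampling, not the infinite-width kernels.
import numpy as np
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = np.random.normal(size=(4, 3))   # 4 inputs of dimension 3
x2 = np.random.normal(size=(6, 3))   # 6 inputs of dimension 3

# closed-form infinite-width kernels between the two batches of inputs
kernels = kernel_fn(x1, x2, ('nngp', 'ntk'))
print(kernels.nngp.shape)            # (4, 6) NNGP covariance matrix
```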