However, in the limit of large layer width the NTK becomes constant, revealing a duality between training the wide neural network and kernel methods: gradient descent in the infinite-width limit is fully equivalent to kernel gradient descent with the NTK.
As a result, using gradient descent to minimize least-square loss for neural networks yields the same mean estimator as ridgeless kernel regression with the NTK.
This duality enables simple closed-form equations describing the training dynamics, generalization, and predictions of wide neural networks.
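As an illustration of such closed-form expressions, consider training on a least-squares loss by gradient flow with a constant NTK $\Theta$ and an initialization whose outputs are negligible; the notation below (training inputs $X$, targets $y$, training time $t$) is introduced only for this sketch. The mean prediction of the trained network at an input $x$ is then

$$ f_t(x) \;=\; \Theta(x, X)\,\Theta(X, X)^{-1}\bigl(I - e^{-t\,\Theta(X, X)}\bigr)\,y, $$

and letting $t \to \infty$ recovers ridgeless kernel regression with the NTK, $f_\infty(x) = \Theta(x, X)\,\Theta(X, X)^{-1}\,y$.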
The NTK was introduced in 2018 by Arthur Jacot, Franck Gabriel and Clément Hongler,[1] who used it to study the convergence and generalization properties of fully connected neural networks.
Later works[2][3] extended the NTK results to other neural network architectures.
In fact, the phenomenon behind the NTK is not specific to neural networks and can be observed in generic nonlinear models, usually by a suitable scaling.[4]

Let $f(x; \theta)$ denote the scalar function computed by a given neural network with parameters $\theta$, evaluated on an input $x$. The neural tangent kernel of this network is the inner product of the parameter gradients at two inputs,
$$\Theta(x, x'; \theta) \;=\; \nabla_\theta f(x; \theta) \cdot \nabla_\theta f(x'; \theta).$$
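As a concrete illustration of this definition, the following minimal sketch computes the finite-width (empirical) NTK of a toy two-layer ReLU network with JAX automatic differentiation; the architecture, the flat parameter packing, and the names (f, empirical_ntk, width) are illustrative choices rather than anything prescribed above.

import jax
import jax.numpy as jnp

# Toy two-layer ReLU network with all parameters packed into one flat
# vector theta (an illustrative choice, not a specific architecture).
def f(theta, x, width=64):
    d = x.shape[0]
    W1 = theta[: width * d].reshape(width, d)
    b1 = theta[width * d : width * d + width]
    w2 = theta[width * d + width :]
    h = jax.nn.relu(W1 @ x + b1)
    # 1/sqrt(width) output scaling keeps the kernel of order one as the width grows.
    return jnp.dot(w2, h) / jnp.sqrt(width)

def empirical_ntk(theta, x1, x2):
    # Theta(x1, x2; theta) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>
    g1 = jax.grad(f)(theta, x1)
    g2 = jax.grad(f)(theta, x2)
    return jnp.dot(g1, g2)

d, width = 3, 64
theta = jax.random.normal(jax.random.PRNGKey(0), (width * d + 2 * width,))
x1 = jnp.array([1.0, 2.0, -1.0])
x2 = jnp.array([0.5, -0.3, 2.0])
print(empirical_ntk(theta, x1, x2))   # a scalar kernel value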
Consider a fully connected neural network whose parameters are chosen i.i.d.
Consider taking the width of every hidden layer to infinity and training the neural network with gradient descent (with a suitably small learning rate).
In this infinite-width limit, several nice properties emerge: the NTK becomes a deterministic, architecture-dependent kernel at initialization, and it remains constant throughout training, so the network's training dynamics are governed by a single fixed kernel.

From a physics point of view, the NTK can be understood as a type of Hamiltonian, since it generates the time-evolution of observables when the neural network is trained by gradient descent with infinitesimally small steps (the continuum limit).[7]

Kernel methods are machine learning algorithms which use only pairwise relations between input points.[8] For example, the minimum-norm solution of least-squares linear regression can be written in the dual form $w = X^\top (X X^\top)^{-1} y$, where $X X^\top$ is the Gram matrix of pairwise inner products between the training inputs. Note that this dual solution is expressed solely in terms of the inner products between inputs.
The regression equations are called "ridgeless" because they lack a ridge regularization term.
It is known that if the weight vector is initialized at (or close to) zero, gradient descent on the least-squares loss converges to the minimum-norm solution, i.e., the final weight vector has the smallest Euclidean norm among all interpolating solutions. In the same way, kernel gradient descent yields the minimum-norm solution with respect to the norm of the reproducing kernel Hilbert space (RKHS).
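The following minimal numpy sketch illustrates both points in the plain linear case: gradient descent from a zero initialization on an overparametrized least-squares problem converges to the minimum-Euclidean-norm interpolant, which coincides with the dual solution written in terms of the Gram matrix of inner products. The problem sizes, step size, and variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                          # 20 samples, 200 features: overparametrized
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on the least-squares loss, starting from w = 0.
w = np.zeros(d)
eta = 1e-3
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)

# Minimum-norm interpolating solution, written in the dual form that uses
# the data only through the Gram matrix of inner products X @ X.T.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(w, w_min_norm, atol=1e-6))   # True: GD found the minimum-norm solution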
Therefore, studying kernels with high-dimensional feature maps can provide insight into strongly overparametrized models.
Surprisingly, modern neural networks (which tend to be strongly overparametrized) seem to generalize well, even in the absence of explicit regularization.[9][10]

To study the generalization properties of overparametrized neural networks, one can exploit the infinite-width duality with ridgeless kernel regression.
Recent works[11][12][13] have derived equations describing the expected generalization error of high-dimensional kernel regression; these results immediately explain the generalization of sufficiently wide neural networks trained to convergence on a least-squares loss.
When the empirical loss is convex with a global minimum, if the NTK remains positive-definite during training, the loss of the ANN converges to that global minimum as the training time tends to infinity.
This positive-definiteness property has been shown in a number of cases, yielding the first proofs that large-width ANNs converge to global minima during training.
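For the least-squares loss, a short calculation shows why positive-definiteness forces convergence, assuming the smallest eigenvalue of the NTK on the training set stays bounded below by some $\lambda_{\min} > 0$ throughout training. Writing $\mathcal{L}(t) = \tfrac{1}{2}\lVert f_t(X) - y \rVert^2$, where $f_t(X)$ is the vector of network outputs on the training inputs $X$ and $y$ the targets, and using the gradient-flow dynamics $\partial_t f_t(X) = -\Theta_t(X, X)\,(f_t(X) - y)$,

$$ \frac{d}{dt}\mathcal{L}(t) \;=\; (f_t(X) - y)^\top \partial_t f_t(X) \;=\; -(f_t(X) - y)^\top \Theta_t(X, X)\,(f_t(X) - y) \;\le\; -2\lambda_{\min}\,\mathcal{L}(t), $$

so $\mathcal{L}(t) \le \mathcal{L}(0)\, e^{-2\lambda_{\min} t} \to 0$: the training loss decays exponentially to its global minimum.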
Individual parameters of a wide neural network in the kernel regime change negligibly during training.
This is not a generic feature of infinite-width neural networks; it is largely a consequence of the specific scaling by which the width is taken to infinity. Indeed, several works[21][22][23][24] have found alternative infinite-width scaling limits of neural networks in which there is no duality with kernel regression and feature learning occurs during training.
Others[25] introduce a "neural tangent hierarchy" to describe finite-width effects, which may drive feature learning.
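A small numerical experiment can make the negligible parameter movement of the kernel ("lazy") regime described above concrete. The sketch below trains a two-layer ReLU network in the NTK parametrization for a fixed number of gradient descent steps and reports the relative change of the parameter vector; the data, widths, and hyperparameters are arbitrary choices for illustration, not taken from the cited works.

import numpy as np

def relative_parameter_change(width, steps=500, lr=0.01, seed=0):
    # Two-layer ReLU network f(x) = a^T relu(W x) / sqrt(width) in the NTK
    # parametrization, trained by full-batch gradient descent on least squares.
    rng = np.random.default_rng(seed)
    n, d = 10, 5
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    W = rng.normal(size=(width, d))
    a = rng.normal(size=width)
    W0, a0 = W.copy(), a.copy()

    for _ in range(steps):
        pre = X @ W.T                            # (n, width) pre-activations
        H = np.maximum(pre, 0.0)                 # hidden activations
        r = H @ a / np.sqrt(width) - y           # residuals f(X) - y
        grad_a = H.T @ r / np.sqrt(width)
        grad_W = (((pre > 0) * (r[:, None] * a[None, :])).T @ X) / np.sqrt(width)
        a -= lr * grad_a
        W -= lr * grad_W

    theta = np.concatenate([W.ravel(), a])
    theta0 = np.concatenate([W0.ravel(), a0])
    return np.linalg.norm(theta - theta0) / np.linalg.norm(theta0)

# The relative change shrinks as the width grows (roughly like 1/sqrt(width)).
for m in (10, 100, 1000, 10000):
    print(m, relative_parameter_change(m))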
Neural Tangents is a free and open-source Python library for computing and performing inference with the infinite-width NTK and the neural network Gaussian process (NNGP) corresponding to various common ANN architectures.[26] In addition, there exists a scikit-learn-compatible implementation of the infinite-width NTK for Gaussian processes called scikit-ntk.
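A minimal usage sketch of the Neural Tangents stax API is shown below; the architecture, layer widths, and input shapes are arbitrary examples, and the API details should be checked against the library's documentation.

import jax
import jax.numpy as jnp
from neural_tangents import stax

# A simple fully connected architecture; stax.serial returns functions for
# initialization, finite-width application, and the infinite-width kernels.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x_train = jax.random.normal(jax.random.PRNGKey(0), (4, 8))   # 4 inputs of dimension 8
x_test = jax.random.normal(jax.random.PRNGKey(1), (2, 8))

ntk = kernel_fn(x_test, x_train, 'ntk')      # infinite-width NTK, shape (2, 4)
nngp = kernel_fn(x_test, x_train, 'nngp')    # NNGP kernel, shape (2, 4)
print(ntk.shape, nngp.shape)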
Empirical risk minimization proceeds as in the scalar case, with the difference being that the loss function takes vector-valued inputs.
Minimizing the empirical loss through continuous-time gradient descent yields the following evolution in function space, driven by the NTK:
$$\partial_t f_{\theta(t)}(x) \;=\; -\sum_{i=1}^{n} \Theta\bigl(x, x_i; \theta(t)\bigr)\,\partial_f c\bigl(f_{\theta(t)}(x_i), y_i\bigr),$$
where $(x_i, y_i)$ are the training examples and $\partial_f c$ is the derivative of the per-example loss with respect to the network output; for the least-squares loss this factor is simply the residual $f_{\theta(t)}(x_i) - y_i$.
When the parameters are initialized as i.i.d. standard normal variables, the NTK has a finite nontrivial limit as the widths of the hidden layers tend to infinity.[28][29][5] The NTK describes the evolution of neural networks under gradient descent in function space.
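To make the function-space picture concrete, the sketch below numerically integrates the least-squares version of the evolution above with a fixed, positive-definite stand-in for the NTK Gram matrix; the kernel, data, step size, and the test-kernel row are arbitrary placeholders introduced only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)                  # stand-in for Theta(X, X), positive-definite
k_test = rng.normal(size=n)              # stand-in for Theta(x_test, X)
y = rng.normal(size=n)

f = np.zeros(n)                          # outputs on the training set, f_0 = 0
f_test = 0.0                             # output at a test point
eta = 0.01
for _ in range(5000):
    resid = f - y
    f = f - eta * K @ resid              # Euler step of d f/dt = -Theta (f - y)
    f_test = f_test - eta * k_test @ resid

print(np.allclose(f, y, atol=1e-6))                                    # training outputs interpolate the targets
print(np.allclose(f_test, k_test @ np.linalg.solve(K, y), atol=1e-6))  # ridgeless kernel regression prediction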