Random feature

Random features (RF) are a technique used in machine learning to approximate kernel methods, introduced by Ali Rahimi and Ben Recht in their 2007 paper "Random Features for Large-Scale Kernel Machines",[1] and extended in subsequent work.[2][3] RF approximates kernel functions by a Monte Carlo average over randomly sampled feature maps.

It is used for datasets that are too large for traditional kernel methods such as support vector machines, kernel ridge regression, and Gaussian process regression.

Kernel methods replace linear operations in high-dimensional space by operations on the kernel matrix

$$K_X := [k(x_i, x_j)]_{i, j \in 1:N},$$

where $k$ is the kernel function. Random feature methods instead replace the kernel by an inner product in a low-dimensional feature space $\mathbb{R}^D$:

$$k(x, y) \approx \langle z(x), z(y) \rangle,$$

where $z : \mathbb{R}^d \to \mathbb{R}^D$ is a randomly sampled feature map.

This converts kernel linear regression into linear regression in feature space, kernel SVM into SVM in feature space, etc. Since the feature map is low-dimensional, these methods no longer involve matrices of size $N \times N$, but only random feature matrices of size $N \times D$, where $D$ is chosen much smaller than $N$.
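For ridge regression with regularization $\lambda$, for instance (a standard comparison included here for illustration), the kernel form requires solving a system with the $N \times N$ matrix $K_X$,

$$\hat f(x) = k(x, X)\,(K_X + \lambda I_N)^{-1} y,$$

whereas the featurized form only involves the $N \times D$ feature matrix $Z = [z(x_1), \ldots, z(x_N)]^{\top}$,

$$\hat f(x) = z(x)^{\top}\,(Z^{\top} Z + \lambda I_D)^{-1} Z^{\top} y,$$

cutting the dominant training cost from $O(N^3)$ time and $O(N^2)$ memory to $O(N D^2)$ time and $O(N D)$ memory.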

The radial basis function (RBF) kernel on two samples $x, y \in \mathbb{R}^d$ is defined as

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right),$$

where $\sigma$ is a free parameter defining the shape of the kernel. It can be approximated by a random Fourier feature map $z : \mathbb{R}^d \to \mathbb{R}^{2D}$:

$$z(x) := \frac{1}{\sqrt{D}}\left[\cos\langle \omega_1, x\rangle, \sin\langle \omega_1, x\rangle, \ldots, \cos\langle \omega_D, x\rangle, \sin\langle \omega_D, x\rangle\right]^{\top},$$

where $\omega_1, \ldots, \omega_D$ are IID samples from the multidimensional normal distribution $N(0, \sigma^{-2} I)$.
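A short numerical sketch of this approximation using NumPy (the helper names rff_features and rbf_kernel, and the sizes chosen, are illustrative assumptions rather than code from the paper):

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """Exact RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def rff_features(X, omegas):
    """Random Fourier feature map z : R^d -> R^{2D}.

    X: (n, d) data matrix; omegas: (D, d) frequencies sampled IID from N(0, sigma^{-2} I).
    """
    D = omegas.shape[0]
    projections = X @ omegas.T  # (n, D) inner products <omega_j, x_i>
    return np.hstack([np.cos(projections), np.sin(projections)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.5
x, y = rng.normal(size=d), rng.normal(size=d)

# Sample frequencies omega_j ~ N(0, sigma^{-2} I).
omegas = rng.normal(scale=1.0 / sigma, size=(D, d))

Z = rff_features(np.stack([x, y]), omegas)
approx = Z[0] @ Z[1]  # <z(x), z(y)>
exact = rbf_kernel(x, y, sigma)
print(f"exact RBF kernel:  {exact:.4f}")
print(f"RFF approximation: {approx:.4f}")
```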

Theorem (unbiased estimation): $\operatorname{E}[\langle z(x), z(y) \rangle] = k(x, y)$.

Proof: By independence of $\omega_1, \ldots, \omega_D$, it suffices to consider the case $D = 1$. By the angle-difference identity,

$$\cos\langle \omega, x \rangle \cos\langle \omega, y \rangle + \sin\langle \omega, x \rangle \sin\langle \omega, y \rangle = \cos\langle \omega, x - y \rangle.$$

Apply the spherical symmetry of the normal distribution, then evaluate the integral:

$$\operatorname{E}[\cos\langle \omega, x - y \rangle] = e^{-\frac{\|x - y\|^2}{2\sigma^2}} = k(x, y).$$
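In more detail (a standard computation of this integral): by spherical symmetry one may rotate coordinates so that $x - y = (r, 0, \ldots, 0)$ with $r = \|x - y\|$, so that $\langle \omega, x - y \rangle = r\omega_1$ with $\omega_1 \sim N(0, \sigma^{-2})$. The characteristic function of the normal distribution then gives

$$\operatorname{E}[\cos(r\omega_1)] = \operatorname{Re}\,\operatorname{E}\left[e^{i r \omega_1}\right] = \exp\left(-\frac{r^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$$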

Theorem (convergence): Since the summands $\cos\langle \omega_j, x - y \rangle$ for $j = 1, \ldots, D$ are IID, it suffices to prove that the variance of a single summand is finite; this holds because the cosine is bounded, so $\langle z(x), z(y) \rangle \to k(x, y)$ in probability as $D \to \infty$. Moreover, since the summands are bounded, there is a stronger convergence guarantee by Hoeffding's inequality.
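Concretely, each summand lies in $[-1, 1]$, so Hoeffding's inequality in its standard form gives, for any $\varepsilon > 0$,

$$\Pr\left(\left|\langle z(x), z(y) \rangle - k(x, y)\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{D\varepsilon^2}{2}\right).$$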

By Bochner's theorem, the above construction can be generalized to an arbitrary positive definite shift-invariant kernel $k(x, y) = k(x - y)$.[1]: Claim 1
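In outline, the reason is that for such a kernel, Bochner's theorem guarantees that its Fourier transform $p(\omega)$ (suitably normalized) is a probability density, so that

$$k(x - y) = \int_{\mathbb{R}^d} p(\omega)\, e^{i\langle \omega, x - y \rangle}\, d\omega = \operatorname{E}_{\omega \sim p}\left[\cos\langle \omega, x - y \rangle\right],$$

and the same cosine and sine feature map can be used with $\omega_1, \ldots, \omega_D$ drawn from $p$ instead of a normal distribution; for example, the Laplacian kernel corresponds to sampling each coordinate of $\omega$ from a Cauchy distribution.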

Given a random Fourier feature map $z$, training the feature on a dataset by featurized linear regression is equivalent to fitting complex parameters $\theta_1, \ldots, \theta_D \in \mathbb{C}$ such that

$$f_\theta(x) = \operatorname{Re}\left(\sum_{j=1}^{D} \theta_j e^{i\langle \omega_j, x \rangle}\right),$$

which is a neural network with a single hidden layer, with activation function $t \mapsto e^{it}$, zero bias, and the parameters in the first layer frozen.

In the overparameterized case, when the number of parameters is at least the number of data points, the network linearly interpolates the dataset $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, and the network parameters are the least-norm solution:

$$\hat\theta = \operatorname*{arg\,min}_{\theta :\, f_\theta(x_i) = y_i \ \forall i} \|\theta\|.$$
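A minimal numerical sketch of this frozen-first-layer view, using the equivalent real-valued cosine and sine parameterization and a Moore-Penrose pseudoinverse for the least-norm fit (the variable names and problem sizes here are illustrative assumptions, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, D, sigma = 20, 3, 50, 1.0  # 2D = 100 >= N: overparameterized regime

# Toy dataset to interpolate.
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# First layer: frozen random frequencies omega_j ~ N(0, sigma^{-2} I).
omegas = rng.normal(scale=1.0 / sigma, size=(D, d))
Z = np.hstack([np.cos(X @ omegas.T), np.sin(X @ omegas.T)]) / np.sqrt(D)  # (N, 2D)

# Second layer: least-norm weights solving Z theta = y via the pseudoinverse.
theta = np.linalg.pinv(Z) @ y

print("max interpolation error:", np.max(np.abs(Z @ theta - y)))
```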

A random binning features map partitions the input space using randomly shifted grids at randomly chosen resolutions, and assigns to an input point a binary bit string that corresponds to the bins in which it falls.[5] The grids are constructed so that the probability that two points $x, y$ are assigned to the same bin is proportional to $k(x, y)$. The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of $k(x, y)$. Since this mapping is not smooth and uses the proximity between input points, Random Binning Features works well for approximating kernels that depend only on the $L^1$ distance between data points.
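A simplified one-dimensional illustration of this idea for the Laplacian kernel $k(x, y) = e^{-|x - y|/\sigma}$: the grid pitch is drawn from a Gamma distribution (density proportional to $\delta\, e^{-\delta/\sigma}$), which makes the per-grid collision probability average out to the Laplacian kernel. This is a toy sketch with assumed parameters, not the full multi-resolution construction of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0   # Laplacian kernel scale: k(x, y) = exp(-|x - y| / sigma)
P = 5000      # number of independent random grids (feature repetitions)

x, y = 0.3, 1.1  # two scalar inputs

# For each grid: pitch delta ~ Gamma(shape=2, scale=sigma), i.e. p(delta) ∝ delta * exp(-delta/sigma),
# and shift u ~ Uniform[0, delta); a point is assigned the bin index floor((point - u) / delta).
deltas = rng.gamma(shape=2.0, scale=sigma, size=P)
shifts = rng.uniform(0.0, deltas)

bins_x = np.floor((x - shifts) / deltas)
bins_y = np.floor((y - shifts) / deltas)

# Fraction of grids in which x and y share a bin, i.e. the normalized inner
# product of the binary (one-hot per grid) feature vectors.
estimate = np.mean(bins_x == bins_y)
exact = np.exp(-abs(x - y) / sigma)
print(f"exact Laplacian kernel: {exact:.4f}")
print(f"binning estimate:       {estimate:.4f}")
```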

At NIPS 2006, deep learning had just become competitive with linear models like PCA and linear SVMs for large datasets, and people speculated about whether it could compete with kernel SVMs. However, there was no way to train kernel SVMs on large datasets, and the two authors developed the random feature method to do so.

Attempting to discover what caused this led to the subsequent two papers.