Kernel Fisher discriminant analysis

In statistics, kernel Fisher discriminant analysis (KFD),[1] also known as generalized discriminant analysis[2] and kernel discriminant analysis,[3] is a kernelized version of linear discriminant analysis (LDA).

It is named after Ronald Fisher.

Intuitively, the idea of LDA is to find a projection where class separation is maximized.

Given two sets of labeled data, $\mathbf{C}_1$ and $\mathbf{C}_2$, we can calculate the mean value of each class, $\mathbf{m}_1$ and $\mathbf{m}_2$, as

$$\mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,$$

where $l_i$ is the number of examples of class $\mathbf{C}_i$.

The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small.

This is formulated as maximizing, with respect to $\mathbf{w}$, the following ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},$$

where $\mathbf{S}_B$ is the between-class covariance matrix and $\mathbf{S}_W$ is the total within-class covariance matrix:

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\text{T}}, \qquad \mathbf{S}_W = \sum_{i=1,2}\sum_{n=1}^{l_i}(\mathbf{x}_n^i - \mathbf{m}_i)(\mathbf{x}_n^i - \mathbf{m}_i)^{\text{T}}.$$

The maximum of the above ratio is attained at

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1),$$

as can be shown by the Lagrange multiplier method (sketch of proof): maximizing $J(\mathbf{w})$ is equivalent to maximizing $\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}$ subject to $\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1$, i.e. to maximizing

$$I = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda(\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1),$$

where $\lambda$ is the Lagrange multiplier. Setting the derivative with respect to $\mathbf{w}$ to zero yields

$$\mathbf{S}_B\mathbf{w} = \lambda\mathbf{S}_W\mathbf{w},$$

which, since $\mathbf{S}_B\mathbf{w}$ always points in the direction of $\mathbf{m}_2 - \mathbf{m}_1$, is trivially satisfied by

$$\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1) \qquad \text{and} \qquad \lambda = (\mathbf{m}_2 - \mathbf{m}_1)^{\text{T}}\mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1).$$
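As a concrete illustration of this closed-form solution, the following is a minimal NumPy sketch (the array names `X1`, `X2` and the unit-norm scaling are illustrative choices, not part of the formulation above):

```python
import numpy as np

def lda_direction(X1, X2):
    """Two-class LDA direction, proportional to S_W^{-1} (m2 - m1).

    X1, X2 : arrays of shape (l1, d) and (l2, d), one example per row.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Total within-class covariance matrix S_W (sum of per-class scatters).
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Maximizer of J(w); solve the linear system rather than forming S_W^{-1}.
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)
```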

To extend LDA to non-linear mappings, the data, given as the $l$ points $\mathbf{x}_i$, can be mapped to a new feature space, $F$, via some function $\phi$. In this new feature space, the function that needs to be maximized is[1]

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}},$$

where

$$\mathbf{S}_B^{\phi} = (\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})(\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})^{\text{T}}, \qquad \mathbf{S}_W^{\phi} = \sum_{i=1,2}\sum_{n=1}^{l_i}(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})^{\text{T}},$$

and

$$\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi(\mathbf{x}_j^i).$$

Further, note that $\mathbf{w} \in F$.

Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using kernel functions, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y})$.
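For example, a Gaussian radial basis function kernel is one common choice of $k$; the sketch below (the function name and bandwidth parameter are only for this illustration) evaluates it over two sets of points, yielding the Gram matrix of implicit feature-space dot products:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), standing in for phi(x) . phi(y).

    X : (n, d) array, Y : (m, d) array; returns the (n, m) Gram matrix.
    """
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```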

LDA can be reformulated in terms of dot products by first noting that $\mathbf{w}$ will have an expansion of the form[5]

$$\mathbf{w} = \sum_{i=1}^{l}\alpha_i\,\phi(\mathbf{x}_i).$$

Then note that

$$\mathbf{w}^{\text{T}}\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j\,k(\mathbf{x}_j,\mathbf{x}_k^i) = \mathbf{\alpha}^{\text{T}}\mathbf{M}_i, \qquad \text{where} \qquad (\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k(\mathbf{x}_j,\mathbf{x}_k^i).$$

The numerator of $J(\mathbf{w})$ can then be written as

$$\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w} = \mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}, \qquad \text{where} \qquad \mathbf{M} = (\mathbf{M}_2 - \mathbf{M}_1)(\mathbf{M}_2 - \mathbf{M}_1)^{\text{T}},$$

and the denominator as

$$\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w} = \mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}, \qquad \text{where} \qquad \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j(\mathbf{I} - \mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}},$$

with the $(n,m)^{\text{th}}$ component of $\mathbf{K}_j$ defined as $k(\mathbf{x}_n,\mathbf{x}_m^j)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting out with the expression for $\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}$, substituting the expansion of $\mathbf{w}$, and using the definitions of $\mathbf{S}_W^{\phi}$ and $\mathbf{m}_i^{\phi}$.
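Assuming the per-class kernel blocks $\mathbf{K}_j$ have already been evaluated (e.g. with a kernel such as the RBF sketch above), the vectors $\mathbf{M}_i$ and the matrix $\mathbf{N}$ could be assembled as follows; the helper name `kfd_matrices` is hypothetical:

```python
import numpy as np

def kfd_matrices(K_list):
    """Build the vectors M_i and the matrix N from per-class kernel blocks.

    K_list[i] : (l, l_i) array whose (n, m) entry is k(x_n, x_m^i), i.e. kernel
    values between all l training points and the l_i points of class i.
    """
    l = K_list[0].shape[0]
    M = [K.mean(axis=1) for K in K_list]          # (M_i)_j = (1/l_i) sum_k k(x_j, x_k^i)
    N = np.zeros((l, l))
    for K in K_list:
        l_i = K.shape[1]
        centering = np.eye(l_i) - np.full((l_i, l_i), 1.0 / l_i)   # I - 1_{l_i}
        N += K @ centering @ K.T
    return M, N
```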

With these equations for the numerator and denominator of $J(\mathbf{w})$, the ratio can be rewritten as

$$J(\mathbf{\alpha}) = \frac{\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}}{\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}}.$$

Then, differentiating and setting equal to zero gives

$$(\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha})\mathbf{N}\mathbf{\alpha} = (\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha})\mathbf{M}\mathbf{\alpha}.$$

Since only the direction of $\mathbf{w}$, and hence of $\mathbf{\alpha}$, matters, the above can be solved for $\mathbf{\alpha}$ as

$$\mathbf{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2 - \mathbf{M}_1).$$

Note that in practice $\mathbf{N}$ is usually singular, and so a multiple of the identity is added to it:[1]

$$\mathbf{N}_{\epsilon} = \mathbf{N} + \epsilon\mathbf{I}.$$

Given the solution for $\mathbf{\alpha}$, the projection of a new data point is given by[1]

$$y(\mathbf{x}) = (\mathbf{w}\cdot\phi(\mathbf{x})) = \sum_{i=1}^{l}\alpha_i\,k(\mathbf{x}_i,\mathbf{x}).$$
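Continuing the sketch above (reusing the hypothetical `kfd_matrices` helper, with `eps` playing the role of the regularizer $\epsilon$), the coefficients $\mathbf{\alpha}$ and the projection of a new point might be computed as:

```python
import numpy as np

def kfd_two_class(K1, K2, eps=1e-3):
    """Two-class KFD coefficients: alpha = (N + eps I)^{-1} (M_2 - M_1).

    K1, K2 : (l, l_1) and (l, l_2) kernel blocks against classes 1 and 2.
    """
    (M1, M2), N = kfd_matrices([K1, K2])
    alpha = np.linalg.solve(N + eps * np.eye(N.shape[0]), M2 - M1)
    return alpha

def project(alpha, k_new):
    """Projection y(x) = sum_i alpha_i k(x_i, x); k_new[i] holds k(x_i, x)."""
    return alpha @ k_new
```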

The extension to cases where there are more than two classes is relatively straightforward. Let $c$ be the number of classes. Then multi-class KFD involves projecting the data into a $(c-1)$-dimensional space using $(c-1)$ discriminant functions

$$y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}), \qquad i = 1,\ldots,c-1.$$

This can be written in matrix notation

$$\mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),$$

where the $\mathbf{w}_i$ are the columns of $\mathbf{W}$. Further, the between-class covariance matrix is now

$$\mathbf{S}_B^{\phi} = \sum_{i=1}^{c} l_i(\mathbf{m}_i^{\phi} - \mathbf{m}^{\phi})(\mathbf{m}_i^{\phi} - \mathbf{m}^{\phi})^{\text{T}},$$

where $\mathbf{m}^{\phi}$ is the mean of all the data in the new feature space.

The within-class covariance matrix is

$$\mathbf{S}_W^{\phi} = \sum_{i=1}^{c}\sum_{n=1}^{l_i}(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})^{\text{T}}.$$

The solution is now obtained by maximizing

$$J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{W}\right|}.$$

The kernel trick can again be used, and the goal of multi-class KFD becomes[7]

$$\mathbf{A}^{*} = \underset{\mathbf{A}}{\operatorname{argmax}}\,\frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},$$

where $\mathbf{A} = [\mathbf{\alpha}_1,\ldots,\mathbf{\alpha}_{c-1}]$ and

$$\mathbf{M} = \sum_{j=1}^{c} l_j(\mathbf{M}_j - \mathbf{M}_{*})(\mathbf{M}_j - \mathbf{M}_{*})^{\text{T}}, \qquad \mathbf{N} = \sum_{j=1}^{c}\mathbf{K}_j(\mathbf{I} - \mathbf{1}_{l_j})\mathbf{K}_j^{\text{T}}.$$

Here the $\mathbf{M}_j$ are defined as in the two-class case and $(\mathbf{M}_{*})_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k)$. $\mathbf{A}^{*}$ can then be computed by finding the leading $(c-1)$ eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$.
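A corresponding NumPy/SciPy sketch of the multi-class case, assuming precomputed kernel blocks and treating the regularizer `eps` and all names as illustrative, solves the generalized eigenproblem for the leading $(c-1)$ eigenvectors:

```python
import numpy as np
from scipy.linalg import eigh

def kfd_multiclass(K_list, K_full, eps=1e-3):
    """Columns of the returned A are alpha_1, ..., alpha_{c-1}.

    K_list[j] : (l, l_j) kernel block against class j.
    K_full    : (l, l) kernel matrix over all training points.
    """
    c, l = len(K_list), K_full.shape[0]
    M_star = K_full.mean(axis=1)                  # (M_*)_j = (1/l) sum_k k(x_j, x_k)
    M = np.zeros((l, l))
    N = np.zeros((l, l))
    for K in K_list:
        l_j = K.shape[1]
        d = (K.mean(axis=1) - M_star)[:, None]    # M_j - M_*
        M += l_j * (d @ d.T)
        N += K @ (np.eye(l_j) - np.full((l_j, l_j), 1.0 / l_j)) @ K.T
    # Generalized symmetric eigenproblem M a = lambda (N + eps I) a;
    # eigh returns ascending eigenvalues, so keep the last c-1 eigenvectors.
    _, vecs = eigh(M, N + eps * np.eye(l))
    return vecs[:, -(c - 1):]
```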

In both two-class and multi-class KFD, the class label of a new input can be assigned as[7]

$$f(\mathbf{x}) = \arg\min_j D(y(\mathbf{x}), \bar{\mathbf{y}}_j),$$

where $\bar{\mathbf{y}}_j$ is the projected mean for class $j$ and $D(\cdot,\cdot)$ is some distance function.
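A minimal sketch of this rule, with the Euclidean norm standing in for the unspecified distance $D(\cdot,\cdot)$:

```python
import numpy as np

def classify(y_new, class_means):
    """Return the index j of the projected class mean closest to y_new."""
    return int(np.argmin([np.linalg.norm(y_new - m) for m in class_means]))
```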

Kernel discriminant analysis has been used in a variety of applications.