Projection pursuit regression

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle that extends additive models.

PPR adapts additive models by first projecting the data matrix of explanatory variables onto optimal directions and then applying smoothing functions to the projected variables.

The model consists of linear combinations of ridge functions: non-linear transformations of linear combinations of the explanatory variables.

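The basic model takes the form

$$y_i = \sum_{j=1}^{r} f_j(\beta_j^{T} x_i) + \varepsilon_i,$$

where $x_i$ is the $i$-th row of the design matrix, containing the $p$ explanatory variables for example $i$ (treated as a vector of length $p$ in the inner product), $y_i$ is the scalar ($1 \times 1$) response to be predicted, $\varepsilon_i$ is an error term, $\{\beta_j\}$ is a collection of $r$ vectors (each a unit vector of length $p$) that contain the unknown parameters, $\{f_j\}$ is a collection of $r$ initially unknown smooth functions that map from $\mathbb{R}$ to $\mathbb{R}$, and $r$ is a hyperparameter.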
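As an illustration of the model form, the following minimal sketch evaluates a sum of ridge functions for a small data matrix. The function name `ppr_predict` and the example directions and functions are hypothetical choices made for this sketch; in a fitted model the directions and functions would be estimated from data.

```python
import numpy as np

def ppr_predict(X, betas, funcs):
    """Evaluate sum_j f_j(x_i . beta_j) for every row x_i of X."""
    X = np.asarray(X, dtype=float)
    y_hat = np.zeros(X.shape[0])
    for beta, f in zip(betas, funcs):
        beta = np.asarray(beta, dtype=float)
        beta = beta / np.linalg.norm(beta)  # directions are unit vectors
        y_hat += f(X @ beta)                # ridge function of the 1-D projection
    return y_hat

# Hypothetical example: p = 2 inputs, r = 2 ridge terms.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
betas = [np.array([1.0, 0.0]), np.array([3.0, 4.0])]
funcs = [np.tanh, np.square]
y_hat = ppr_predict(X, betas, funcs)
```

Each term depends on the inputs only through a single projection, which is what distinguishes a ridge function from a general function of all $p$ variables.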

Good values for r can be determined through cross-validation or a forward stage-wise strategy which stops when the model fit cannot be significantly improved.

As $r$ approaches infinity, and with an appropriate set of functions $\{f_j\}$, the PPR model is a universal approximator: it can approximate any continuous function on $\mathbb{R}^p$.

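For a given set of data $\{(y_i, x_i)\}_{i=1}^{n}$, the goal is to minimize the error function

$$S = \sum_{i=1}^{n} \left[ y_i - \sum_{j=1}^{r} f_j(\beta_j^{T} x_i) \right]^2$$

over the functions $f_j$ and vectors $\beta_j$. The pairs are fitted one at a time. Consider each $(f_j, \beta_j)$ pair individually: let all other parameters be fixed, and find a "residual", the part of the output not accounted for by those other parameters, given by

$$r_i = y_i - \sum_{l \ne j} f_l(\beta_l^{T} x_i).$$

The task of minimizing the error function now reduces to solving

$$\min_{f_j, \beta_j} S' = \min_{f_j, \beta_j} \sum_{i=1}^{n} \left[ r_i - f_j(\beta_j^{T} x_i) \right]^2$$

for each $j$ in turn. Typically, new $(f_j, \beta_j)$ pairs are added to the model in a forward stage-wise fashion.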
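The residual for a single pair, and the reduced squared-error objective it leads to, can be sketched as follows. The helper names `partial_residual` and `single_pair_error` are hypothetical, and the toy directions and functions are chosen only so the result is easy to check by hand.

```python
import numpy as np

def partial_residual(X, y, betas, funcs, j):
    """Return r_i = y_i - sum over l != j of f_l(x_i . beta_l)."""
    r = np.asarray(y, dtype=float).copy()
    for l, (beta, f) in enumerate(zip(betas, funcs)):
        if l != j:
            r -= f(X @ beta)  # remove the contribution of every other pair
    return r

def single_pair_error(X, r, beta, f):
    """Squared error of one ridge term against the residual."""
    return float(np.sum((r - f(X @ beta)) ** 2))

# Toy data generated exactly by two ridge terms.
X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])
betas = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
funcs = [np.square, np.sin]
y = np.square(X @ betas[0]) + np.sin(X @ betas[1])

r0 = partial_residual(X, y, betas, funcs, j=0)          # what term 0 must explain
err0 = single_pair_error(X, r0, betas[0], funcs[0])     # zero for the true pair
```

Because the toy responses were built from the same two terms, the residual for the first pair is exactly what that pair produces, so its single-pair error vanishes.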

Aside: Previously fitted pairs can be readjusted after new fit-pairs are determined by an algorithm known as backfitting, which entails reconsidering a previous pair, recalculating the residual given how other pairs have changed, refitting to account for that new information, and then cycling through all fit-pairs this way until parameters converge.

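This process typically results in a model that performs better with fewer fit-pairs, though it takes longer to train, and it is usually possible to achieve the same performance by skipping backfitting and simply adding more fits to the model (increasing $r$).

Solving the simplified error function to determine an $(f_j, \beta_j)$ pair can be done with alternating optimization, where first a random $\beta_j$ is used to project $X$ into one-dimensional space, and then the optimal $f_j$ is found to describe the relationship between that projection and the residuals via any scatter plot regression method. Then, if $f_j$ is held constant and assumed to be once differentiable, the optimal updated weights $\beta_j$ can be found via the Gauss–Newton method, a quasi-Newton method in which the part of the Hessian involving the second derivative is discarded. To derive this, first Taylor expand

$$f_j(\beta_j^{T} x_i) \approx f_j(\beta_{j,\text{old}}^{T} x_i) + f_j'(\beta_{j,\text{old}}^{T} x_i)\,(\beta_j^{T} x_i - \beta_{j,\text{old}}^{T} x_i),$$

then plug the expansion back into the simplified error function $S'$ and do some algebraic manipulation to put it in the form

$$\min_{\beta_j} S' \approx \sum_{i=1}^{n} \underbrace{f_j'(\beta_{j,\text{old}}^{T} x_i)^2}_{w} \left[ \underbrace{\left( \beta_{j,\text{old}}^{T} x_i + \frac{r_i - f_j(\beta_{j,\text{old}}^{T} x_i)}{f_j'(\beta_{j,\text{old}}^{T} x_i)} \right)}_{\hat{b}} - \beta_j^{T} x_i \right]^2.$$

This is a weighted least squares problem. If we solve for all weights $w$ and put them in a diagonal matrix $W$, stack all the new targets $\hat{b}$ into a vector, and use the full data matrix $X$ instead of a single example $x_i$, then the optimal $\beta_j$ is given by the closed form

$$\underset{\beta_j}{\operatorname{arg\,min}} \left\| \hat{b} - X \beta_j \right\|_{W}^{2} = (X^{T} W X)^{-1} X^{T} W \hat{b}.$$

Use this updated $\beta_j$ to find a new projection of $X$ and refit $f_j$ to the new scatter plot. Then use that new $f_j$ to update $\beta_j$ by re-solving the above, and continue this alternating process until $(f_j, \beta_j)$ converges.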
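The alternating scheme for a single pair can be sketched in Python as below. Several choices here are assumptions of this sketch rather than part of the method as described: a cubic polynomial (via `np.polyfit`) stands in for the scatter plot smoother, the weighted least squares problem is solved in an equivalent row-scaled form that avoids dividing by the derivative, and a simple step-halving safeguard is added so the squared error never increases. The function names are hypothetical.

```python
import numpy as np

def fit_smoother(v, r, degree=3):
    """Assumed smoother: polynomial least squares. Returns (f, f_prime)."""
    coefs = np.polyfit(v, r, degree)
    return np.poly1d(coefs), np.poly1d(np.polyder(coefs))

def sse(X, r, beta, degree):
    """Squared error after refitting the smoother along direction beta."""
    v = X @ beta
    f, _ = fit_smoother(v, r, degree)
    return float(np.sum((r - f(v)) ** 2))

def fit_ridge_term(X, r, degree=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=X.shape[1])      # random starting direction
    beta /= np.linalg.norm(beta)
    history = []
    for _ in range(n_iter):
        v = X @ beta                        # 1-D projection of the data
        f, f_prime = fit_smoother(v, r, degree)
        loss = float(np.sum((r - f(v)) ** 2))
        history.append(loss)
        # Gauss-Newton step: weights f'(v)^2, targets v + (r - f(v)) / f'(v).
        # Scaling rows by f'(v) gives the same least squares problem
        # without the division.
        d = f_prime(v)
        A = d[:, None] * X
        t = d * v + (r - f(v))
        beta_gn, *_ = np.linalg.lstsq(A, t, rcond=None)
        # Step halving (an added safeguard): accept only improving updates.
        step, accepted = 1.0, False
        while step > 1e-4:
            cand = beta + step * (beta_gn - beta)
            norm = np.linalg.norm(cand)
            if norm > 0:
                cand = cand / norm          # keep the direction a unit vector
                if sse(X, r, cand, degree) < loss:
                    beta, accepted = cand, True
                    break
            step *= 0.5
        if not accepted:
            break                           # no improving step found: stop
    v = X @ beta
    f, _ = fit_smoother(v, r, degree)
    history.append(float(np.sum((r - f(v)) ** 2)))
    return beta, f, history

# Synthetic residuals generated from a single hypothetical direction.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 2.0]) / 3.0
r = np.sin(X @ true_beta) + 0.05 * rng.normal(size=200)
beta, f, history = fit_ridge_term(X, r)
```

The returned `history` records the squared error at each accepted iterate, which gives a simple way to monitor convergence. The sign of the fitted direction is not identified, since flipping `beta` and mirroring `f` gives the same fit.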

It has been shown that the convergence rate, the bias and the variance are affected by the estimation of $\beta_j$ and $f_j$.

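The PPR model takes the form of a basic additive model but with the additional $\beta_j$ component, so each $f_j$ fits a scatter plot of the projection $X \beta_j$ vs the residual (unexplained variance) during training rather than using the raw inputs themselves. This constrains the problem of finding each $f_j$ to low dimension, making it solvable with common least squares or spline fitting methods and sidestepping the curse of dimensionality during training. Because $f_j$ is applied to a projection of $X$, the result looks like a "ridge" orthogonal to the projection dimension, so the $\{f_j\}$ are often called "ridge functions". The directions $\beta_j$ are chosen to optimize the fit of their corresponding ridge functions.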

Note that because PPR attempts to fit projections of the data, it can be difficult to interpret the fitted model as a whole, because each input variable has been accounted for in a complex and multifaceted way.

This can make the model more useful for prediction than for understanding the data, though visualizing individual ridge functions and considering which projections the model is discovering can yield some insight.

Both projection pursuit regression and fully connected neural networks with a single hidden layer project the input vector onto one-dimensional directions, apply a nonlinear transformation to each projection, and then add the results in a linear fashion.

Thus both follow the same steps to overcome the curse of dimensionality.

The main difference is that the functions $f_j$ being fitted in PPR can be different for each combination of input variables and are estimated one at a time and then updated with the weights, whereas in a neural network (NN) these are all specified upfront and estimated simultaneously.

Thus, in PPR the transformations of the variables are data driven, whereas in a single-hidden-layer neural network these transformations are fixed.
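The structural similarity can be checked numerically. The sketch below, under the assumption of a tanh activation and no output bias, rewrites a single-hidden-layer network as a sum of ridge functions in which every function has the same fixed shape and differs only in scale and shift. All names are hypothetical and the weights are random rather than trained.

```python
import numpy as np

def nn_forward(X, W, b, a):
    """Single hidden layer with fixed tanh activation and linear output."""
    return np.tanh(X @ W.T + b) @ a

def nn_as_ridge_functions(W, b, a):
    """Rewrite each hidden unit as a unit direction plus a 1-D function."""
    betas, funcs = [], []
    for w_j, b_j, a_j in zip(W, b, a):
        scale = np.linalg.norm(w_j)
        betas.append(w_j / scale)  # unit-length projection direction
        # The ridge function is a scaled, shifted tanh: its shape is fixed
        # in advance, unlike the data-driven f_j of PPR.
        funcs.append(lambda v, s=scale, b_j=b_j, a_j=a_j: a_j * np.tanh(s * v + b_j))
    return betas, funcs

def ridge_sum(X, betas, funcs):
    """Sum of ridge functions, as in the PPR model form."""
    return sum(f(X @ beta) for beta, f in zip(betas, funcs))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, p = 3 inputs
W = rng.normal(size=(4, 3))   # 4 hidden units
b = rng.normal(size=4)
a = rng.normal(size=4)

betas, funcs = nn_as_ridge_functions(W, b, a)
same = np.allclose(nn_forward(X, W, b, a), ridge_sum(X, betas, funcs))
```

In PPR the entries of `funcs` would instead be fitted smoothers, one per direction, rather than members of a single parametric family.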