In statistics, sufficient dimension reduction (SDR) is a paradigm for analyzing data that combines the ideas of dimension reduction with the concept of sufficiency.
Dimension reduction has long been a primary goal of regression analysis.
Given a response variable y and a p-dimensional predictor vector x, regression analysis aims to study the distribution of y | x, the conditional distribution of y given x. A dimension reduction is a function R(x) that maps x to a subset of R^k, k < p, thereby reducing the dimension of x; the reduction is sufficient if the distribution of y | R(x) is the same as that of y | x. In other words, no information about the regression is lost in reducing the dimension of x if the reduction is sufficient.
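As a concrete illustration of this definition, the following sketch (in Python with NumPy; the model, coefficient vector, and all names are illustrative assumptions, not taken from the source) simulates a regression in which y depends on a four-dimensional predictor x only through one linear combination b^T x, so that R(x) = b^T x is a sufficient dimension reduction from R^4 to R:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 4
x = rng.standard_normal((n, p))

# Hypothetical model: y depends on x only through the single linear
# combination b^T x, so R(x) = b^T x is a sufficient dimension reduction.
b = np.array([1.0, -1.0, 0.0, 0.0])
y = np.sin(x @ b) + 0.1 * rng.standard_normal(n)

# The reduced predictor R(x): one column instead of four, yet the
# conditional distribution of y given R(x) matches that of y given x.
r = x @ b
```

A scatterplot of y against r would display all of the regression information carried by the four original predictors.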
[1] In a regression setting, it is often useful to summarize the distribution of y | x graphically, for instance with a scatterplot of y versus one or more of the predictors or linear combinations of them.
A scatterplot that contains all available regression information is called a sufficient summary plot.
When x is high-dimensional, it becomes increasingly challenging to construct and visually interpret sufficient summary plots without reducing the data.
Even three-dimensional scatter plots must be viewed via a computer program, and the third dimension can only be visualized by rotating the coordinate axes.
Given a sufficient dimension reduction R(x) with small enough dimension, a sufficient summary plot of y versus R(x)
may be constructed and visually interpreted with relative ease.
Hence sufficient dimension reduction allows for graphical intuition about the distribution of y | x, which might not otherwise be available for high-dimensional data.
Most graphical methodology focuses primarily on dimension reduction involving linear combinations of x.
The rest of this article deals only with such reductions.
Suppose R(x) = A^T x is a sufficient dimension reduction, where A is a p × k matrix with rank k ≤ p. Without loss of generality, only the space spanned by the columns of A need be considered. It follows from the definition of a sufficient dimension reduction that y is independent of x given A^T x; when this condition holds, the column space S(A) is defined to be a dimension reduction subspace (DRS).
The structural dimension of the regression, d, is the smallest number of distinct linear combinations of x necessary to preserve the conditional distribution of y | x. In other words, the smallest dimension reduction that is still sufficient maps x into a subset of R^d.
A subspace S is said to be a minimum DRS for y | x if it is a DRS and its dimension is less than or equal to that of all other DRSs for y | x.
If the central subspace S_{y|x}, defined as the intersection of all dimension reduction subspaces when that intersection is itself a DRS, does exist, then it is also the unique minimum dimension reduction subspace. While the existence of S_{y|x} is not guaranteed in every regression situation, there are some rather broad conditions under which its existence follows directly.
[2] There are many existing methods for dimension reduction, both graphical and numeric.
For example, sliced inverse regression (SIR) and sliced average variance estimation (SAVE) were introduced in the 1990s and continue to be widely used.
[3] Although SIR was originally designed to estimate an effective dimension reducing subspace, it is now understood that it estimates only the central subspace, which is generally different.
More recent methods for dimension reduction include likelihood-based sufficient dimension reduction,[4] estimating the central subspace based on the inverse third moment (or kth moment),[5] estimating the central solution space,[6] graphical regression,[2] the envelope model, and the principal support vector machine.
[7] For more details on these and other methods, consult the statistical literature.
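To illustrate how a slicing-based method such as SIR operates, here is a minimal sketch in Python/NumPy (the function name, slicing scheme, and parameter choices are illustrative assumptions, not a reference implementation): standardize the predictors, slice on the ordered response, and take the leading eigenvectors of the weighted covariance of the within-slice means.

```python
import numpy as np

def sir(x, y, n_slices=10, d=1):
    """Sketch of sliced inverse regression (SIR): estimate a basis for
    the central subspace from slice means of the standardized predictors."""
    n, p = x.shape
    # Standardize (whiten) the predictors via the inverse square root
    # of their sample covariance.
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - mu) @ inv_sqrt
    # Slice the data on the order statistics of y and average z per slice.
    order = np.argsort(y)
    m = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        mean_h = z[idx].mean(axis=0)
        m += (len(idx) / n) * np.outer(mean_h, mean_h)
    # Leading eigenvectors of m span the estimated directions in the
    # standardized scale; map them back to the original x-scale.
    w, v = np.linalg.eigh(m)
    directions = inv_sqrt @ v[:, ::-1][:, :d]
    return directions / np.linalg.norm(directions, axis=0)
```

On simulated data in which y depends on x through a single linear combination, the leading estimated direction typically aligns closely with that combination.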
Principal components analysis (PCA) and similar methods for dimension reduction are not based on the sufficiency principle.
Consider the linear regression model y = α + β^T x + ε, where ε is independent of x. Note that the distribution of y | x is the same as the distribution of y | β^T x, so the span of β is a dimension reduction subspace and the structural dimension of the regression is d = 1 (provided β ≠ 0). The ordinary least squares estimate β̂ of β is consistent, and so a plot of y versus β̂^T x is a sufficient summary plot for this regression.
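For a linear regression, the horizontal axis of this sufficient summary plot can be built directly from the OLS fit. A minimal sketch under simulated data (all names and the simulation setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 6
x = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = 1.0 + x @ beta + 0.5 * rng.standard_normal(n)

# OLS fit with an intercept; the slope vector estimates a basis
# for the one-dimensional dimension reduction subspace span(beta).
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_hat = coef[1:]

# Horizontal axis of the sufficient summary plot: plot y against this.
reduced = x @ beta_hat
```

A two-dimensional scatterplot of y against reduced then carries, asymptotically, all of the regression information in the six-dimensional predictor.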