Knockoffs (statistics)

In statistics, the knockoff filter, or simply knockoffs, is a framework for variable selection.

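It was originally introduced for linear regression by Rina Barber and Emmanuel Candès,[1] and later generalized to other regression models in the random design setting.[2]

The knockoff filter has found application in many practical areas, notably in genome-wide association studies.[2][3]

Consider a linear regression model with response vector $y$ and feature matrix $X$, which is treated as deterministic. A matrix $\tilde{X}$ is said to be knockoffs of $X$ if it does not depend on $y$ and satisfies $\tilde{X}^\top \tilde{X} = X^\top X$ and $X_j^\top \tilde{X}_k = X_j^\top X_k$ for $j \neq k$.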

Barber and Candès showed that, equipped with a suitable feature importance statistic, fixed-X knockoffs can be used for variable selection while controlling the false discovery rate (FDR).
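As a concrete illustration, one known way to build fixed-X knockoffs is the equicorrelated construction described by Barber and Candès. The sketch below is a minimal version under stated assumptions: there are at least twice as many observations as features ($n \ge 2p$), the feature matrix has full column rank, and its columns are rescaled to unit norm. The function name `fixed_x_knockoffs` and the demo data are hypothetical. The output should satisfy $\tilde{X}^\top \tilde{X} = X^\top X$ and $X_j^\top \tilde{X}_k = X_j^\top X_k$ for $j \neq k$.

```python
import numpy as np

def fixed_x_knockoffs(X, rng):
    """Equicorrelated fixed-X knockoffs (sketch; assumes n >= 2p, full column rank)."""
    n, p = X.shape
    if n < 2 * p:
        raise ValueError("this sketch assumes n >= 2p")
    X = X / np.linalg.norm(X, axis=0)            # rescale columns to unit norm
    Sigma = X.T @ X                              # Gram matrix with unit diagonal
    s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0)
    S = s * np.eye(p)
    Sigma_inv_S = np.linalg.solve(Sigma, S)
    # C must satisfy C^T C = 2S - S Sigma^{-1} S (positive semidefinite by choice of s)
    G = 2.0 * S - S @ Sigma_inv_S
    w, V = np.linalg.eigh((G + G.T) / 2)
    C = np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T
    # U: p orthonormal columns orthogonal to the column span of X
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U = Q[:, p:]
    X_tilde = X @ (np.eye(p) - Sigma_inv_S) + U @ C
    return X, X_tilde

# Hypothetical demo data: 50 observations, 5 features
rng = np.random.default_rng(0)
X_norm, X_tilde = fixed_x_knockoffs(rng.standard_normal((50, 5)), rng)
```

The matrix square root is taken through an eigendecomposition with clipping rather than a Cholesky factorization, because the chosen `s` makes the target matrix singular or nearly so.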

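Consider a general regression model with response vector $y$ and random feature matrix $X$. A matrix $\tilde{X}$ is said to be knockoffs of $X$ if it is conditionally independent of $y$ given $X$ and satisfies a subtle pairwise exchangeability condition: for any $j$, the joint distribution of the random matrix $[X, \tilde{X}]$ does not change if its $j$th and $(j+p)$th columns are swapped, where $p$ is the number of features.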
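One special case where the exchangeability condition can be met in closed form is when the rows of the feature matrix are multivariate Gaussian with known mean and covariance. The sketch below assumes exactly that, plus a covariance with unit diagonal and an equicorrelated choice of the free parameter `s`. The function name `gaussian_knockoffs` and the demo covariance are hypothetical. Each knockoff row is drawn from a Gaussian conditional distribution chosen so that the joint covariance of the original and knockoff features is unchanged by swapping a feature with its knockoff.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng):
    """Sample model-X knockoffs assuming rows of X are N(mu, Sigma), unit-diagonal Sigma."""
    n, p = X.shape
    s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0)   # equicorrelated choice
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    cond_mean = X - (X - mu) @ Sigma_inv_D             # E[knockoff row | original row]
    cond_cov = 2.0 * D - D @ Sigma_inv_D               # Cov[knockoff row | original row]
    w, V = np.linalg.eigh((cond_cov + cond_cov.T) / 2)
    L = V * np.sqrt(np.clip(w, 0.0, None))             # L @ L.T equals cond_cov
    return cond_mean + rng.standard_normal((n, p)) @ L.T

# Hypothetical demo: 3 features with an AR(1)-style correlation matrix
rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100_000)
X_tilde = gaussian_knockoffs(X, mu, Sigma, rng)
```

Under these assumptions the sample covariance of the stacked matrix should be close to a block matrix with `Sigma` on the diagonal blocks and `Sigma - s * I` on the off-diagonal blocks.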

While it is less clear how to create model-X knockoffs than their fixed-X counterparts, various algorithms have been proposed to construct them.[2][3][4][5]

Once constructed, model-X knockoffs can be used for variable selection following the same procedure as fixed-X knockoffs, and they likewise control the FDR.
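The selection step shared by both settings can be sketched as follows. The code assumes each feature already has an importance statistic `W[j]`, where a large positive value suggests the original feature matters more than its knockoff, and it applies the data-dependent threshold usually called knockoff+. The name `knockoff_plus_select`, the statistics in the demo, and the target level `q = 0.2` are hypothetical.

```python
import numpy as np

def knockoff_plus_select(W, q):
    """Return indices selected by the knockoff+ threshold at target FDR level q."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes of W, smallest first
    for t in np.sort(np.abs(W[W != 0])):
        # Negative statistics stand in for the unknown number of false positives
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)                 # no threshold meets the target

# Hypothetical importance statistics for 10 features
W = [5.0, 4.2, -0.3, 3.1, 0.4, 2.7, -0.1, 3.8, 1.9, 2.2]
selected = knockoff_plus_select(W, q=0.2)
```

In this example the first threshold that meets the target is 0.4, so every feature with a statistic of at least 0.4 is selected.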

Knockoffs $\tilde{X}$ can be understood as negative controls.

Informally speaking, knockoffs have the property that no method can statistically distinguish the original matrix from its knockoffs without looking at $y$.

Mathematically, the exchangeability conditions translate into a symmetry that makes it possible to estimate the type I error; for example, if the FDR is chosen as the type I error rate, the false discovery proportion is estimated. This estimate then leads to exact type I error control.
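The symmetry argument can be written out for the FDR case. The following is a sketch in assumed notation: $W_j$ denotes a feature importance statistic whose sign is equally likely to be positive or negative when feature $j$ is null, $t > 0$ is a selection threshold, and $q$ is the target FDR level.

```latex
\#\{j \text{ null} : W_j \ge t\}
  \;\overset{d}{\approx}\; \#\{j \text{ null} : W_j \le -t\}
  \;\le\; \#\{j : W_j \le -t\},
\qquad
\widehat{\mathrm{FDP}}(t)
  = \frac{1 + \#\{j : W_j \le -t\}}{\max\bigl(1,\ \#\{j : W_j \ge t\}\bigr)},
\qquad
\tau = \min\bigl\{t > 0 : \widehat{\mathrm{FDP}}(t) \le q\bigr\}.
```

The number of false discoveries above the threshold is unobservable, but by symmetry it is mirrored by the observable count of statistics below $-t$. Selecting the features with $W_j \ge \tau$ then targets FDR at level $q$.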

Model-X knockoffs provide valid type I error control regardless of the unknown conditional distribution of $y$ given $X$, and they can work with black-box variable importance statistics, including those derived from complicated machine learning methods.

The most significant challenge in implementing model-X knockoffs is that it requires nontrivial knowledge of the distribution of $X$.

This knowledge can be gained with the help of unlabeled data.
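As a simple illustration of how unlabeled data can help, the sketch below pools feature rows that come with a response and rows that do not, then estimates a mean and covariance from the pooled sample. It assumes a Gaussian working model for the features, which is only one possible choice. The function name `estimate_feature_distribution` and the simulated data are hypothetical.

```python
import numpy as np

def estimate_feature_distribution(X_labeled, X_unlabeled):
    """Pool labeled and unlabeled feature rows; return sample mean and covariance."""
    X_all = np.vstack([X_labeled, X_unlabeled])
    return X_all.mean(axis=0), np.cov(X_all, rowvar=False)

# Hypothetical demo: few labeled rows, many unlabeled rows from the same distribution
rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.6],
                     [0.6, 1.0]])
X_lab = rng.multivariate_normal(np.zeros(2), true_cov, size=100)
X_unlab = rng.multivariate_normal(np.zeros(2), true_cov, size=20_000)
mu_hat, Sigma_hat = estimate_feature_distribution(X_lab, X_unlab)
```

The response is never used here, which is why rows without a response are just as informative about the feature distribution as rows with one.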