Leakage (machine learning)

Row-wise leakage is caused by improper sharing of information between rows of data.
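One simple way rows can share information is for the same row to appear on both sides of a train/test split. The following is a minimal sketch of a check for that case only, assuming rows are small tuples of hashable values; the function name `find_shared_rows` and the sample data are hypothetical.

```python
def find_shared_rows(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training rows."""
    # Convert each training row to a tuple so it can be stored in a set.
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]


# Hypothetical example data: the second training row is repeated in the test set.
train = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test = [(4.9, 3.0, "b"), (5.9, 3.0, "d")]

shared = find_shared_rows(train, test)
print(shared)  # [(4.9, 3.0, 'b')]
```

An exact-match check like this would miss near-duplicates and rows that are related without being identical, so an empty result does not rule out row-wise leakage.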

Types of row-wise leakage include:

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.

Performance-wise, unusually high accuracy or significant discrepancies between training and test results often indicate leakage.
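These two warning signs could be turned into a rough automated check, sketched below. The function name `leakage_warning_signs` and both thresholds (`max_plausible` and `max_gap`) are illustrative assumptions, not established cut-offs; suitable values depend on the task and the metric.

```python
def leakage_warning_signs(train_score, test_score, max_plausible=0.99, max_gap=0.10):
    """Return a list of warning messages based on two scores in the range [0, 1]."""
    signs = []
    # Assumed threshold: a near-perfect test score is treated as suspicious.
    if test_score >= max_plausible:
        signs.append("test score is unusually high")
    # Assumed threshold: a large train/test difference is treated as suspicious.
    if abs(train_score - test_score) > max_gap:
        signs.append("large gap between training and test scores")
    return signs


print(leakage_warning_signs(0.95, 0.80))  # ['large gap between training and test scores']
print(leakage_warning_signs(0.85, 0.83))  # []
```

A warning from such a check is only a prompt to inspect the pipeline, since a large gap can also come from ordinary overfitting and a high score can be legitimate on an easy task.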

Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation.
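As a crude first pass at finding such features, one might flag any input column that is almost perfectly correlated with the target before a model is even trained. The sketch below uses Pearson correlation as a stand-in for a proper feature-importance analysis; the function names, the `threshold` value, and the column names (`age`, `record_id`) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.95):
    """Return names of columns whose absolute correlation with the target meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical data: record IDs happen to have been assigned in blocks by outcome,
# so an identifier with no causal meaning almost perfectly predicts the target.
columns = {
    "age": [25, 40, 33, 51, 29, 46],
    "record_id": [1, 2, 3, 101, 102, 103],
}
target = [0, 0, 0, 1, 1, 1]

print(suspicious_features(columns, target))  # ['record_id']
```

Correlation only captures linear relationships with a single column, so a flagged feature still needs a domain-level judgment about whether its predictive power is plausible.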

Performance that degrades over time on new data may indicate that earlier metrics were inflated by leakage.