Data dredging

[2] The process of data dredging involves testing multiple hypotheses using a single data set by exhaustively searching—perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.

When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations.
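The effect is easy to demonstrate with a short simulation (a minimal sketch in Python; the sample size and number of variables are illustrative): generate many mutually independent random variables and test every pair for correlation, and roughly 5% of the pairs come out "significant" at the 0.05 level even though no real relationship exists.

```python
# Sketch: many hypothesis tests on pure noise yield "significant" results by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 50                     # 50 unrelated random variables
data = rng.normal(size=(n_obs, n_vars))

tests, significant = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        tests += 1
        if p < 0.05:
            significant += 1

# With 1225 pairwise tests, roughly 5% (about 60) are expected to fall
# below 0.05 even though every variable is independent noise.
print(f"{significant} of {tests} correlations have p < 0.05")
```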

If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.
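A companion sketch (under the same illustrative assumptions as above) shows why a fresh data set matters: the single "best" correlation found by dredging one sample almost always evaporates when the same pair of variables is re-tested on new data drawn from the same population.

```python
# Sketch: a dredged correlation rarely replicates on fresh data from the same population.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs, n_vars = 100, 50

def best_pair(data):
    """Return the variable pair with the smallest correlation p-value."""
    best = (0, 1, 1.0)
    for i in range(data.shape[1]):
        for j in range(i + 1, data.shape[1]):
            _, p = stats.pearsonr(data[:, i], data[:, j])
            if p < best[2]:
                best = (i, j, p)
    return best

exploratory = rng.normal(size=(n_obs, n_vars))    # data used for dredging
i, j, p_dredged = best_pair(exploratory)

confirmatory = rng.normal(size=(n_obs, n_vars))   # new data, same population
_, p_fresh = stats.pearsonr(confirmatory[:, i], confirmatory[:, j])

print(f"dredged pair ({i}, {j}): p = {p_dredged:.4f} on the exploratory data")
print(f"same pair on fresh data: p = {p_fresh:.4f}  (usually far from significant)")
```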

The statistical significance under the incorrect procedure is completely spurious—significance tests do not protect against data dredging.

Or, more succinctly, the proper calculation of a p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been.
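As a rough illustration of why the counterfactual tests matter (assuming, as an idealization, that the tests are independent): if m hypotheses are each tested at significance level α, the probability that at least one is falsely declared significant is

\[ \mathrm{FWER} = 1 - (1 - \alpha)^m, \qquad \text{for example}\quad 1 - (1 - 0.05)^{20} \approx 0.64 . \]

A researcher who quietly tries twenty analyses on pure noise therefore has roughly a two-in-three chance of finding something "significant"; corrections such as Bonferroni, which tests each hypothesis at level α/m, are one standard way of accounting for these counterfactual analyses.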

Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.

A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college.

An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were at higher risk to begin with, and so more of them had heart attacks.

A crucial step in the process is deciding which covariates to include in a regression model intended to explain one or more outcome variables.
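How much freedom covariate selection alone provides can be sketched as follows (the variables here are illustrative noise, and the search over small subsets stands in for an analyst informally trying out model specifications):

```python
# Sketch: searching over covariate subsets until one appears to "explain" a noise outcome.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_obs, n_candidates = 60, 12
X = rng.normal(size=(n_obs, n_candidates))   # unrelated candidate covariates
y = rng.normal(size=n_obs)                   # outcome is pure noise

best_p, best_subset = 1.0, None
for k in (1, 2, 3):                          # try all 1-, 2- and 3-covariate models
    for subset in combinations(range(n_candidates), k):
        fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
        if fit.f_pvalue < best_p:
            best_p, best_subset = fit.f_pvalue, subset

# Reporting only this model, without disclosing the search, is data dredging.
print(f"best subset {best_subset}: overall F-test p = {best_p:.4f}")
```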

Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis.

This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so.

[13] This study was widely reported in many media outlets around 2015, leading many people to believe the claim that eating a chocolate bar every day would cause them to lose weight, against their better judgement.

No claim of statistical significance can be made by only looking at the data, without due regard to the method used to assess them.

Academic journals are increasingly shifting to the registered report format, which aims to counteract serious problems such as data dredging and HARKing (hypothesizing after the results are known), which have made theory-testing research very unreliable.

[15] The European Journal of Personality defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available).

Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."[16]

Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place.

Figure: A humorous example of a result produced by data dredging, showing a correlation between the number of letters in the Scripps National Spelling Bee's winning word and the number of people in the United States killed by venomous spiders.
Figure: The change in p-values computed from a t-test as the sample size increases, and how early stopping can allow for p-hacking. Data are drawn from two identical normal distributions. For each sample size n, starting from 5, a t-test is performed on the first n samples from each distribution, and the resulting p-value is plotted. The red dashed line indicates the commonly used significance level of 0.05. If data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.
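The described procedure can be sketched directly (the distribution parameters, random seed, and maximum sample size below are illustrative choices, not the figure's actual settings): keep adding observations to two groups drawn from the same distribution, recompute a t-test each time, and stop as soon as the p-value dips below 0.05.

```python
# Sketch of p-hacking via early stopping: monitor the p-value as data accrue and
# stop the moment it falls below 0.05, even though both groups are drawn from
# identical distributions, so any "difference" found this way is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
max_n = 10_000
a = rng.normal(0, 10, size=max_n)   # group A, illustrative parameters
b = rng.normal(0, 10, size=max_n)   # group B, identical distribution

for n in range(5, max_n + 1):
    _, p = stats.ttest_ind(a[:n], b[:n])
    if p < 0.05:
        print(f"stopped at n = {n} with p = {p:.4f} (a spurious 'finding')")
        break
else:
    print("the p-value never fell below 0.05 for this random seed")
```

Fixing the sample size in advance, or applying a sequential-analysis correction, removes this freedom.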