How can that be? We know how to formulate and test hypotheses in controlled experiments.

We can account for unwanted variation with statistical techniques. We appreciate the need to replicate observations. Yet many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known).

My generation and the one before us have done little to rein these in.

Prejudice against null results leads to publication bias: researchers are less likely to write up studies that show no effect, and journal editors are less likely to accept them. Consequently, no one can learn from them, and researchers waste time and resources redundantly repeating experiments. That has begun to change for two reasons.

First, clinicians have realized that publication bias harms patients.

If there are 20 studies of a drug and only one shows a benefit, but that is the one that is published, we get a distorted view of drug efficacy. Second, the growing use of meta-analyses, which combine results across studies, has started to make clear that the tendency not to publish negative results gives misleading impressions. Low statistical power followed a similar trajectory.

My undergraduate statistics courses had nothing to say on statistical power, and few of us realized we should take it seriously. It is wasteful to conduct studies that are underpowered, but researchers have often treated statisticians who point this out as killjoys (Newcombe).
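To make "underpowered" concrete, here is a minimal sketch of a power calculation. It is an illustration under stated assumptions, not a method taken from this article: a two-sided comparison of two equal-sized groups, a known standard deviation (so a normal approximation rather than a t-test), and a standardized effect size. The function name `power_two_sample` and its parameters are hypothetical.

```python
from statistics import NormalDist


def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumptions (illustrative only): equal group sizes, known unit
    variance, and `effect_size` given as a standardized mean difference.
    """
    std_normal = NormalDist()
    # Critical value for a two-sided test at significance level alpha.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (10, 20, 64, 100):
        print(n, round(power_two_sample(0.5, n), 2))
```

Under these assumptions, a medium-sized effect (0.5 standard deviations) studied with 20 participants per group is detected only about 35% of the time, whereas about 64 per group are needed to reach roughly 80% power.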

In fields such as clinical trials and genetics, funders have forced improvements to working practices by insisting that studies be adequately powered. Other disciplines have yet to catch up. I stumbled on the issue of P-hacking before the term existed.

I published a sarcastic note, including a simulation to show how easy it was to find an effect if you explored the data after collecting results (D. Bishop). This practice, now known as P-hacking, was once endemic to most branches of science that rely on P values to test the significance of results, yet few people realized how seriously it could distort findings.
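The distortion is easy to reproduce. The sketch below is not the simulation from the note mentioned above; it is a minimal illustration under assumed conditions: two groups drawn from the same distribution (so no true effect exists), several independent outcome measures per study, and a researcher who reports only the smallest P value. The names `two_sample_p` and `false_positive_rate` are hypothetical, and a z-test with known variance stands in for whatever test a real study would use.

```python
import random
from statistics import NormalDist, fmean


def two_sample_p(a, b, sigma=1.0):
    """Two-sided z-test P value for a difference in means (known sigma)."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=1):
    """Share of null studies in which at least one outcome is 'significant'.

    Every study has no true effect; the analyst tests `n_outcomes`
    independent measures and keeps the best-looking one.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sample_p(a, b))
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, false_positive_rate(k))
```

With a single pre-specified outcome, the false-positive rate stays near the nominal 5%. With five independent outcomes to choose from, the expected rate is 1 - 0.95^5, or about 23%, even though nothing real is there to be found.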

That started to change with an elegant, comic paper in which the authors crafted analyses to prove that listening to the Beatles could make undergraduates younger (J. Simmons et al.).

