How can we cut down on research that doesn't replicate?

In this week’s Nature, Francis Collins and Lawrence Tabak have written a policy statement about the problem of biomedical research results that cannot be replicated by independent studies. Collins is the director of the U.S. National Institutes of Health, and the essay is an official statement of how the agency is attempting to reduce the incidence of sensationalized findings that do not have real clinical validity.

Science has long been regarded as 'self-correcting', given that it is founded on the replication of earlier work. Over the long term, that principle remains true. In the shorter term, however, the checks and balances that once ensured scientific fidelity have been hobbled. This has compromised the ability of today's researchers to reproduce others' findings.

Although Collins and Tabak do not cite his work, their essay is a response to the findings of John Ioannidis, who has famously claimed that most published research findings are false. The basic reason is statistics. In many fields, studies are published when the results pass a threshold of statistical significance (often, a probability of less than 1 in 20 that a result at least this strong would arise by chance alone). But across a whole field, everyone is selecting just these results for publication while discarding results that do not meet this threshold. Consider two studies of exactly the same phenomenon. One reaches the 1-in-20 threshold and is published. The other fails to reach the threshold – maybe a result that strong would turn up by chance once in every two trials – and is not published. If both results could be compiled together, the phenomenon would fail to meet the test. But only one of the studies is published; the other falls into obscurity. It may never leave the pages of the lab notebook. Indeed, the results may never really be compiled when it is clear they will be non-significant.
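
To put rough numbers on that two-study example, here is a minimal sketch in Python. It assumes the two studies are independent, equally weighted, and report one-sided p-values, and it pools them with Stouffer's z-score method. Both the pooling method and the helper name `stouffer_combined_p` are illustrative choices, not anything prescribed by Ioannidis or by Collins and Tabak.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def stouffer_combined_p(p_values):
    """Pool one-sided p-values from independent, equally weighted studies.

    Assumption for illustration: Stouffer's z-score method. Each p-value is
    converted to a z-score, the z-scores are summed and rescaled by
    sqrt(number of studies), and the result is converted back to a p-value.
    """
    z_scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - STD_NORMAL.cdf(combined_z)


published_p = 0.05    # the study that just reached the 1-in-20 threshold
unpublished_p = 0.5   # the study left in the lab notebook

combined_p = stouffer_combined_p([published_p, unpublished_p])
print(f"combined p-value: {combined_p:.3f}")
```

Under these assumptions the pooled p-value comes out near 0.12, well short of the 1-in-20 cutoff, even though the published study looks significant on its own.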

What happens with replication studies is that research groups suddenly have an incentive to publish negative results. They might have obtained negative results on many, many different phenomena, but those would not be interesting enough to publish. We probably wouldn’t be too concerned about any single case. But across many, many different areas of science, it appears that replication studies are consistently failing to reproduce the effects claimed by prior research. That failure reflects a systemic bias in favor of publishing “statistically significant” results. “Significance” presupposes that the studies are independent, when in fact publication of studies across a field is highly non-independent.
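
A toy simulation shows how that selection plays out. The sketch below assumes a field in which the effect under study does not exist at all, each study's test statistic is a standard normal draw, and only studies clearing a two-sided 1-in-20 cutoff get published. The function name `simulate_file_drawer`, the number of studies, and the seed are all hypothetical.

```python
import random


def simulate_file_drawer(n_studies=20000, z_cutoff=1.96, seed=1):
    """Simulate studies of an effect that does not exist.

    Each study's test statistic is a standard normal draw, so the null
    hypothesis is true in every case. A study counts as 'published' only
    if it clears the two-sided cutoff; the rest stay in the file drawer.
    Returns (published, unpublished) counts.
    """
    rng = random.Random(seed)
    published = 0
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)
        if abs(z) > z_cutoff:
            published += 1
    return published, n_studies - published


published, shelved = simulate_file_drawer()
share_published = published / (published + shelved)
print(f"published: {published}, unpublished: {shelved}")
print(f"share of studies reaching print: {share_published:.3f}")
```

Roughly 1 in 20 of these null studies clears the bar. If only those reach print, every published paper in this toy field reports an effect that is not there, which is exactly the situation a replication study then runs into.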

The Economist published an informative article about Ioannidis’ work last year, which is cited by Collins and Tabak: “Trouble at the lab”.

The pitfalls Dr Stodden points to get deeper as research increasingly involves sifting through untold quantities of data. Take subatomic physics, where data are churned out by the petabyte. It uses notoriously exacting methodological standards, setting an acceptable false-positive rate of one in 3.5m (known as the five-sigma standard). But maximising a single figure of merit, such as statistical significance, is never enough: witness the “pentaquark” saga. Quarks are normally seen only two or three at a time, but in the mid-2000s various labs found evidence of bizarre five-quark composites. The analyses met the five-sigma test. But the data were not “blinded” properly; the analysts knew a lot about where the numbers were coming from. When an experiment is not blinded, the chances that the experimenters will see what they “should” see rise. This is why people analysing clinical-trials data should be blinded to whether data come from the “study group” or the control group. When looked for with proper blinding, the previously ubiquitous pentaquarks disappeared. Other data-heavy disciplines face similar challenges. Models which can be “tuned” in many different ways give researchers more scope to perceive a pattern where none exists. According to some estimates, three-quarters of published scientific papers in the field of machine learning are bunk because of this “overfitting”, says Sandy Pentland, a computer scientist at the Massachusetts Institute of Technology.
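
As a quick check on the "one in 3.5m" figure quoted above, the sketch below computes the probability that a standard normal variable lands more than five standard deviations above its mean. Treating the five-sigma standard as a one-sided Gaussian tail is an assumption here, and the helper name `one_sided_tail` is hypothetical.

```python
from math import erfc, sqrt


def one_sided_tail(n_sigma):
    """Probability that a standard normal variable exceeds n_sigma."""
    return 0.5 * erfc(n_sigma / sqrt(2.0))


p_five_sigma = one_sided_tail(5.0)
odds = 1.0 / p_five_sigma
print(f"p = {p_five_sigma:.3g}, about 1 in {odds:,.0f}")
```

The result is close to one in 3.5 million, which is a reminder of the quoted passage's point: a threshold that strict still did not protect the pentaquark analyses from unblinded data.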

I worry about paleoanthropology. Traditionally, the field has been data-poor. There are only a handful of fossils that represent any particular anatomical detail in any particular ancient species of hominins. That makes for small samples. But because the field is of very high interest, many paleoanthropological papers can report negative results and still be publishable in relatively high-profile journals. Indeed, several of my own papers have been essentially based on negative results – failure to disprove a null hypothesis, or failure to show significant change over time. Those results are interesting when we are trying to test the pattern of our evolution.

Today, ancient DNA has begun to provide vastly more data about some parts of our evolution. But comparing an ancient genome to the genomes of hundreds or thousands of living people is not straightforward. We require fairly sophisticated models to understand the evolutionary changes in these samples.

Models introduce the problem of overfitting. And models require assumptions, which are often hidden away in the supplementary information of high-impact papers. As we’ve seen recently, many of the initial conclusions about ancient genomes, made in the wake of the Neandertal and Denisovan discoveries in 2010, were overhyped. Along with some other anthropologists, I raised concerns about these at the time, pointing out which conclusions were very solid, and which other ones we should treat more cautiously. And I’ll continue to do that. But many people who are applying sophisticated models to ancient DNA data are not quite so cautious – they are looking for their publishable results. Negative results are, at the moment, less interesting or publishable in this field. I worry that the level of scrutiny at top journals may be relaxing.

Collins and Tabak are not concerned with paleoanthropology; they are interested in biomedical research. But many areas of human genetics face the same challenges as paleoanthropology – a sea of new data, with new methods of analysis, and high-profile papers being published that heavily depend on models. Collins and Tabak point out some of the problems of this environment:

Factors include poor training of researchers in experimental design; increased emphasis on making provocative statements rather than presenting technical details; and publications that do not report basic elements of experimental design. Crucial experimental design elements that are all too frequently ignored include blinding, randomization, replication, sample-size calculation and the effect of sex differences. And some scientists reputedly use a 'secret sauce' to make their experiments work — and withhold details from publication or describe them only vaguely to retain a competitive edge. What hope is there that other scientists will be able to build on such work to further biomedical progress?

I wish I had a “secret sauce”! In any event, the NIH has adopted a radical solution: Alter the format of the biosketch.

Perhaps the most vexed issue is the academic incentive system. It currently over-emphasizes publishing in high-profile journals. No doubt worsened by current budgetary woes, this encourages rapid submission of research findings to the detriment of careful replication. To address this, the NIH is contemplating modifying the format of its 'biographical sketch' form, which grant applicants are required to complete, to emphasize the significance of advances resulting from work in which the applicant participated, and to delineate the part played by the applicant. Other organizations such as the Howard Hughes Medical Institute have used this format and found it more revealing of actual contributions to science than the traditional list of unannotated publications. The NIH is also considering providing greater stability for investigators at certain, discrete career stages, utilizing grant mechanisms that allow more flexibility and a longer period than the current average of approximately four years of support per project.

I read that and said aloud, “What?” Talk about a toothless policy.

I don’t understand why people who apply for federal research money don’t have their funding revoked when they don’t follow agency policies. If a lab publishes hyped research, the principal investigator should be downgraded for future funding decisions. If the lab doesn’t archive data and make it available for replication, the lab should be downgraded for future funding decisions.

And if intervention at the level of the lab is not sufficient – as for advanced PIs who may be on their last funding cycle – then the university or research organization should be downgraded. Simple as that.

Fortunately, we don’t have too many data sharing problems in paleoanthropology.

References:

Collins, FS and Tabak, LA. 2014. Policy: NIH plans to enhance reproducibility. Nature 505, 612–613. doi:10.1038/505612a URL: http://www.nature.com/news/policy-nih-plans-to-enhance-reproducibility-1.14586