An oxytocin lab opens its drawers to let out the unpublished null results

This came to me via Razib Khan, and the signal deserves amplification: “Is there a publication bias in behavioral intranasal oxytocin research on humans? Opening the file drawer of one lab”. From the article:

In the present case, only 5 articles (2, 8, 27, 34, 35) have been published across the 13 dependent variables we have assessed, producing a publication rate of 38.5%. If our lab is a representative sample of IN-OT [intranasal oxytocin] research, then for 626 search results found in Scopus by entering “oxytocin” and “human” as research keys (and limiting the outputs to “Psychology”), approximately 1000 potential studies have remained in labs’ drawers. Unraveling these 1000 data sets is extremely important for understanding whether IN-OT exerts reliable effects on humans and under which circumstances.

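It's worth making the extrapolation explicit: if this lab's publication rate of 5/13 ≈ 38.5% holds across the field, then the 626 published Scopus hits are only about 38.5% of all studies actually run, leaving roughly 1000 in drawers. A minimal sketch of the arithmetic, using only the numbers from the quote:

```python
# Back-of-the-envelope file-drawer estimate from the quoted numbers.
published_outcomes = 5    # outcomes from this lab that reached publication
assessed_outcomes = 13    # dependent variables the lab assessed
publication_rate = published_outcomes / assessed_outcomes  # ~38.5%

published_hits = 626      # Scopus results for "oxytocin" + "human" in Psychology
# If published work is only ~38.5% of all work, scale up and subtract.
estimated_total = published_hits / publication_rate        # ~1628 studies
estimated_unpublished = estimated_total - published_hits   # ~1002 studies

print(f"publication rate: {publication_rate:.1%}")
print(f"estimated unpublished studies: {estimated_unpublished:.0f}")
```
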
Publication bias is at the center of the reproducibility crisis in the behavioral sciences and, in combination with small sample sizes, leads almost inevitably to researchers reinforcing whatever biases and assumptions they bring into the research. This article illustrates many of the logical fallacies scientists fall into as they struggle to test hypotheses that others have published. To me, the most illuminating passage was this:

A second proposition is that IN-OT effects do exist, but that they are strongly moderated by various factors, making them appear large in some circumstances but not others. Through the literature, more and more findings suggest that IN-OT influences behaviors by interacting with several moderators (for a review see (19)). Arguably, our findings do not rule out the possibility that the effects of IN-OT are moderated by various factors – a proposition that will be difficult to rule out, given the infinitely large set of factors that could potentially moderate IN-OT’s behavioral influences (genes, personality or environmental factors). Unfortunately, as far as we know, candidate moderators do not seem to replicate from one study to another (8) and appear most often to represent post-hoc data fits rather than a-priori hypotheses (9). Indeed, one can be sure to find a “significant” interaction in any data set, simply by conducting many statistical tests, even in the absence of a true signal in the data, unless the test level alpha is corrected for multiple hypothesis testing (43, 44).

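That multiple-comparisons point is easy to demonstrate by simulation. The sketch below is mine, not the paper's: generate pure noise, test twenty candidate “moderators” against the outcome, and count how often at least one clears alpha = 0.05, with and without a Bonferroni correction (the sample size and number of moderators are arbitrary choices for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_moderators, n_sims, alpha = 40, 20, 2000, 0.05

false_alarm = false_alarm_bonf = 0
for _ in range(n_sims):
    # Pure noise: the outcome is independent of every candidate moderator.
    outcome = rng.standard_normal(n_subjects)
    moderators = rng.standard_normal((n_subjects, n_moderators))
    pvals = np.array([stats.pearsonr(moderators[:, j], outcome).pvalue
                      for j in range(n_moderators)])
    false_alarm += pvals.min() < alpha                      # uncorrected
    false_alarm_bonf += pvals.min() < alpha / n_moderators  # Bonferroni

print(f"P(>=1 'significant' moderator), uncorrected: {false_alarm / n_sims:.2f}")
print(f"P(>=1 'significant' moderator), Bonferroni:  {false_alarm_bonf / n_sims:.2f}")
```

With twenty independent tests, the chance of at least one spurious hit is 1 - 0.95^20 ≈ 64%; the correction pulls it back down to the nominal 5%.
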
It is such an alluring idea: the “significant” effects observed by another lab are real, and the failure to replicate is merely a failure to recreate the precise context in which the results were observed. Scientists should be unafraid to call BS in cases like this. Forget replicating on samples of the same size; if an effect is important, it should replicate in a sample an order of magnitude larger.
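
The “order of magnitude larger” prescription has a straightforward power rationale. A minimal sketch, assuming a hypothetical small true effect (Cohen's d = 0.3) and an original study with 25 subjects per group; both numbers are mine, chosen for illustration:

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test for effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)  # expected z statistic under the effect
    return norm.sf(z_crit - noncentrality) + norm.cdf(-z_crit - noncentrality)

for n in (25, 250):  # original study vs. a 10x replication
    print(f"n per group = {n:3d}: power to detect d = 0.3 is {power_two_sample(0.3, n):.2f}")
```

At the original scale, the study detects a real d = 0.3 only about one time in five, so a non-replication tells you almost nothing. At ten times the sample, power is over 90%, and a null result actually means something.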