How long until paleoanthropologists must deposit data upon submission?


Genetics journals have for years routinely required sequence data to be deposited in a public database at the time that an article is published. Increasingly, these journals have broadened their policies to require such deposits for other types of data, such as gene expression, methylation, or genotype-phenotype association data. Some journals have begun to expect data to be deposited at the time of article submission, instead of publication. In principle such a policy enables referees to examine the data in addition to the analytical methods and results reported in a paper. In practice it gives editors more leverage to ensure that the data actually end up being deposited, since once they have made the decision to publish, the ultimate submission of the datasets can fall through the cracks.

Nature Genetics in a recent issue published an editorial commentary, “No impact without data access”. The editorial accompanies an article reviewing the major features of the European Genome/Phenome Archive. Researchers who are investigating the role of genes in disease or other phenotypes can deposit their data in this archive for long-term access control. The editorial raises several issues that made me consider how paleoanthropological data access may be destined to change during the next decade.

Openness and biomedical data

Biomedical research poses an inevitable tension between open data access and the need for the privacy of research subjects, many of whom are patients undergoing treatment for disease. Patient rights provide a clear argument against widespread distribution of data.

At the same time, the genetics community has long recognized both practical and moral reasons why data sharing is imperative. On the practical side, open access to data enables replication studies and the extension of results from one patient group or national population into other populations. Maybe most significantly, small trials are inevitably underpowered to find significant correlations, but if the data are archived and open to other researchers, they can be combined into larger-scale meta-analyses that can test for smaller effect sizes.
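To make the point about statistical power concrete, here is a rough sketch of how the chance of detecting a weak correlation grows with sample size. It uses the standard Fisher z approximation and treats a pooled dataset as if it were a single large sample, which is a simplification of what a real meta-analysis does. The function name, the effect size of 0.1, and the sample sizes are my own illustrative assumptions, not figures from any study discussed here.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power to detect a true correlation r with n subjects.

    Uses the Fisher z-transformation, whose standard error is 1/sqrt(n - 3).
    This is an illustrative approximation, not a substitute for a full
    power analysis.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic under the true correlation.
    shift = abs(atanh(r)) * sqrt(n - 3)
    # Probability of landing in either rejection tail.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Hypothetical numbers: one small study versus ten such studies pooled.
single_study = correlation_power(0.1, 200)
pooled_studies = correlation_power(0.1, 2000)
print(f"power with n=200:  {single_study:.2f}")
print(f"power with n=2000: {pooled_studies:.2f}")
```

Under these assumptions the single small study would miss a weak effect most of the time, while the pooled sample would very rarely miss it. That gap is the practical case for archiving data where others can combine them.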

The moral argument for data sharing recognizes the huge gift that patients give by allowing their data to be used for research. Further, government and private funder resources are invested into research. Scientists should be responsible stewards of both by making the maximum scientific impact that they can. The reuse and broader dissemination of data are good scientific practice.

The generosity of research subjects is not unlimited: Most patients who participate in scientific research do not give consent for public release of their medical, DNA, or epigenetic data. Some people will participate in research even if their data are made totally open to the public, as demonstrated by the Personal Genome Project. But most prefer that their data be kept private. Scientists are expected to maintain patient privacy, which can make it difficult to share data. Although de-identification of data is possible, a number of studies have shown the ease of using supposedly de-identified, anonymous data to obtain personal details of research participants. From the Nature Genetics editorial:

In addition to public variants, individual-level genetic and phenotypic data or summary statistics from the research projects are often required for replication, meta-analysis and many other secondary uses, such as methods development or use as control samples. However, these data must be processed, archived and transferred in a manner that respects the consent agreements signed by the study subjects. This often means that data can only be provided to bona fide researchers and used for specific research aims.

Research communities have implemented a number of solutions to enable sharing data while maintaining privacy. The dbGaP (database of Genotypes and Phenotypes) administered by the National Institutes of Health, for example, requires investigators to apply for access and agree to a code of conduct. When judged in light of data security, the provisions of dbGaP are in reality very weak, because they depend upon researchers and institutions complying with agreements, rather than enforcing strong protection through cryptography and segmented access. But the guidelines do comport with the general U.S. regulatory approach to medical records.

The European Genome/Phenome Archive is basically similar in function but addresses a different regulatory framework than dbGaP. As in the U.S., there is a tension between data access and patient privacy, but complicating matters is the variety of national regulations on biomedical research and data among European countries. Since many European biomedical research projects are international in scope, there is a huge array of bureaucratic variation governing the conditions under which any particular dataset can be shared.

What’s interesting about the Nature Genetics editorial is a passage in which the journal extends beyond the U.S. and Europe-centric databases to consider the regulatory burden upon local research enterprises elsewhere in the world.

Although we recognize that these US and European databases are suitable for most research in the field, national laws may require local databases and access protocols to be developed for different communities. The most positive benefit that could be accrued by local data stewardship would be capacity building through using data access to recruit qualified international experts to collaborate or work locally on the data. But, given the global reach of the internet and cloud, capacity could be built electronically as well as in person, so we urge forward-looking strategists and legislators to anticipate these benefits rather than to be unnecessarily restrictive.

Some thoughts: I don’t agree that recruiting international experts to work locally on data is “the most positive benefit” that could result from local databases. Most nations will want to develop local scientific capacity through training and increased publication by local scholars. Countries are wise to develop local areas of strategic advantage in which they can lead rather than follow international collaborations. The variation of human biology across populations is one area where nearly every country has both local scientific interest and global importance.

Paleoanthropology and data access

Paleoanthropology holds in common with human genetics that many of our most important research subjects are outside of Europe and the U.S. The research objects of paleoanthropology are not only essential parts of world heritage, but also the national heritage of many countries around the world. Institutions charged with responsibility for protecting heritage are rightly concerned that international collaboration not place them at a disadvantage. The fossil record of human evolution can be a strategic asset for local development, just as the biological heritage of human populations can be a strategic asset for development of local biomedical research expertise.

Reading this Nature Genetics editorial, I wonder how long before a similar editorial could be written about fossil hominins. This passage strikes me as especially freighted with implications (I added the emphases):

We regard a data descriptor and a live accession code to a permanent data set in a supported repository as the minimum acceptable data access provision compatible with publication in a high-impact journal and therefore hold the view that restrictive legislation with regard to access to data will inevitably place local researchers at an international disadvantage with respect to reputation, publication and collaboration. Without specific access provisions for qualified applicants to use data for purposes for which they were originally consented, such costive data management will also undermine trust in the research.

Paleoanthropologists should be familiar with having their research results questioned on the basis of whether their data can be trusted. Some large communities of people organize their beliefs around skepticism of the basic fossil data that underlie our knowledge of human evolution. Although we can do little to change the minds of those who will not look at the evidence, we can do a great deal to make the evidence more widely available to those who would. Of course there is no “informed consent” for paleoanthropological data, but there are considerations of heritage protection and public education, both of which argue for much wider distribution of original data.

Few paleoanthropologists or institutions have adopted the tools of open data accessibility to enhance trust in their research. This is a strategic failure. Replicability, transparency of methods and results, and access to primary materials are essential foundations of scientific practice. Paleoanthropology has nothing to gain from resisting a full integration with mainstream science; indeed such integration is essential to the future of our increasingly interdisciplinary field.

Logically, the “high impact” journals may prefer to lead the way in requiring data access—not because of an altruistic notion of quality science, but because data accessibility is one defense against the growing flood of retracted and non-replicated papers in biomedical fields. Faced with disputed findings, the journal can point to the availability of data and encourage replication and independent examination; ideally this will happen more and more commonly before publication instead of afterward.

Paleoanthropology so far has been an exception to this trend. For a long time it has been clear that the major “high impact” journals publish more questionable and sensationalized results than field-specific or open access journals. Still, we have seen very few retractions or corrections even in cases where a paper’s results were overturned by replication studies. Of course, when data are not available for a fossil sample, and when independent investigators are unable to examine the fossils, then no replication is possible. Hence, journals have been free to pursue studies that will attract media hype without facing real scrutiny.

I have undergone the review process at high impact journals (Science, Nature, and PNAS) many times in my career, including several published papers and several that were ultimately rejected. When I have published on genetics, reviewers have regularly included comments that request some assurance that data will be accessible upon publication. I have never had a reviewer of a paleoanthropology submission to these journals request any data accessibility whatsoever. That’s not a problem with the field in general: for example, when I have edited and reviewed papers, I consistently require (as an editor) or request (as a reviewer) that data be provided. But apparently neither I nor anyone like me reviews papers for Nature or Science.

This situation is not sustainable for exactly the reasons that the Nature Genetics editorial notes for biomedical data, namely:

restrictive legislation with regard to access to data will inevitably place local researchers at an international disadvantage with respect to reputation, publication and collaboration.

Restricting access to fossil data may provide advantages for a small coterie of Western investigators, but it harms the local institutions that are custodians of fossil remains. Forward-thinking institutions are building collaborations with a broad range of international investigators on questions of mutual interest, building the scientific significance of their fossil heritage.