Supplementary data loss

My inbox this morning has an article by Diana Kwon in The Scientist that looks into data decay in the supplementary materials of published scientific articles: “The Push to Replace Journal Supplements with Repositories”.

The story leads with Vaughn Cooper, an evolutionary biologist who published a recent paper on a secondary school biology curriculum in the journal Evolution: Education and Outreach. Readers quickly discovered that the supplementary files were inaccessible.

Supplementary information for journal articles is a bad idea. It has always been a bad idea. Journals at the dawn of the World Wide Web, faced with the opportunity to publish infinite pages at low cost, chose instead to create proprietary non-edited slush piles for methods and analyses totally separate from the standard distribution format of their articles. It’s a near-miracle twenty years later that any supplementary information can still be read by today’s software.

Instead of becoming standardized rich media for data distribution, supplements became a bloated morgue where Excel spreadsheets go to die.

But it’s not just broken hyperlinks that frustrate scientists. As papers get more data-intensive and complex, supplementary files often become many times longer than the manuscript itself—in some extreme cases, ballooning to more than 100 pages. Because these files are typically published as PDFs, they can be a pain to navigate, so even if they are available, the information within them can get overlooked. “Most supplementary materials are just one big block and not very useful,” Cooper says.
Another issue is that these files are home to most of a study’s published data, and “you can’t extract data from PDFs except using complex software—and it’s a slow process that has errors,” Murray-Rust tells The Scientist. “This data is often deposited as a token of depositing data, rather than people actually wanting to reuse it.”

That is a super phrase: “Token of depositing data”. It’s exactly the concern I raised earlier this week: “Biological Anthropology association speaks out on data access”.

Going through the motions of providing data instead of actually creating a useful foundation for further work is a universal problem. After all, work is work. As the software industry has long known, there’s no substitute for full and adequate documentation of code. Still, everyone is under pressure to produce outputs right now, and there is far less incentive to do the work that benefits everyone else.

To their credit, the biological anthropologists reference several online data repositories, which grant agencies and publishers are increasingly encouraging. The Scientist article introduces the subject of repositories as a substitute for supplementary information:

Mark Hahnel, the CEO and founder of figshare, says that he started the company during his doctoral studies out of frustration with the limitations of supplemental files. “We expected to play this role for people who were producing outputs of research that didn’t fit into the model of publishing PDFs,” he tells The Scientist. But increasingly, academics also are using figshare for other reasons, he adds, such as being able to freely reuse material associated with a published paper without worrying about infringing upon copyrights. (While research outputs such as figures in a traditional journal may be subject to a publisher’s copyright policies, those deposited to repositories like figshare are usually published with a creative commons license that allows others to use the material without restrictions.)

Data repositories are a partial solution for only one of the problems of data access: providing a way for readers to get the data and code that underlie a published analysis. Building a durable foundation for further work is another task that should be recognized and valued much more.

Part of that task is publishing texts that actually provide the details of analyses. This comes back to supplements. Today, too many published scientific papers are little more than glorified abstracts. Details are hidden in hundred-page supplements, where they are poorly reviewed (if they are reviewed at all) and rarely replicable.

Some papers should be broken up into independent units. Multidisciplinary work that actually consumes hundreds of pages of detail should be formatted and published in a way that recognizes the detail, not hides it. Scientists working on multidisciplinary problems need to model good writing to enable readers to follow how the details from many analyses fit together. That would be vastly more valuable as a foundation for later work than a citation in a weekly science journal.