Big data, little data

Jacquelyn Gill is a paleoecologist who writes at her blog, The Contemplative Mammoth. Today she ponders a paradox: at the same time that pollen data are more and more in demand for meta-analyses, positions and funding for primary analysis of pollen have been decreasing: “Is pollen analysis dead? Paleoecology in the era of big data”.

Pollen analysis is expensive, time-consuming, and even hazardous (hello, hydrofluoric acid!). I often joke that a week in the field translates to a year in the lab; it can take as much as a year to produce a single pollen diagram from one sediment core. Add to that the time and costs associated with radiocarbon dating (anywhere from $250 to $600 a pop!) and any other analyses to fill out the environmental picture. In the time it takes someone to publish one paper based on pollen data they collected, someone analyzing pollen data can generate several papers. There's arguably less reward (from a publications perspective) for that single site than there is from a multi-site synthesis. With these in mind, it's easy to see how it can be more attractive to work with pollen data than to generate it.
Here's the thing: We need to be generating pollen records. There are major gaps in spatial and temporal coverage even in North America, let alone the rest of the world. South America, Africa, Asia, and Australia have some excellent records but nowhere near the spatial or temporal coverage of Europe and North America.

Meta-analysis of “big data” is sexy and tests the kinds of theories that make waves in science, but it cannot proceed without primary data gathering. In human genetics, many “big data” projects (like the 1000 Genomes Project) are explicitly collaborations among many labs and research groups, so that primary data gathering is integrated with a series of meta-analyses at different scales. The data are then made openly available to other researchers to do further meta-analysis or to add their own datasets. That kind of system takes pretty explicit centralization of funding.

In paleoanthropology, we have pretty much the opposite. Meta-analysis is not a funding priority, and primary data gathering (excavation, description) can still build a very good career. But we sometimes struggle with higher-order hypotheses that can be answered only by open analysis of data from many sites, because key data from some places are hidden away, accessible only to a small cadre of researchers.

So there are two ways that different fields have addressed the empirical data gathering crunch: Hide data so that only people generating new data can publish, or centralize data availability so that many people have an incentive to add new data. You can guess which one I think works better…