Making Big Data work in genetics

Laura Clarke and colleagues report on the data access and management practices of the 1000 Genomes Project [Clarke:1000:2012].

The larger data volumes and shorter read lengths of high-throughput sequencing technologies created substantial new requirements for bioinformatics, analysis and data-distribution methods. The initial plan for the 1000 Genomes Project was to collect 2× whole-genome coverage for 1,000 individuals, representing ~6 gigabase pairs of sequence per individual and ~6 terabase pairs (Tbp) of sequence in total. Increasing sequencing capacity led to repeated revisions of these plans to the current project scale of collecting low-coverage, ~4× whole-genome and ~20× whole-exome sequence for ~2,500 individuals plus high-coverage, ~40× whole-genome sequence for 500 individuals in total (~25-fold increase in sequence generation over original estimates). In fact, the 1000 Genomes Pilot Project collected 5 Tbp of sequence data, resulting in 38,000 files and over 12 terabytes of data being available to the community. In March 2012 the still-growing project resources include more than 260 terabytes of data in more than 250,000 publicly accessible files.
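The arithmetic behind the original plan can be checked with a rough sketch. It assumes a haploid human genome of about 3 gigabase pairs and decimal units (1 Tbp = 1,000 Gbp); neither figure is stated in the paper summary above, and the function name is purely illustrative.

```python
# Assumption: haploid human genome of roughly 3 gigabase pairs (Gbp).
GENOME_SIZE_GBP = 3


def total_sequence_tbp(coverage, individuals, genome_gbp=GENOME_SIZE_GBP):
    """Return total sequence in terabase pairs (Tbp) for a sequencing plan."""
    per_individual_gbp = coverage * genome_gbp  # e.g. 2x coverage -> ~6 Gbp
    return per_individual_gbp * individuals / 1000  # 1 Tbp = 1,000 Gbp


# Original plan: 2x whole-genome coverage for 1,000 individuals.
original_plan_tbp = total_sequence_tbp(coverage=2, individuals=1000)
print(original_plan_tbp)  # 6.0
```

Under these assumptions the result matches the ~6 Tbp quoted for the initial plan.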

The paper acknowledges that this large-scale genetic sequencing project nevertheless generates far less data than physics and astronomy projects. The Large Synoptic Survey Telescope, for example, will generate 20 terabytes each night of operation, while the Large Hadron Collider will generate roughly 15 petabytes per year. The 1000 Genomes Project data to date add up to around two weeks of LSST operation. Still, it’s not hard to see how high-coverage sequencing will start to catch up in data storage and transfer requirements.
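The "two weeks" comparison follows directly from the figures quoted above. A minimal sketch, assuming decimal units (1 petabyte = 1,000 terabytes) and, for the Large Hadron Collider, output spread evenly over 365 days (a simplification of mine, not a claim from the paper):

```python
# Figures quoted in the text above.
PROJECT_TB = 260          # 1000 Genomes Project data as of March 2012
LSST_TB_PER_NIGHT = 20    # Large Synoptic Survey Telescope
LHC_PB_PER_YEAR = 15      # Large Hadron Collider

# Nights of LSST operation needed to match the project's data volume.
lsst_nights = PROJECT_TB / LSST_TB_PER_NIGHT

# Assumption: LHC output spread evenly across the year, 1 PB = 1,000 TB.
lhc_tb_per_day = LHC_PB_PER_YEAR * 1000 / 365
lhc_days = PROJECT_TB / lhc_tb_per_day

print(f"LSST nights: {lsst_nights:.0f}")
print(f"LHC days: {lhc_days:.1f}")
```

Thirteen nights of LSST observation is the "around two weeks" in the text; by the same rough reckoning, the LHC would produce a comparable volume in under a week.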

We are now in a golden age of data centralization. But five years from now, we may return to a second era of disposable data, as gene expression and whole-genome resequencing studies will generate far more data than any central repository can store. We will need curation practices to identify and preserve data that have value beyond the project for which they were collected.

The beautiful thing about this is that when data are abundant, they don’t all have to work together. There is a real role for a new generation of curators to facilitate the mashups of the future.