Daniel MacArthur, of Genetic Future, reviews how much storage genomic data requires. Naively, you’d probably think it was around 12 billion bits (2 bits per base pair), but sequencing technologies and the availability of reference genomes from other people make things a little more complicated.
This interesting quote about the raw image files generated by the Illumina platform illustrates some of the complications:
Almost as soon as these images are generated they are fed into an algorithm that processes them, creating a set of text files containing the sequence of each of the fragments. The image files are then almost always discarded. Why are they discarded? Because, as you will see in a minute, storing the raw image data from each run in even a moderate-scale sequencing facility quickly becomes prohibitively expensive - in fact, several people have suggested to me that it would be cheaper to just repeat the sequencing than to store these data long-term.
An accurate read requires sequencing the same stretches many times over, and all that redundancy adds up to lots and lots of data storage. If the reads are winnowed down to a single “best” sequence, then you’re back to 12 billion bits (= 1.5 gigabytes), more or less. Of course, much of that sequence is repetitive and can be compressed significantly. And if you compare against a reference sequence, only a small amount of information is needed to distinguish your genome from the reference. Anyway, all this is explained at the link.
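For the curious, the back-of-the-envelope arithmetic and the reference-comparison idea can be sketched in a few lines of Python. This is only a toy illustration, not anything from the linked post: the figure of 6 billion base pairs is an assumption inferred from "12 billion bits at 2 bits per base pair", and `raw_size_bytes` and `diff_against_reference` are hypothetical helper names. The diff handles substitutions only, on two equal-length strings.

```python
# Assumption: a diploid genome of about 6 billion base pairs, which is what
# 12 billion bits at 2 bits per base pair implies.
DIPLOID_BASES = 6_000_000_000
BITS_PER_BASE = 2  # four possible bases (A, C, G, T) fit in 2 bits


def raw_size_bytes(n_bases, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return n_bases * bits_per_base // 8


def diff_against_reference(sample, reference):
    """List (position, base) pairs where sample differs from reference.

    Toy version: substitutions only, sequences must have equal length.
    """
    if len(sample) != len(reference):
        raise ValueError("this sketch only compares equal-length sequences")
    return [
        (i, s)
        for i, (s, r) in enumerate(zip(sample, reference))
        if s != r
    ]


print(DIPLOID_BASES * BITS_PER_BASE)        # 12000000000 bits
print(raw_size_bytes(DIPLOID_BASES) / 1e9)  # 1.5 (gigabytes)

# Made-up 12-base sequences, purely for illustration.
reference = "ACGTACGTACGT"
sample = "ACGTACCTACGA"
print(diff_against_reference(sample, reference))  # [(6, 'C'), (11, 'A')]
```

The point of the second function is just that two differences take far less room to write down than twelve bases; a real comparison would also have to handle insertions and deletions.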