It's another week, which means it's time for another of the mysteriously-not-yet-appeared PNAS papers. But this time, a friendly source sent me the paper (which should be here, when it appears).
You can see my comments on it in David Biello's article at Scientific American.
Basically, this is the analysis of DNA damage in the genome sequence coming out on the 454 sequencing platform. They show the damage as limited to two kinds of chemical changes, which generate spurious C->T and G->A substitutions. Also, the position of the nucleotide in the sequencing read makes a difference: terminal nucleotides have more than 5 times as many spurious errors as some internal ones. There is a lot of detail in the paper about all of this.
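To make the position effect concrete, here is a minimal sketch of how one might tally substitution types separately for terminal and internal positions. This is not the paper's pipeline: the function name, the `edge` cutoff, and the toy sequences are all made up for illustration, and it assumes each read is already aligned, gap-free, to a reference of the same length.

```python
from collections import Counter


def tally_substitutions(pairs, edge=1):
    """Count reference->read mismatches, split by position in the read.

    pairs: iterable of (reference, read) strings, already aligned and
           of equal length (an assumption; real alignments have gaps).
    edge:  number of bases at each end treated as "terminal"
           (a hypothetical cutoff, not a value from the paper).
    """
    counts = {"terminal": Counter(), "internal": Counter()}
    for ref, read in pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must have equal length")
        n = len(read)
        for i, (r, q) in enumerate(zip(ref, read)):
            if r == q:
                continue  # matching base, nothing to record
            where = "terminal" if i < edge or i >= n - edge else "internal"
            counts[where][f"{r}->{q}"] += 1
    return counts


# Toy data, invented for illustration only.
toy_pairs = [
    ("CAGGTC", "TAGGTC"),  # C->T at the first base
    ("GATTCG", "GATTCA"),  # G->A at the last base
    ("ACCGTA", "ACTGTA"),  # C->T at an internal base
]
substitution_counts = tally_substitutions(toy_pairs)
```

Dividing each tally by the number of terminal or internal bases examined would turn these counts into the kind of per-position error rates the comparison above refers to.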
The interesting part is near the end, where the paper discusses ways to quantify contamination and correct for misincorporated bases. Basically, one can quantify and correct all such problems with multiple coverage. The more copies of a given sequence are produced, the more feasible it becomes to estimate the proportion of contaminant sequence, at least in areas where the contaminant can be differentiated based on its sequence. But even where contaminants are possibly identical to the endogenous sequence, it may be possible to find them in various ways (for example, if three homologous sequences were present instead of two, or if a significant excess of some particular sequence were present). Likewise, misincorporations can be found and corrected with multiple reads.
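As a rough illustration of why multiple coverage helps, the sketch below takes a simple majority vote at each position across several reads of the same region and reports the fraction of reads that disagree. A plain majority vote is my stand-in here, not the method the paper uses, and the function name and toy reads are hypothetical. It assumes the reads are already aligned and of equal length.

```python
from collections import Counter


def consensus_and_minority(reads):
    """Majority-vote consensus plus per-position disagreement.

    reads: aligned, equal-length strings covering the same region
           (an assumption; real reads overlap only partially).
    Returns (consensus string, list of minority fractions per position).
    Ties are broken by whichever base was seen first in the column.
    """
    consensus = []
    minority_fraction = []
    for column in zip(*reads):
        counts = Counter(column)
        base, top = counts.most_common(1)[0]
        consensus.append(base)
        minority_fraction.append(1 - top / len(column))
    return "".join(consensus), minority_fraction


# Toy reads, invented for illustration: one read carries a lone C->T change.
toy_reads = [
    "ACGTAC",
    "ACGTAC",
    "ATGTAC",
    "ACGTAC",
]
consensus, minority = consensus_and_minority(toy_reads)
```

With four reads, the lone T at the second position is outvoted and the consensus recovers the C. An isolated disagreement like this looks like a misincorporation, while a minority variant that recurs across many reads and positions would be more suggestive of a second source of DNA, which is the intuition behind estimating contamination from coverage.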
The demonstration of these problems helps to increase confidence in the outcome: if we understand the errors, we can better evaluate results. The important (and unanswered) question is how much fossil grinding it will take to get good genomes.