Via Jay Shendure, who shared this ad on Twitter this weekend:
Original advertisement that brought in the donors for Human Genome Project (Buffalo News, 3/23/1997), h/t Pieter de Jong, who placed the ad
People who worked with HGP data in the early days will remember how the entire genome appeared to be designed by committee. Genetic samples from around thirty people were ultimately included, so different parts actually reflected the genetic heritage of entirely different individuals.
The donors were chosen to be “representative” of the genetics of the U.S., meaning that some parts of the draft genome were African in ancestry, most were European, and a few were Asian. But the donors were anonymous, and the first draft of the genome was being completed at a time when the diversity of most parts of the genome was unknown (by definition, since they hadn’t ever been sequenced in anybody!).
Given the incredible expense of the project, I think this was an appropriate (and probably unavoidable) decision, but it did make some kinds of population genetic analysis very difficult to carry out. In genetics, how a variant was first identified (its “ascertainment”) exerts a statistical bias on results. To understand the significance of variation, it is first necessary to know the direction of this bias. Many of us did a lot of complicated modeling to try to work around this aspect of the Human Genome Project draft.
The decision had a legacy that lived on through the first few generations of microarrays, because the single nucleotide polymorphisms (SNPs) that these microarrays tested were discovered in human samples that were initially very small, many of them HGP samples. When applying a microarray to individuals from a population, it is very important to know whether the SNPs were ascertained within the same population or a different one. A microarray will always miss rare variation in a sample, but it will also miss much more of the common variation when the sample comes from a different population than the ascertainment sample.
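For readers who want to see the effect rather than take it on faith, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than a description of how any real array was built: two hypothetical populations "A" and "B" whose allele frequencies diverge from a shared ancestral frequency under a Balding-Nichols style beta model, a discovery panel of only four chromosomes drawn from population A, and a 5% minor allele frequency cutoff for calling a variant "common". The function name `simulate_capture` and all parameter values are made up for the example.

```python
import random


def simulate_capture(n_variants=20000, n_discovery=4, fst=0.15,
                     common_maf=0.05, seed=1):
    """Fraction of each population's common variants that land on the array.

    SNPs are "discovered" only if they are polymorphic in a small panel of
    chromosomes sampled from population A (the ascertainment sample).
    """
    rng = random.Random(seed)
    scale = (1.0 - fst) / fst  # beta concentration implied by the assumed Fst
    common = {"A": 0, "B": 0}    # variants common in each population
    on_array = {"A": 0, "B": 0}  # of those, how many were discovered

    for _ in range(n_variants):
        # Assumed ancestral frequency, then independent drift in each population.
        p0 = rng.uniform(0.02, 0.98)
        freqs = {pop: rng.betavariate(p0 * scale, (1.0 - p0) * scale)
                 for pop in ("A", "B")}

        # Ascertainment: count allele copies in the small panel from A.
        copies = sum(rng.random() < freqs["A"] for _ in range(n_discovery))
        discovered = 0 < copies < n_discovery  # polymorphic in the panel

        for pop, p in freqs.items():
            if min(p, 1.0 - p) >= common_maf:
                common[pop] += 1
                on_array[pop] += discovered

    return {pop: on_array[pop] / common[pop] for pop in common}


capture = simulate_capture()
for pop, fraction in capture.items():
    print(f"Common variants in population {pop} captured by the array: {fraction:.1%}")
```

Under these assumptions the array captures a smaller share of the common variation in population B than in population A, even though the two populations are modeled identically. The only asymmetry is where the SNPs were ascertained. Raising `fst` widens the gap, and enlarging `n_discovery` raises capture in both populations without removing the difference.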
Over time, microarray SNPs began to be ascertained on broader samples of populations, and resequencing (especially the 1000 Genomes Project) began to address the problems of representation that were insoluble in the HGP. But it’s interesting to see this historical ad that set in motion a long-lasting statistical problem.