Ascertainment into the future: what are datasets worth?

A reader helpfully pointed me to a new paper in PNAS that looks at the sampling scheme of the 1000 Genomes Project from the point of view of SNP discovery. They frame the question as “How many variants are yet to be found?”

Here’s part of the abstract:

Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (~350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (~150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants.

Well, if the main goal of the 1000 Genomes Project is better chip design for the future, then this question – what fraction of rare variants will be ascertained in the sample – is the most pertinent.

However, I for one am looking forward to the larger sample for a different reason. It should allow us to test for recent selection on lower-frequency variants. It may also help to localize the sites under selection. For that, I'm not all that interested in simply finding the one-percent alleles; I'm more interested in having enough copies of them to test hypotheses about change over time.
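To make the difference between finding an allele and having enough copies of it concrete, here is a rough back-of-the-envelope sketch. It is my own simplification, not the estimator used in the paper: it assumes diploid individuals, chromosomes sampled independently, a single allele at a known population frequency, and no sequencing error. The function names are made up for illustration.

```python
def detection_probability(freq, n_individuals):
    """Chance that an allele at population frequency `freq` shows up
    at least once among the 2n chromosomes of n diploid individuals.
    Assumes independent draws and perfect sequencing (a simplification)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


def expected_copies(freq, n_individuals):
    """Expected number of copies of the allele among 2n chromosomes."""
    return 2 * n_individuals * freq


if __name__ == "__main__":
    # Sample sizes loosely echo the figures quoted in the abstract.
    for n in (150, 350, 1000, 3000):
        for f in (0.01, 0.001):
            print(
                f"n={n:5d}  freq={f:.3f}  "
                f"P(seen at least once)={detection_probability(f, n):.3f}  "
                f"expected copies={expected_copies(f, n):.1f}"
            )
```

Under these assumptions, the probability of seeing an allele at least once saturates quickly as the sample grows, while the expected number of copies keeps growing in proportion to sample size. That second quantity is the one that matters for testing hypotheses about frequency change.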

At the end of the abstract, there is this interesting sentence:

Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.

I can’t get the PDF today, so I’ll be waiting to find out what this actually means. That is, I have no idea what they mean by “environmental response” genes.


Ionita-Laza I, Lange C, Laird NM. 2009. Estimating the number of unseen variants in the human genome. Proc Natl Acad Sci USA (early online). doi:10.1073/pnas.0807815106