Human genetic variation in a (very large) nutshell

Hinds and colleagues (2005) report in Science on a study that genotyped 1,586,383 single nucleotide polymorphisms (SNPs) in a sample of 71 people. The sample was drawn from Americans in three subsamples representing African, Asian, and European ancestry. The goal of the study was to add to knowledge about the frequencies of SNP variation in different medically relevant populations, while assessing the linkage among SNPs. These data would help formulate better strategies for tracing the genetic correlates of disease and other phenotypic traits.

The data were acquired with these medical goals in mind, which limits to some extent their ability to address interesting issues about human evolution. In particular, the authors selected known SNPs that were judged likely to be at high frequency in multiple populations. This selection process, called ascertainment, was complicated enough to make it difficult to use the data in models of genetic evolution. For example, a large set of the candidate SNPs was selected from public databases, which are not random samples of the three subpopulations considered here, so the three are likely to differ in allele frequencies in ways characteristic of this bias. Because of the ascertainment complexity, it is unlikely that geneticists will be able to use these data to accurately reconstruct ancient evolutionary events (although that may not stop them from trying).

The most interesting part is a brief consideration of the role of natural selection in differentiating populations from each other. As the authors note, one suggestion concerning the distribution of genetic differentiation (as measured by FST) is that different genes have undergone very different patterns of global or local selection. This hypothesis implies that candidate genes for local selection could be identified by their relatively large FST values. (Such genes would have high FST in any event; the distinction is that if genetic drift were largely responsible for human differentiation, then many non-locally-adapted genes might also have high FST values.) As they put their findings:

If this is true, then larger FST values should be found near functional genetic elements. We looked at the distribution of FST for SNPs that were genic or nongenic, coding or noncoding, and synonymous or nonsynonymous. We performed the analysis within subsets of SNPs grouped by MAF [minor allele frequency], so that effectively, we looked at the fraction of between-population variance for SNPs with the same total genetic variance. Common SNPs in genetic regions do have slightly but significantly higher FST values than nongenic SNPs with the same MAF . . . and common coding SNPs have slightly higher FST values than noncoding SNPs in genic regions. . . . These results are consistent with local selection changing the distribution of FST near functional sequences. However, because the distributions of FST among genic and nongenic SNPs are very similar, large FST values by themselves appear to be very weak evidence of selection (1074).
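To make the comparison in that passage concrete, here is a minimal sketch of computing FST per SNP and averaging it within MAF bins for each SNP category. It assumes the simple heterozygosity-based definition FST = (H_T - H_S) / H_T with equal weight for each population, which is not necessarily the estimator Hinds and colleagues used. The function names and the example frequencies are hypothetical.

```python
from collections import defaultdict


def fst(freqs):
    """FST for one biallelic SNP from per-population allele frequencies.

    Uses (H_T - H_S) / H_T with equal population weights (an assumption).
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled sample
    if h_t == 0:
        return 0.0  # monomorphic overall: no variance to apportion
    # Mean expected heterozygosity within populations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


def minor_allele_frequency(freqs):
    """Frequency of the rarer allele, averaged across populations."""
    p_bar = sum(freqs) / len(freqs)
    return min(p_bar, 1 - p_bar)


def mean_fst_by_maf_bin(snps, n_bins=5):
    """Average FST per (MAF bin, category).

    snps: iterable of (category, [p_pop1, p_pop2, ...]) pairs.
    Bins split the MAF range [0, 0.5] into n_bins equal intervals.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for category, freqs in snps:
        maf = minor_allele_frequency(freqs)
        bin_index = min(int(maf / 0.5 * n_bins), n_bins - 1)
        entry = totals[(bin_index, category)]
        entry[0] += fst(freqs)
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up frequencies, for illustration only (not data from the study).
example = [
    ("genic", [0.20, 0.50, 0.35]),
    ("genic", [0.30, 0.45, 0.40]),
    ("nongenic", [0.30, 0.40, 0.35]),
    ("nongenic", [0.25, 0.35, 0.45]),
]
for (bin_index, category), value in sorted(mean_fst_by_maf_bin(example).items()):
    print(bin_index, category, round(value, 4))
```

Grouping by MAF matters because the maximum attainable FST depends on overall allele frequency, so comparing genic and nongenic SNPs without binning would confound category with frequency.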

Of course there is another reason that genic and coding SNPs might not be much more differentiated than the average: global selection may have constrained them to similar frequencies. Given the huge range of genes in the scope of this analysis, it is hard to say which force of selection should be predominant, or whether the two should be nearly balanced, as they would have to be to explain the data. Certainly genes like the MHC genes would be expected to be held at broadly similar frequencies across populations. But then some of those are precisely the genes that should be very different among populations, as a result of different microbial histories. The authors also examined the private (confined to one sample) SNPs to see if they were more likely to be genic, finding that they were not. This is not surprising, since these alleles are by definition rare, and therefore unlikely to underlie strongly selected differences between populations. The few that might be locally selected are surely lost in the volume of rare alleles that are either deleterious or subject entirely to drift.

It seems to me that the way to address the FST issue is to examine the distribution of FST estimates for the SNPs. Given the observed sample frequencies of the SNPs and some assumptions about population histories, it should be possible to derive an expected distribution of FST. Comparing that expected distribution to the observed distribution would give some information about whether the genes had been subject to drift alone, or whether they had been significantly perturbed in some way.
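As a rough illustration of that idea, the sketch below simulates an expected FST distribution under drift alone and compares an observed value against it. Everything about the population history here is an assumption chosen for brevity: three populations split at once from an ancestor with allele frequency `p0`, each drifts independently under a Wright-Fisher model with a constant number of chromosomes, and a fixed number of chromosomes is sampled from each. It also ignores the ascertainment problem discussed earlier, and all names and parameter values are hypothetical.

```python
import random


def fst(freqs):
    """(H_T - H_S) / H_T for one biallelic SNP, equal population weights."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # allele fixed or lost everywhere
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


def binomial(rng, n, p):
    """Number of successes in n Bernoulli(p) draws (plain stdlib version)."""
    return sum(rng.random() < p for _ in range(n))


def simulate_null_fst(p0, n_pops=3, chromosomes=100, generations=10,
                      sample_size=48, reps=300, seed=0):
    """Sorted FST values expected under drift alone (assumed toy history)."""
    rng = random.Random(seed)
    null = []
    for _ in range(reps):
        sample_freqs = []
        for _ in range(n_pops):
            p = p0
            # Wright-Fisher drift: resample the whole population each generation.
            for _ in range(generations):
                p = binomial(rng, chromosomes, p) / chromosomes
            # Finite sample of chromosomes drawn from the drifted population.
            sample_freqs.append(binomial(rng, sample_size, p) / sample_size)
        null.append(fst(sample_freqs))
    return sorted(null)


def empirical_p_value(null, observed):
    """Share of simulated FST values at least as large as the observed one."""
    exceed = sum(1 for value in null if value >= observed)
    return (exceed + 1) / (len(null) + 1)


null = simulate_null_fst(p0=0.3)
observed_fst = 0.25  # hypothetical observed value for one SNP
print("median null FST:", round(null[len(null) // 2], 4))
print("empirical p-value:", round(empirical_p_value(null, observed_fst), 4))
```

A real analysis would need a defensible demographic model, would condition on the observed overall frequency of each SNP rather than on an assumed ancestral frequency, and would have to model how the SNPs were ascertained.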

The data are publicly available; if you can think of a good use for them, have at it!

References:

Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, and Cox DR. 2005. Whole-genome patterns of common DNA variation in three human populations. Science 307:1072-1079.