Overstating the obvious

I'm reading this interesting paper by Joseph Pickrell and colleagues, titled, "Signals of recent positive selection in a worldwide sample of human populations". The paper recounts the results of a selection scan in the Human Genome Diversity panel, which was reported in two publications last year. This is an interesting sample because it includes individuals from 53 population samples around the world.

I was waiting to present any observations about selection from the HGDP set until Pritchard's lab had published on them, since the initial publications had mentioned that this analysis was forthcoming. Now that it's appeared, I'll be pointing to a lot of these data in upcoming posts.

So I was reading with great interest. Then I found this statement:

Reports of ubiquitous strong (s = 1-5%) positive selection in the human genome (Hawks et al. 2007) may be considerably overstated (8).

I'm a little concerned that someone reading that might think that Pickrell and colleagues had actually tested our hypothesis about the number of recent strongly selected alleles. I'm also uncertain about the word "ubiquitous", which means "everywhere." I mean, does that really sound like the kind of word I would use? It's just begging for trouble. It's like saying there's "ubiquitous" evidence of Neandertal contribution to the later European gene pool. Even if I thought it was true, I wouldn't put it in a paper!

We reported that roughly seven percent of genes appeared to be selected. Pickrell and colleagues list a rather large number of candidate loci for selection, and don't give any estimate or test of the number genome-wide. I think one might be able to count the regions listed in the data supplement for an estimate of what they thought was important enough to list, but I can't get the supplement yet. Since these candidate loci require 16 supplementary figures to list, maybe there are a lot of them. They do list a subset of more than 110 in the paper itself.

So what's the basis for saying we overstated anything? They suggest one reason for caution about the interpretation of candidate loci for selection:

We find that putatively selected haplotypes tend to be shared among geographically close populations. In principle, this could be due to issues of statistical power: broad geographical groupings share a demographic history and thus have similar power profiles. However, strongly selected loci are expected to show geographical patterns largely independent of demography -- depending on the relevant selection pressures, they can be highly geographically restricted despite moderate levels of migration, or spread rapidly throughout a species even in the presence of little migration (Nagylaki 1975; Morjan and Rieseberg 2004) (8).

But wait a minute! If a gene were selected strongly and still polymorphic in human populations, it shouldn't be very old. So it can't have spread rapidly throughout the human species even in the presence of little migration. There hasn't been any time for this kind of spread.

To give a little mathematical perspective, one common way of modeling the dispersal of an advantageous gene is the Fisher diffusion wave model. In a Fisher wave, the gene grows logistically at any single point in space, and the allele frequencies form a wave of constant shape that travels through space at a constant velocity. That velocity in a population uniform across 2-dimensional space is σ times the square root of s, where s is the selection coefficient and σ the root mean square dispersal distance -- basically, the average distance a person moves between his birth and the birth of his children.

If we want to know about dispersal of selected genes in early agriculturalists, we will need to know how far they move -- that's generally less than 10 km on average. So a gene selected strongly with a 5 percent advantage should move around 2.2 km/generation. Over the 400 generations since the beginning of agriculture, we'd expect a new allele to have dispersed across an area with a radius of less than 1000 km.
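For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the simple form of the wave speed used above (σ times the square root of s) and plugs in the illustrative values from this post: σ = 10 km, s = 0.05, and 400 generations. The function names are mine, made up for illustration, and other formulations of the Fisher wave differ from this one by a constant factor, so treat the output as a rough order-of-magnitude figure rather than a precise estimate.

```python
import math


def fisher_wave_speed(sigma_km, s):
    """Wave speed in km per generation, using the post's form v = sigma * sqrt(s).

    sigma_km: root mean square dispersal distance per generation (km)
    s: selection coefficient (0.05 means a 5 percent advantage)
    """
    return sigma_km * math.sqrt(s)


def spread_radius(sigma_km, s, generations):
    """Radius (km) covered by a wave moving at constant speed for some generations."""
    return fisher_wave_speed(sigma_km, s) * generations


# Illustrative values taken from the text, not estimates from data.
sigma_km = 10.0    # upper bound on average dispersal for early agriculturalists
s = 0.05           # strong selection: 5 percent advantage
generations = 400  # roughly the time since the beginning of agriculture

speed = fisher_wave_speed(sigma_km, s)
radius = spread_radius(sigma_km, s, generations)

print(f"speed: {speed:.2f} km/generation")                          # speed: 2.24 km/generation
print(f"radius after {generations} generations: {radius:.0f} km")  # about 894 km
```

Because 10 km is an upper bound on σ here, the roughly 900 km radius is itself an upper bound under this model. Halving σ to 5 km halves the radius.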

So in other words, it's just implausible that a selected allele would have a geographic distribution very different from the one expected under drift, at least under the Fisher wave model. But obviously, some alleles have gone a lot farther than 1000 km in the last 10,000 years. Humans don't disperse strictly according to a Gaussian distribution, as assumed by the Fisher model; they sometimes disperse long distances. This can have a large impact on the spread of an advantageous allele. But it is an irregular phenomenon -- a stochastic event.

Let's consider the results a bit further. Here's a passage from page 1:

We find extensive sharing of putative selection signals between genetically similar populations, and limited sharing between genetically distant ones. In particular, Europe, the Middle East, and Central Asia show strikingly similar patterns of putative selection signals.

Which is exactly what we would predict from the history of these populations. Most signals of selection in Europe are Neolithic in date. The Neolithic was not only a time of massive population growth, but also the time of greatest mismatch between the human population and its novel agricultural environment. The dispersal of Neolithic lifeways from West Asia into Europe, and the recurrent incursions of Central Asian languages westward across the steppe into Europe and southward into the Indian subcontinent are the major features of the last 10,000 years of history in those regions. Don't we expect them to share a lot of selection signals? And if it took the massive migrations and interactions in those regions to generate this shared pattern of selection, shouldn't we expect other regions of the world, which lacked such extensive long-distance movements, to share fewer signals?

In this case, the critical information for evaluating the evidence is historical and archaeological. We can't just say that the candidate loci for selection have a similar geographic distribution to those that aren't selected. We need to evaluate the likelihood that they would have some other distribution. That likelihood is very low for most instances of selection, but may be high for a fraction of cases, or for some regions where long-distance dispersal was a more important aspect of population history.

So if we have a locus that is inconsistent with drift on the basis of linkage, we can reject drift. What if the geographic distribution is still consistent with drift? Should we doubt the linkage analysis? I don't see why -- basic biogeography says that most recently selected genes should have geographic distributions similar to those produced by drift.

References:

Pickrell JK, Coop G, Novembre J, Kudaravalli S, Li JZ, Absher D, Srinivasan BS, Barsh GS, Myers RM, Feldman MW, Pritchard JK. 2009. Signals of recent positive selection in a worldwide sample of human populations. Genome Res (early online) doi: 10.1101/gr.087577.108