Looking for the balances
A nice paper from last August by K. L. Bubb and colleagues went looking for new balanced polymorphisms in the human genome. They didn't find any.
There's a lot of complexity in the research approach, involved with sifting through SNP data looking for true (i.e., not false) positives. That part is not very interesting, and will probably be superseded by new data. But the plot thickens in the discussion, where the paper reviews patterns of selective balances and the conditions under which they may persist.
Bees R Us
The PNAS Early Edition this week includes a paper by bee genome researchers Amro Zayed and Charles Whitfield. After a short review of honeybee phylogeny, they demonstrate two things:
1. An ancient dispersal of honeybees from Africa into Europe was accompanied by a pulse of positive selection on coding genes, amounting to selection on approximately 10 percent of bee genes.
2. As Africanized bees have spread across South and into North America, adaptive genes from the existing populations of European bees have introgressed into the Africanized population, increasing under positive selection.
These are remarkable parallels to the worldwide evolution of humans. In bees, the geographic pattern is not the same, and the timescale is different, but the overall genetic impact is quite similar.
Here's the bee history:
In its native range, A. mellifera is classified into approximately two dozen subspecies, which are further organized into four major geographically and genetically distinct groups: African, Western and Central Asian (hereafter referred to as Asian), Eastern European, and Western and Northern European (hereafter referred to as West European) (9-11). European honey bees were introduced by humans to the New World by European settlers as early as the 1600s. In Brazil in 1956, an intentional introduction of African honey bees (A. mellifera scutellata), which hybridized with previously introduced European bees, led to the establishment and spread of the highly invasive and economically devastating Africanized honey bees in North America and South America (12). Subsequent studies have shown that Africanized bees are predominantly African in ancestry with minor but consistent contribution from European genotypes (11, 12). Using recently developed SNP panels, Whitfield et al. (11) demonstrated that the honey bee originated in Africa and subsequently expanded into Eurasia in two or more independent ancient expansions. One expansion gave rise to Western European honey bees, and at least one other independent expansion gave rise to Asian and Eastern European honey bees. Honey bee subspecies vary in a host of phenotypic traits, such as morphology, behavior, physiology, and gene expression (9-11, 13, 14) (Zayed and Whitfield 2008:3421).
I was not aware of the initial dispersals of bees into Europe and Asia. The genetic data show that the Western European strains are the ones with the most adaptive evolution since their dispersal from Africa. The separate ancient bee dispersals were documented by Whitfield et al. (2006), but they were not able to provide date estimates for the ancient dispersals, and none are attempted in this study.
This is the kind of test that ought to fail in most wild populations. Without a shift in the adaptive landscape, the fraction of new mutations with potential adaptive value is bound to be small -- because species are optimized to the environments that they have occupied for a long time. But European bees have faced a number of recent environmental changes, including the simple effect of moving from a tropical to a temperate environment, the need to use new and different flora, and the effects of domestication. In a very numerous, rapidly dispersing species, these effects led to a rapid adaptive response in a large proportion of genes. These are the basic principles underlying the recent acceleration of positive selection in our lineage also.
The introgression of European genes into the dispersing Africanized bees in the Americas is interesting, because it seems counter-intuitive. The main differences between Africanized bees and European bees involve adaptations to climate. European bees put up lots of honey for the winter, and swarm less frequently, in addition to being more sedate. African bees don't bother with as much honey, which together with their more frequent swarming would seem to be a good fit for the tropical pattern of seasonality. These African traits explain why the African bees have spread at the expense of the European bees across the tropical New World. But Africanized bees have picked up a lot of genes from the European bees in the New World.
The authors propose some possible explanations:
The adaptive value of functional (coding) portions of Western European genomes could be related to positive selection on novel variation in West European bees, to positive selection on novel hybrid gene combinations, and/or to selection for heterozygous genotypes. Our study thus provides direct evidence that invasive populations can exploit hybridization in an adaptive fashion -- a finding of immense relevance to understanding the dynamics of biological invasions (Zayed and Whitfield 2008:3424).
In other words, behavioral correlates of climate may be a target of selection and introgression -- I would speculate because of the intrinsic rarity of adaptive mutations in these functions.
This is a relatively coarse-grained analysis of positive selection, since the study basically averages within SNP categories, determining FST between pairs of populations. For non-coding SNPs, the Africanized bees are very similar to African bees (FST = 0.05), while for coding SNPs they are twice as divergent (FST = 0.10). That's a lot of difference in allele frequencies over a short time; it must have been caused by strong positive selection across a broad sample of loci. They do not attempt the same kind of "10% of genes" estimate for the introgression, but their figures show that it is quite significant across their data.
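To make the "averaging FST within SNP categories" idea concrete, here is a minimal sketch. It assumes biallelic SNPs, two equally weighted populations, and Wright's FST = (HT - HS) / HT; the paper's actual estimator may differ, and every allele frequency below is made up for illustration.

```python
def fst_biallelic(p1, p2):
    """Wright's FST for one biallelic SNP, given the frequency of the same
    allele in two equally weighted populations: (HT - HS) / HT."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                       # total heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
    if h_t == 0:
        return 0.0  # allele fixed or absent in both populations
    return (h_t - h_s) / h_t


def mean_fst(freq_pairs):
    """Average per-SNP FST over one category of SNPs."""
    return sum(fst_biallelic(a, b) for a, b in freq_pairs) / len(freq_pairs)


# Hypothetical (population A, population B) allele frequencies.
noncoding = [(0.50, 0.60), (0.20, 0.25), (0.70, 0.78)]
coding = [(0.50, 0.80), (0.20, 0.45), (0.70, 0.40)]

print("non-coding mean FST:", round(mean_fst(noncoding), 3))
print("coding mean FST:    ", round(mean_fst(coding), 3))
```

The comparison of interest is simply whether the coding category sits above the non-coding one, which is what the FST = 0.10 versus 0.05 contrast reports.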
I suspect it may be a while before this initial study can be followed up with recombination-based selection tests, because of this little-known fact: bees have a recombination rate of 19 cM/Mb -- roughly 15 times higher than humans. Still, Whitfield et al. (2006) found an excess of linkage disequilibrium in the West European subspecies of bees. It now seems likely that some of this LD is explained by the widespread selection documented in the current study.
In other words, the genetic structure of global bee populations provides another strong example of the importance of rapid evolution in abundant species, coupled with ecological changes. Bees also now provide a strong example of adaptive introgression -- in this case, within a very tightly timed dispersal with known climatic conditions.
References:
Zayed A, Whitfield CW. 2008. A genome-wide signature of positive selection in ancient and recent invasive expansions of the honey bee Apis mellifera. Proc Nat Acad Sci USA 105:3421-3426. doi:10.1073/pnas.0800107105
Whitfield CW and 9 others. 2006. Thrice out of Africa: Ancient and recent expansions of the honey bee, Apis mellifera. Science 314:642-645. doi:10.1126/science.1132772
Genome-wide selection in humans
I spent like a half-hour looking for this paper this morning, which I didn't take notes on when it came out. I don't know if Nature is blocking Google Scholar or what, but it certainly didn't help that I thought I remembered it being in Science! In any event, blogging defeats all memory loss!
There's a LiveScience piece about the paper also.
Natural selection on protein-coding genes in the human genome
Carlos D. Bustamante et al.
...Here we contrast patterns of coding sequence polymorphism identified by direct sequencing of 39 humans for over 11,000 genes to divergence between humans and chimpanzees, and find strong evidence that natural selection has shaped the recent molecular evolution of our species. Our analysis discovered 304 (9.0%) out of 3,377 potentially informative loci showing evidence of rapid amino acid evolution. Furthermore, 813 (13.5%) out of 6,033 potentially informative loci show a paucity of amino acid differences between humans and chimpanzees, indicating weak negative selection and/or balancing selection operating on mutations at these loci. We find that the distribution of negatively and positively selected genes varies greatly among biological processes and molecular functions, and that some classes, such as transcription factors, show an excess of rapidly evolving genes, whereas others, such as cytoskeletal proteins, show an excess of genes with extensive amino acid polymorphism within humans and yet little amino acid divergence between humans and chimpanzees (Bustamante et al. 2005:1153).
The study identified selection by comparing the level of within-species polymorphism to the level of between-species divergence (comparing humans and chimpanzees) for both synonymous and nonsynonymous sites. Basically, they set up a 2 x 2 contingency table with four cells (divergence-nonsynonymous, divergence-synonymous, polymorphism-nonsynonymous, polymorphism-synonymous) and test for equality of rates -- except they show departures one way or the other with a parameter, gamma:
The parameter is negative if a gene shows an excess of amino acid polymorphism (or paucity of amino acid divergence) and positive if a gene has an excess of amino acid divergence relative to the genomic average for synonymous sites.
They can't apply the test to all genes, because many don't have enough polymorphism -- it takes a few segregating nonsynonymous variants to get a significant result. So they can test for positive selection at 3,377 genes and negative selection at 6,033.
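The 2 x 2 table described above can be sketched with the classic McDonald-Kreitman test. This is a deliberate swap: the paper estimates gamma with a Bayesian method, while the sketch uses the neutrality index and Fisher's exact test, with -log(NI) serving only as a rough stand-in for the sign of gamma. The counts are hypothetical, and all four cells are assumed to be non-zero.

```python
from math import comb, log


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Probability of a table whose top-left cell is x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mk_summary(dn, ds, pn, ps):
    """dn, ds: fixed nonsynonymous / synonymous differences between species.
    pn, ps: polymorphic nonsynonymous / synonymous sites within species.
    Returns (direction, p_value). direction > 0 means excess amino acid
    divergence; direction < 0 means excess amino acid polymorphism."""
    neutrality_index = (pn / ps) / (dn / ds)
    return -log(neutrality_index), fisher_exact_two_sided(dn, ds, pn, ps)


# Hypothetical gene with many fixed amino acid differences, few polymorphisms.
direction, p_value = mk_summary(dn=20, ds=10, pn=2, ps=18)
print("direction:", round(direction, 2), "p-value:", p_value)
```

A gene with too few polymorphic or fixed sites gives an uninformative table no matter what happened to it, which is why only a subset of genes can be tested.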
They find the following:
Although we find that most of the genes in the informative data sets (n = 4,916) show no evidence of selection according to our methods, we do classify 813 loci as significantly negatively selected and 304 as positively selected at a 5% cutoff. Of the 50 loci identified by ref. 18 as rapidly evolving, 45 were informative about positive selection in our data set. Of these, 14 (31.1%) had more than 95% of their posterior mass above γ = 0, and 37 (82.2%) had a majority of their posterior mass above neutrality, indicating good agreement between population genetic and phylogenetic approaches for identifying rapidly evolving genes. There is a high degree of overlap in the types of genes classified as positively selected by both studies (see Table 1 and Supplementary Table 1), including defence/immunity proteins (P = 0.00965), gametogenesis (P = 0.03411), apoptosis (P = 0.00336) and sensory perception (P = 0.04577). One interesting result of our analysis is that transcription factors as a group seem to be rapidly evolving (P < 0.0001), with 39 out of 240 in the informative data sets (16.25%) having P+ greater than 97.5% as compared to 9.6% of all loci in the IPS data set. Similarly, we find evidence that the categories of nuclear hormone receptors (P = 0.00143) and genes involved in nucleoside, nucleotide and nucleic acid metabolism (P = 0.00467) have an excess of rapidly evolving genes.
The choice of a cutoff for these kinds of multiple-comparison analyses is an interesting story in itself. I was at a lecture by Bret Payseur yesterday, and he described the problem very well. Naturally, if you choose 5 percent as a cutoff for significance, you are going to be saying that 5 percent of genes are in your category of interest -- which for the human genome is going to amount to 1000 genes or more no matter what. Were 1000 genes under selection? Should you choose a more conservative cutoff -- maybe 1 percent? Maybe 0.1 percent?
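The arithmetic behind that worry is short enough to write down. The sketch assumes a round figure of 20,000 genes (my assumption, not a number from the paper) and treats the cutoff as a plain false-positive rate, which a Bayesian posterior cutoff is not exactly.

```python
def expected_null_hits(n_tests, alpha):
    """Expected number of loci passing a cutoff alpha if every locus is neutral."""
    return n_tests * alpha


N_GENES = 20_000  # assumed round number for the human gene count

for alpha in (0.05, 0.01, 0.001):
    hits = expected_null_hits(N_GENES, alpha)
    print(f"cutoff {alpha}: about {hits:.0f} genes flagged by chance alone")
```

Tightening the cutoff shrinks the list, but it never tells you how many of the survivors are real.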
The solution these researchers are using is to compare results with functional classes and chromosome position. If you find that there is a nonrandom association between certain functional classes of genes (e.g. immunity proteins) and your selection criterion, then it is prima facie evidence against neutrality. Likewise, if you find a nonrandom association between your selection criterion and chromosome position, that is evidence against neutrality.
In both cases, this is a statistical conclusion taken from lots of loci and not a strong judgment about any individual locus. That is why these papers take to discussing the "top 50" candidates and such arbitrary numbers. The idea is that if you have the most extreme set of genes, these are least likely to be neutral. Find convincing relationships with phenotypes under selection --- like skin color, or muscular dystrophy, or such --- and, again, you confirm the hypothesis of selection.
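One common way to test for a nonrandom association between a functional class and a selection criterion is a hypergeometric tail probability. The sketch below is an illustration, not necessarily the test the paper used. The 39-of-240 transcription factor count comes from the passage quoted above; the total of 3,377 loci and the figure of 324 flagged loci (9.6% of 3,377) are my own assumptions about how the quoted percentages map onto counts.

```python
from math import comb


def hypergeom_upper_tail(k, class_size, total_hits, total_loci):
    """P(X >= k), where X is the number of flagged loci in a class of
    class_size loci drawn at random from total_loci, of which total_hits
    are flagged."""
    denom = comb(total_loci, class_size)
    upper = min(class_size, total_hits)
    numer = sum(
        comb(total_hits, x) * comb(total_loci - total_hits, class_size - x)
        for x in range(k, upper + 1)
    )
    return numer / denom  # exact integers until this final division


# 39 of 240 transcription factors flagged (quoted); totals are assumed.
p_enrich = hypergeom_upper_tail(k=39, class_size=240, total_hits=324, total_loci=3377)
print("P(at least 39 flagged by chance):", p_enrich)
```

A small tail probability here says something about the class as a whole. It says nothing about which of the 39 genes were actually under selection, which is the point made above.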
What we lack thus far is a good way to tell for sure whether a locus has been under selection from molecular information alone. And we may never get such a criterion, because drift can look like selection too often, and vice versa.
I think there may be something more interesting behind these observations:
We also used our approach to identify loci and classes of genes that show a paucity of amino acid divergence between humans and chimpanzees, yet have moderate to high levels of amino acid polymorphism, which we believe to harbour an excess of mildly deleterious variation. For example, loci involved in actin binding (P = 0.00013; Supplementary Table 1) and cytoskeletal formation (P < 0.00001) contain an over-representation of negatively selected loci, with 36 out of 205 cytoskeletal proteins in the informative set (17.6%) having P- greater than 97.5% (Table 1). For example, 6 out of the 9 myosin heavy chain loci exhibit a large excess of amino acid polymorphism, including non-muscle myosin (MYH9, P- > 0.983), embryonic (MYH3, P- > 0.999), perinatal (MYH8, P- > 0.962) and adult skeletal (MYH4, P- > 0.946; MYH13, P- > 0.957) myosin as well as the smooth muscle (MYH11, P- > 0.999) form. Other cytoskeletal proteins with excess amino acid polymorphism within human populations include myomesin 2 and 3 (MYOM2, MYOM3), dystrophin related protein 2 (DRP2), alpha- and beta-adducin (ADD1, ADD2), sarcospan (SSPN) and scinderin (SCIN). These results are consistent with the fact that as a group, genes involved in cell structure and motility (Table 1) show a signature of negative or purifying selection (P = 0.00008), with 27 out of 176 (15.3%) loci exhibiting excess amino acid polymorphism relative to divergence.
Mutations in cytoskeletal protein-coding genes are known to cause a number of mendelian diseases and have been implicated in various complex disorders. For example, with the dystrophin gene many different types of mutation are known to cause both Duchenne and Becker types of muscular dystrophy (DMD, P- > 0.969). Also in this set is myosin VIIA (MYO7A, P- > 0.99), a gene implicated in Usher syndrome (1B), the most common cause of congenital deafness and blindness in developed countries. Similarly, the alpha- and beta-adducin genes (P- > 0.99 for both) are associated with hypertension and cardiovascular disease, and one known causative variant (ADD1 G460W (refs 19, 20)) is found at moderate frequencies in both African Americans (9.7%) and European Americans (28.9%) in our sample.
So human variation is enriched for coding variants in many muscle and cell-structure related genes. The low divergence between-species is evidence that these genes are conserved, but they have relatively many coding variants in humans today. Hmmm...
Then there is this:
Another interesting group of genes that show excess amino acid polymorphism (P = 0.02805) are those involved in ectoderm development, with 12 loci out of 98 exhibiting significantly elevated levels of amino acid polymorphism relative to amino acid divergence (Table 1). These include three loci in which mutations are known to cause disease: GLI3 (polydactyly), NOTCH3 (Drosophila homologue implicated in cerebral arteriopathy) and DCC (colorectal carcinoma). Interestingly, all three of these genes are also involved in neurogenesis according to the Panther molecular function classification. Genes involved in general vesicle transport also show excess amino acid polymorphism as a class (P = 0.00016). Sixteen loci have posterior distributions with more than 95% and four with more than 99% mass above γ = 0 (for example, COPE, HD, KIF4A, SEC31L1 and STX11). The most familiar of these is HD -- the gene that causes Huntington's disease.
So some genes affecting neurogenesis also have an excess of coding polymorphism in humans compared to the amount of between-species divergence.
I'm not sure that negative selection is the story for genes like these. Maybe it is -- for example, one explanation of excess deleterious variation being present in human populations today would be a recent population bottleneck. Genetic drift could have carried a few slightly deleterious variants to detectable frequencies, which ultimately will be selected out of the human population. But if that were true, I'd expect the pattern would be stronger at genes that are really strongly conserved, like histones. You would have to have some explanation for why some conserved genes show an excess of polymorphism and others don't -- presumably, because they happen to be really weakly selected within humans recently, but more strongly selected sometime in the more distant past.
References:
Bustamante CD et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437:1153-1157. Full text (subscription)
The problem with HapMap: a parable of potholes
I've been sifting through some HapMap-related stuff. It's a tremendous resource for looking at human variation, but it also presents some tremendous problems. An interesting review of some of the information coming out of HapMap by Gil McVean and colleagues is running in PLoS Genetics.
These guys are statisticians working to analyze some of the data, and they put into words very well some of the issues I've been butting against.
There is great heterogeneity across the genome in terms of patterns of genetic variation. Some of this heterogeneity is due to variation in factors such as mutation rate and recombination rate. Some of this heterogeneity arises because of the stochastic properties of mutation and genealogical history. But there are also other forces such as natural selection and genomic features such as inversions that may influence local patterns of variation. How can we look for the effects of such factors? There are two approaches. Either we can try to predict what we would expect to observe under models with and without such effects [23,24], or we can simply look at the empirical distribution of statistics of genetic variation and take as candidate regions those showing extreme or unusual patterns. The difficulty of the first approach is that accurately modelling human variation (and SNP ascertainment) is probably impossible. The difficulty of the latter approach is that there is no guarantee that empirically unusual patterns point to biologically interesting features (McVean et al. 2005:e54, emphasis added).
I would add that there is an additional difficulty with the second approach -- namely, that it assumes that selection or other factors cause heterogeneity. More about that later.
This point about ascertainment and demography is an important one. Predictions about the effects of evolution (including drift) upon genetic variation are simplest under random sampling. Since the beginning of human genetics, almost no one has attempted to sample people at random. Some nonrandomness may be desirable -- for example, recent demographic changes make a random sample of today's humans very different in composition from a random sample of humans in 1491, or 6000 B.C., or almost any time in the past. Which of these "random" samples would represent the population whose history we care about? If we are interested in prehistoric events, we may find it desirable to represent peoples in proportion to their prehistoric distributions. This has precisely been the approach of some genetic surveys.
But even from this simple example, the problems with human demography are clear. We can try to shift samples to represent prehistoric "distributions" of populations, but geography is not the only aspect of demography that has changed. There have been massive population mixtures that would never have had the opportunity to occur in prehistoric times. There have been diseases and demographic crashes that we know about, and probably many that we don't know about.
The question is whether it is possible to arrive at a "rough draft" of human demographic history that would be precise enough to generate theoretical distributions of human variation. As it happens, that is precisely the goal of this paper by Stephen Schaffner and colleagues in Genome Research. From the abstract:
With the advent of large empirical data sets, it is now possible to calibrate population genetic models with genome-wide data, permitting for the first time the generation of data that are consistent with empirical data across a wide range of characteristics. We present here the first such calibrated model and show that, while still arbitrary, it successfully generates simulated data (for three populations) that closely resemble empirical data in allele frequency, linkage disequilibrium, and population differentiation. No assertion is made about the accuracy of the proposed historical and recombination model, but its ability to generate realistic data meets a long-standing need among geneticists (Schaffner et al. 2005:1576, emphasis added).
The problem with the approach is precisely in the boldface sentence. It is possible to use empirical data to calibrate a model that generates simulated data that is similar to the empirical data. The point of using such a calibrated model is to be able to show how strange certain regions are if they don't fit the simulated distribution, which is based on the empirical distribution.
But it's all circular.
Suppose we wanted to use a detailed topographic survey of a road to find the potholes. But for everyday roads, there is a problem -- there are lots of bumps and grooves that aren't potholes. And different parts of the road are more or less bumpy. It would help a lot if we could use the empirical distribution of bumps to simulate a section of road -- then we could figure out whether anomalies in the real road were likely to be potholes or not.
Now suppose that the road isn't just pocked with the occasional pothole -- it has a pothole every three or four feet. Remember why we're using simulations -- not only do we not know where the potholes are, we don't know how common they are. So our simulations based on the pothole-rich road will find that pothole-sized bumps are normal. If pothole-sized bumps are not unusual, then our simulation can have only one result: a pothole is not a pothole.
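The parable can be put in toy numerical form. The sketch below is not a coalescent simulation and has nothing to do with the details of any published model. It assumes each locus gets one score, that neutral loci score around 0 and selected loci around 3 (both invented), and that "unusual" means the top 5% of the empirical distribution.

```python
import random


def empirical_tail_scan(frac_selected, n_loci=20_000, tail=0.05, seed=1):
    """Flag the top `tail` fraction of scores in a mixture of neutral and
    selected loci. Returns (fraction of all loci flagged, fraction of truly
    selected loci that end up flagged)."""
    rng = random.Random(seed)
    n_sel = int(n_loci * frac_selected)
    neutral = [rng.gauss(0.0, 1.0) for _ in range(n_loci - n_sel)]   # ordinary bumps
    selected = [rng.gauss(3.0, 1.0) for _ in range(n_sel)]           # potholes
    scores = sorted(neutral + selected)
    cutoff = scores[int(len(scores) * (1 - tail))]  # empirical tail threshold
    flagged = sum(s >= cutoff for s in scores) / len(scores)
    power = sum(s >= cutoff for s in selected) / len(selected)
    return flagged, power


for frac in (0.01, 0.10, 0.50):
    flagged, power = empirical_tail_scan(frac)
    print(f"{frac:.0%} selected: {flagged:.1%} flagged, "
          f"{power:.0%} of selected loci detected")
```

The flagged fraction stays at about 5% by construction, whatever the road looks like. What changes is how many of the real potholes clear a bar that the potholes themselves helped to set.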
Consider this current paper on the CCR5Δ32 allele, generally thought to be recently selected in Europeans as a disease defense against smallpox (my earlier entry). Sabeti et al. (2005) revise the date of the mutation from only around 700 years ago to around 5000. But more important, they deny that the allele was necessarily selected. Why? Because its pattern of linkage disequilibrium is relatively common across the genome.
Our reanalysis of CCR5 shows that CCR5-Δ32 does not clearly stand out from the rest of the genome in terms of allele frequency distribution, population differentiation, or long-range LD (Figure S8). The high population differentiation and long-range LD found for CCR5-Δ32 are, in fact, far more common in the genome than previously believed, and therefore do not provide support for the hypothesis of strong selection for CCR5-Δ32. Using methods described both in the previous study [8] and in the current study, and by examining currently available data, there is no detectable evidence for recent selection for CCR5-Δ32 (Sabeti et al. 2005:e378).
Ceci n'est pas un pothole.
Why is it that simulations like these are not attempts to make accurate historical models? Quite simply, because they can't. After years of attempts at reconstructing human evolution based on "neutral" genetic loci, the HapMap at last has thrown out the possibility entirely. If we want to use the broadest source of information, we have to take with it some significant lumps. And one of the biggest is that we simply don't know the selective dynamics of most of the genome.
For the anthropologist interested in history, it is not critical that we be able to develop a model of demographic history accurate enough to serve as a theoretical distribution for testing selection. Our goals are often much less ambitious, and there is much information about demographic history to be had from the HapMap and like projects. But neither should we minimize the problems. Attempting to use simulations to match the variation of the genome as a whole simply isn't going to work, if any substantial proportion of the genome has been under recent selection. And as we move to higher parameter models of demography (Schaffner et al. 2005 attempt a 21-parameter model) selection on relatively few sites becomes more and more capable of distorting demographic estimates.
By far the worse problem is finding selection.
For example, McVean et al. (2005) embark upon an examination of the "tails" of the genome -- the parts that show highly unusual patterns of variation compared to most regions. The idea is that if these very unusual loci had areas of biological interest (i.e., particular genes), their very strangeness might lead to a hypothesis of historical change (such as selection).
But they run into problems:
Of the 19 genes with previous evidence for historical selection, 12 show an unusual pattern of genetic variation in at least one population (defined as having a value lying in either the bottom 5% or top 5% of empirical values). Superficially, this result suggests that statistical tests based on rejecting a simple population genetics model are effective at detecting genes of interest. However, for 114 tests, we might expect 11 to lie in either the top or bottom 5% of observations, compared to the 17 observed. Another concern is that genes of known functional and selective importance, such as Duffy and CD40 ligand, do not fall in the tails of the empirical distribution of Tajima's D and Fay and Wu's H statistics and others, such as MMP3, hemochromatosis (HFE), and aldehyde dehydrogenase 2 (ALDH2) show patterns that are unusual, but not indicative of the action of recent selective sweeps.
Multiple comparisons are pretty tricky across the entire genome and in multiple populations. Considering this, we might see the HapMap and similar surveys as hotbeds of type II error: if we go looking for strange things, we are going to miss the forest for the trees.
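The "11 expected versus 17 observed" comparison in the quoted passage is a binomial calculation. The sketch below assumes the 114 tests are independent, each with a 10% chance (bottom 5% plus top 5%) of landing in a tail. Independence is my simplification: tests on the same gene in different populations are surely correlated.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


N_TESTS = 114   # from the quoted passage
P_TAIL = 0.10   # bottom 5% plus top 5%
OBSERVED = 17   # from the quoted passage

print("expected in tails:", N_TESTS * P_TAIL)
print("P(at least 17 by chance):", binom_upper_tail(OBSERVED, N_TESTS, P_TAIL))
```

The excess over expectation is modest, which fits the caution in the passage: the known selected genes do not pile up in the tails nearly as much as one would hope.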
There are two main conclusions from these analyses. First, that biologically interesting loci often do have unusual patterns of genetic variation, but that there is no single way of measuring "unusual" that is uniformly powerful for detecting the action of natural selection. Second, that rejection of neutral evolutionary models is no guarantee that the locus is unusual when compared to the rest of the genome (McVean et al. 2005:e54).
"Unusual compared to the rest of the genome" is a phrase you should expect to hear a lot of in the next few years.
References:
McVean G, Spencer CCA, Chaix R. 2005. Perspectives on human genetic variation from the HapMap project. PLoS Genetics 1:e54. Full text (free)
Sabeti PC et al. 2005. The case for selection at CCR5-Δ32. PLoS Biol 3:e378. Full text (free)
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. 2005. Calibrating a coalescent simulation of human genome sequence variation. Genome Res 15:1576--1583. Full text (free)
Looking for local selection via STR diversity
This is one of those old papers I run across sometimes doing research:
A Genome Scan to Detect Candidate Regions Influenced by Local Natural Selection in Human Populations
Manfred Kayser, Silke Brauer and Mark Stoneking
As human populations dispersed throughout the world, they were subjected to new selective forces, which must have led to local adaptation via natural selection and hence altered patterns of genetic variation. Yet, there are very few examples known in which such local selection has clearly influenced human genetic variation. A potential approach for detecting local selection is to screen random loci across the genome; those loci that exhibit unusually large genetic distances between human populations are then potential markers of genomic regions under local selection. We investigated this approach by genotyping 332 short tandem repeat (STR) loci in Africans and Europeans and calculating the genetic differentiation for each locus. Patterns of genetic diversity at these loci were consistent with greater variation in Africa and with local selection operating on populations as they moved out of Africa. For 11 loci exhibiting the largest genetic differences, we genotyped an additional STR locus located nearby; the genetic distances for these nearby loci were significantly larger than average. These genomic regions therefore reproducibly exhibit larger genetic distances between populations than the "average" genomic region, consistent with local selection. Our results demonstrate that genome scans are a promising means of identifying candidate regions that have been subjected to local selection.
I hadn't noticed this wrinkle at the time, but it would seem that this study proves that STR variation is noticeably affected by positive selection at linked sites.
There is of course something fundamentally circular about choosing the "most extreme" differences among loci to define selection. After all, who is to say that 11 is the right number to choose? Why not 50? Why not 100? In fact, the study picked 15 loci and found only 11 with nearby STRs that could be typed. So 11 out of 332 isn't really the proportion; there is an unknown proportion of STRs affected by positive selection, some of which are affected differently by local positive selection in different populations and as a consequence show significantly high RST values.
In any event, this certainly makes suspect the idea that STR loci are unbiased neutral markers for reconstructing population history. Some of them may be, but at present we can't tell the "neutral" ones from the ones linked to positively selected sites. The proportion of these loci linked to sites under local positive selection -- the kind surveyed in this study -- might be relatively small. But how many are linked to sites under global positive selection?
Studying linkage with STR markers is tough. We are already throwing out STR loci with low repeat variance, because they don't vary. But some of these loci have low repeat variance because they have short genealogies -- either as a consequence of selection (presumably on linked sites) or drift. Now, we have to sift through loci that are variable to assess how much their variability may be affected by linkage to selected sites. And their variability isn't affected in any simple way -- complete linkage to a site currently under selection would increase the frequency of one or a few alleles at an STR site; partial linkage might affect different alleles in different populations; in neither case is there any easy way to tell what's going on -- because as the paper explains, there isn't even a theoretical distribution to compare them against. What a mess!
And this is interesting:
We assumed that local selection would primarily influence Europeans, as modern humans originated in Africa, and hence new opportunities for local selection would have occurred as modern human populations spread out of Africa. Some support for this assumption comes from the distribution of ln RV [ratio of allele size variance in Africans vs. Europeans] values (fig. 3), in which there is an excess of extreme positive values (i.e., in the right-hand tail); since the variance in Africans appears in the numerator of the ln RV value, this indicates that there are many more loci showing significantly reduced variation in Europeans (relative to Africans) than in Africans (relative to Europeans). However, this should be interpreted cautiously, as an extreme bottleneck in Europeans, which is suggested by some genetic data (Tishkoff et al. 1996; Yu et al. 2002), could also lead to an excess of loci with significantly reduced variation in Europeans relative to Africans (Schlötterer 2002a).
There's another reason for cautious interpretation: the European sample was all from Leipzig, while the African sample was taken from four groups in different locations in Ethiopia and South Africa!
In fact, that is a generally unrecognized problem in these kinds of studies: Are African and European samples apples and oranges? How much does post-Pleistocene population history affect the genetics of these populations? How much difference does it make sampling people in a large city vs. people in a bunch of villages? How much difference does local selection make within continents, if it already seems to make a large difference between them?
Clearly Africans are more variable on average than Europeans -- that's not at issue. But too many studies treat that observation as the end of the issue -- if you observe that your African sample is more variable than your European sample, then nothing else about them matters, QED. In this study, a lot of loci were more variable in the European sample than in the African sample (i.e., they have negative ln RV). Is the proportion of such loci informative? That depends on how much the sampling scheme might have affected the diversity of the samples.
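To make the sign convention concrete, here is a minimal sketch of ln RV for a single STR locus. It assumes the statistic is simply the natural log of the ratio of repeat-number variances, with the African variance in the numerator as the quoted passage describes. The function name and the repeat counts are invented for illustration, and the paper's actual procedure may differ in detail.

```python
import math
from statistics import variance

def ln_rv(african_repeats, european_repeats):
    """Natural log of (African repeat-number variance / European variance)."""
    var_afr = variance(african_repeats)   # sample variance, n - 1 denominator
    var_eur = variance(european_repeats)
    if var_afr == 0 or var_eur == 0:
        raise ValueError("ln RV is undefined when either sample has zero variance")
    return math.log(var_afr / var_eur)

# Toy repeat counts, invented for illustration only.
afr = [10, 12, 14, 16, 18]
eur = [13, 14, 14, 14, 15]

print(ln_rv(afr, eur))   # positive: variation is reduced in the European sample
print(ln_rv(eur, afr))   # negative: the samples swapped
```

A locus in the right-hand tail of the distribution is one like the first call, where the European sample has lost variance relative to the African one. Nothing in the calculation knows whether that loss came from selection, a bottleneck, or the way the samples were collected.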
I don't think there is really any easy solution to this sampling problem -- we aren't going to know how much difference the Neolithic may have made for a long time, for instance. But skepticism seems like a healthy attitude.
References:
Kayser M, Brauer S, Stoneking M. 2003. A genome scan to detect candidate regions influenced by local natural selection in human populations. Mol Biol Evol 20:893-900. Full text online
The probability of parallel evolution
Orr (2005) considers the likelihood of the same mutants being fixed in two populations as a function of parallel selection, compared to drift. The model used is a very simple one, basically involving a single locus in each population with a limited number of advantageous mutants that may be presented to both populations.
The argument for the idea that beneficial mutations are limited is probably right:
Throughout this analysis, I make a major assumption: the number of beneficial mutations is small. This will almost certainly be true for two reasons. First, environments are autocorrelated through time, making it unlike [sic] that a previously highly fit wild-type allele would suddenly plummet in relative fitness; second, random changes in a functional protein are much more likely to worsen than to improve protein function (216).
The result of the paper is that parallel evolution is likely under such circumstances. This is not especially surprising; the innovative aspect of the paper is the demonstration that it holds under many models of the distribution of fitness effects among mutations. The equations in the paper are derived from extreme value theory, with the basic theme being that the fittest possible new mutations are also the rarest, so these will preferentially be incorporated into populations.
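The logic can be illustrated with a toy simulation. This is a sketch under my own simplifying assumptions, not the paper's derivation: each population has the same n beneficial mutations available, their selection coefficients are drawn from an exponential distribution, and each population fixes mutation i with probability proportional to its coefficient. The function names are hypothetical.

```python
import random

def prob_parallel(selection_coeffs):
    """Chance that two populations fix the same mutation, if each one
    independently picks mutation i with probability s_i / sum(s)."""
    total = sum(selection_coeffs)
    return sum((s / total) ** 2 for s in selection_coeffs)

def mean_prob_parallel(n_beneficial, replicates=20000, seed=1):
    """Average prob_parallel over many random draws of selection coefficients."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(replicates):
        # Assumed exponential fitness effects; the scale cancels in the ratio.
        coeffs = [rng.expovariate(1.0) for _ in range(n_beneficial)]
        acc += prob_parallel(coeffs)
    return acc / replicates

for n in (2, 5, 10):
    # 1/n is the chance of a match if all n mutations were equally likely to fix.
    print(n, round(mean_prob_parallel(n), 3), "equal-odds baseline:", round(1 / n, 3))
```

In this toy setup the average works out analytically to 2/(n + 1), which is always above the 1/n expected when every available mutation is equally likely to fix. Selection makes parallel fixation more likely than chance alone would, and the advantage grows as the pool of beneficial mutations gets larger.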
Does this study apply to natural populations? Even most closely related populations typically differ in ecology in some respects, so it is hard to say that the model where mutations have the same fitness characteristics in two different populations is always relevant. Likewise, over the long term it is likely that a natural population will be as near to an optimum allele as is practicable. That is to say, the argument above that wild-type alleles are unlikely to plummet in relative fitness, carried to its logical extreme, would predict that any natural population of substantial size would already have had the opportunity to explore all the adaptive space available to it by recurring mutations.
Only in fairly unusual circumstances will populations be limited from achieving higher fitness (for any single gene) because mutations don't occur often enough. Instead, they will be limited by the fact that the mutations that do occur are never more adaptive than the current wild-type. The unusual circumstances would include cases in which the adaptive landscape really is complex; for example, where the phenotypic characters influenced by the gene are themselves subject to complex patterns of stabilizing selection. Here, the possibility for stepped advantages among many genes creates the opportunity for a progression of mutations. That is to say, many genes that interact with each other are all highly optimized and adaptive mutations at each of them are incredibly rare. But when an adaptive mutation occurs at one of these genes, it may shift the interaction in ways that make a new (perhaps recurring and previously neutral or deleterious) mutation at one or more of the other genes more likely to be adaptive. In this way, a highly polygenic trait might be mutation-limited in its evolution, while no individual gene can be said to be mutation-limited.
References:
Orr HA. 2005. The probability of parallel evolution. Evolution 59(1):216-220.
Why have variants influencing recombination rate been selected in non-Africans?
A complicated story is tangled through this paper by Augustine Kong and colleagues, and I don't see where it may end. But here's the abstract:
The genome-wide recombination rate varies between individuals, but the mechanism controlling this variation in humans has remained elusive. A genome-wide search identified sequence variants in the 4p16.3 region correlated with recombination rate in both males and females. These variants are located in the RNF212 gene, a putative ortholog of the ZHP-3 gene that is essential for recombinations and chiasma formation in Caenorhabditis elegans. It is noteworthy that the haplotype formed by two single-nucleotide polymorphisms (SNPs) associated with the highest recombination rate in males is associated with a low recombination rate in females. Consequently, if the frequency of the haplotype changes, the average recombination rate will increase for one sex and decrease for the other, but the sex-averaged recombination rate of the population can stay relatively constant.
Perhaps it's not so curious that alleles of this gene have opposite effects on recombination in males and females. The mechanisms of gamete production are obviously different in the two sexes, and we might expect some kind of frequency-dependent mechanism to regulate recombination. At least, it's a hypothesis.
What I find mysterious is this:
A phylogenetic analysis of a 55-kb region containing rs3796619 and rs1670533 in the HapMap data (24) revealed three well-differentiated clusters of haplotypes showing notable differences in frequency between the Yoruban Nigerians (YRI) and CEU and East Asians (CHB and JPT) (fig. S6). The [C,T] and [T,C] haplotypes that associate most strongly with recombination rate have a combined frequency of only 17% in the YRI sample, but reach a frequency of 91% and 98% in the CEU and East Asian samples, respectively. Several SNPs in this region show an unusual degree of divergence among the HapMap groups, on the basis of the rank percentile of their FST values (Wright's coefficient, a measure of variance in allele frequencies among populations) among all autosomal SNPs with the same overall frequency in the HapMap. Specifically, we identified eight SNPs whose FST values are in the top 0.5% for differences between the YRI and East Asian HapMap samples and also in the top 5% of differences between the YRI and CEU samples. Each of these SNPs differentiated a subset of [T,T] haplotypes from the rest, perhaps indicating an episode of positive selection (or a severe founder effect) that increased the frequency of [C,T] and [T,C] haplotypes in the ancestors of European and East Asian populations.
The [C,T] and [T,C] haplotypes are the ones associated with increased recombination rate in males and females, respectively. The markers are in strong disequilibrium (no [C,C] haplotypes were observed), and seem to have been selected outside of Africa.
I have no idea why.
The recombination rates were all inferred from a large Icelandic sample, so maybe the rates don't really characterize the haplotypes in other populations. Maybe recombination rate is incidental to the real reason for the selection. Or maybe in populations roaring with positive selection on many genes at once, it is a good thing to break them apart more often.
References:
Kong A and 16 others. 2008. Sequence variants in the RNF212 gene associate with genome-wide recombination rate. Science 319:1398-1401. doi:10.1126/science.1152422
Selection on synonymous mutations
Here's an interesting thought:
Background
In mammals, contrary to what is usually assumed, recent evidence suggests that synonymous mutations may not be selectively neutral. This position has proven contentious, not least because of the absence of a viable mechanism. Here we test whether synonymous mutations might be under selection owing to their effects on the thermodynamic stability of mRNA, mediated by changes in secondary structure.
Results
We provide numerous lines of evidence that are all consistent with the above hypothesis. Most notably, by simulating evolution and reallocating the substitutions observed in the mouse lineage, we show that the location of synonymous mutations is non-random with respect to stability. Importantly, the preference for cytosine at 4-fold degenerate sites, diagnostic of selection, can be explained by its effect on mRNA stability. Likewise, by interchanging synonymous codons, we find naturally occurring mRNAs to be more stable than simulant transcripts. Housekeeping genes, whose proteins are under strong purifying selection, are also under the greatest pressure to maintain stability.
Conclusion
Taken together, our results provide evidence that, in mammals, synonymous sites do not evolve neutrally, at least in part owing to selection on mRNA stability. This has implications for the application of synonymous divergence in estimating the mutation rate.
That's the abstract of a study (free text online) in Genome Biology by J. V. Chamary and Laurence D. Hurst.
What is the net effect of such selection? The short answer is nobody has any idea. Consider:
The substitution rate at synonymous sites in exons is often used as a measure of the mutation rate [8,9]; however, this assumes neutral evolution of synonymous mutations [1,2]. By providing a parsimonious mechanism by which selection could act on synonymous sites, we can ignore the objection that prior evidence is indirect. Nevertheless, it is presently unclear to what degree synonymous mutations are favored or opposed by selection due to their effects on mRNA stability. Without being able to quantify the latter, as well as the net effect of other biases (for example, splice-associated), it will not be possible to directly estimate the extent to which use of the synonymous substitution rate leads to underestimates of the mutation rate and the mutational load.
And:
Indeed, it is quite possible that there exist no preferred codon within a gene while at the same time synonymous mutations are under selection. More generally, a complex set of trade-offs between different forms of selection and mutational biases may render interpretation of patterns of codon usage very difficult.
It seems possible to me that selection on mRNA stability may allow certain kinds of fine-tuning changes, analogous to selection on promoters or inhibitors. If the half-life of an mRNA within a cell could be decreased slightly, it might well have an adaptive (or conversely maladaptive) result. Would it make enough difference to be selected? Maybe not in most cases, but in some cases it might well do so. And Chamary and Hurst are able to show that genes subject to strong purifying selection appear to have greater constraint on mRNA stability -- in my view the most persuasive of their arguments for the effect.
It also occurs to me that selection on mRNA structural stability would be a consideration for the origin of the genetic code. Certain patterns of redundancy in silent sites would potentially be targets of selection, for long-lasting or shorter-lasting RNA codons as options for the same amino acid.
It just goes to show how many real unknowns we still face when looking at molecular evolution. If synonymous mutations aren't really neutral, what will we discover next?
References:
Chamary JV and Hurst LD. 2005. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biol 6:R75. Free full text online
More on selection outside Africa
I have a feeling I'll have many occasions to use that headline.
Here's the central part of Stajich and Hahn (2004) concerning selection in Europeans vs. Africans:
There is a highly significant, positive relationship between Tajima's D statistic from loci in the European-American data set and recombination rates across loci, even while controlling for (R2 = 0.09, F = 14.7, P = 0.0002). This relationship is not significant in the African-American data (R2 = 0.02, F = 3.1, P = 0.08). If we do not control for levels of polymorphism and use simply the Tajima's D statistic values alone, there is a marginally significant relationship between recombination and D statistic values in both populations (fig. 4) (EA, R = 0.10, F = 16.5, P < 0.0001; AA, R2 = 0.03, F = 4.9, P = 0.027). The positive relationship between the D statistic values and recombination in the European-American population suggests that multiple hitchhiking events have been associated with the migration out of Africa and colonization of novel habitats. Repeated fixation of advantageous mutations throughout the genome has caused a skew in linked variation towards lower frequencies, with more pronounced effects in regions of low recombination. The lack of a relationship in the African-American sample could be an effect of sampling from an admixed population, but the rank-order of D statistic values in this population is significantly correlated with the values in the European-American sample (Spearman's = 0.33, P < 0.0001). We believe that the significant correlation present only in the European-American sample is caused by the increased number and/or effect of advantageous alleles that are associated with migration into new habitats. This conclusion assumes that the environments in Africa are more similar to the ancestral human environments than those found outside of Africa, and that migration out of Africa brought humans into novel environments. 
These conclusions are consistent, however, with a number of recent studies also reporting a preponderance of evidence for selective sweeps in non-African human populations (Kayser, Brauer, and Stoneking 2003; Storz, Payseur, and Nachman 2004). Also, because we have controlled for levels of polymorphism, the relationship between recombination and the frequency spectrum in European-Americans should be independent of any mutagenic effect of recombination (Hellmann et al. 2003).
The paper ends this way:
Whereas demographic events such as population bottlenecks or expansions will affect all genes in a genome, natural selection is expected to have only locus-specific or region-specific effects on DNA variation. Our analyses have shown that the demographic histories of human populations can largely account for the level and frequency of variation across the genome. However, even working within a nonequilibrium framework, we were able to show deviations from neutral expectations at the ABO and TRPV6 loci and in many regions of low recombination. The results for this data set are consistent with the combined effects of a population bottleneck and repeated selective sweeps in the human migration out of Africa -- in agreement with previous reports (Kayser, Brauer, and Stoneking 2003; Storz, Payseur, and Nachman 2004) -- and suggest that natural selection affects a relatively large proportion of the genome.
The evidence for widespread selection, in other words, is the association of Tajima's D and recombination rate; this association is significant in Europeans but not in African-Americans. The lack of such an association in the African-American sample may be partly explained by a dual ancestry among Africans and Europeans, since recent population mixture tends to inflate the number of rare alleles in a sample (and hence decrease Tajima's D).
This is not the greatest test for selection, since recent selection (i.e. new positively selected variants that are not yet near fixation) may not strongly affect Tajima's D. And the effect in Europeans is manifested as high-recombination loci having positive D values, not low-recombination loci having negative values. So clearly there is selection here, but it is not being picked up that strongly.
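For readers who want to see what the statistic is measuring, here is a minimal sketch of Tajima's D computed from per-site derived-allele counts, using the standard constants from Tajima's 1989 formulation. The function name and the toy inputs are mine, and this is not the pipeline used in the paper.

```python
import math

def tajimas_d(derived_counts, n):
    """Tajima's D for a sample of n chromosomes.

    derived_counts holds, for each segregating site, how many of the n
    chromosomes carry the derived allele (1 .. n - 1).
    """
    S = len(derived_counts)
    if S == 0 or n < 4:
        raise ValueError("need at least one segregating site and n >= 4")
    # Mean pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in derived_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Toy data: ten sites in a sample of ten chromosomes.
print(tajimas_d([1] * 10, 10))   # all singletons: negative D
print(tajimas_d([5] * 10, 10))   # all intermediate-frequency: positive D
```

An excess of rare variants, as expected after a sweep or during population growth, drags D negative. An excess of intermediate-frequency variants pushes it positive. That is why the sign of D at high- versus low-recombination loci matters for the interpretation above.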
References:
Stajich JE, Hahn MW. 2004. Disentangling the effects of demography and selection in human history. Mol Biol Evol 22:63-73.
Posted at 21:24 on 12/11/2005
At least 10 percent of human genes under recent selection
It's hard to beat the abstract of this paper by Eric Wang and colleagues (2006):
By using the 1.6 million single-nucleotide polymorphism (SNP) genotype data set from Perlegen Sciences [Hinds, D. A., Stuve, L. L., Nilsen, G. B., Halperin, E., Eskin, E., Ballinger, D. G., Frazer, K. A. & Cox, D. R. (2005) Science 307, 1072-1079], a probabilistic search for the landscape exhibited by positive Darwinian selection was conducted. By sorting each high-frequency allele by homozygosity, we search for the expected decay of adjacent SNP linkage disequilibrium (LD) at recently selected alleles, eliminating the need for inferring haplotype. We designate this approach the LD decay (LDD) test. By these criteria, 1.6% of Perlegen SNPs were found to exhibit the genetic architecture of selection. These results were confirmed on an independently generated data set of 1.0 million SNP genotypes (International Human Haplotype Map Phase I freeze). Simulation studies indicate that the LDD test, at the megabase scale used, effectively distinguishes selection from other causes of extensive LD, such as inversions, population bottlenecks, and admixture. The 1,800 genes identified by the LDD test were clustered according to Gene Ontology (GO) categories. Based on overrepresentation analysis, several predominant biological themes are common in these selected alleles, including host-pathogen interactions, reproduction, DNA metabolism/cell cycle, protein metabolism, and neuronal function.
Most tests of selection are blunt instruments. They depend on observations of the frequency spectrum of mutations, but mutations don't happen very often for most genetic loci. With most methods, recent selection is very difficult to find. It's like trying to find potholes when you're driving a tank -- it takes a pretty big pothole to notice anything. To find a higher proportion of the selection that happened, you need a more sensitive metric.
The mark of a selected allele is a rapid increase in frequency. If the selection is recent, then the allele should appear to have originated recently. A rapid increase in the frequency of an allele leaves a pattern of linkage disequilibrium (LD), because recombination does not have a chance to break the selected locus apart from nearby neutral loci. The longer ago the allele increased in frequency, the more recombination and the less LD.
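The decay itself follows a simple textbook expectation: under random mating, the LD between two loci shrinks by a factor of (1 - r) each generation, where r is the recombination rate between them. The sketch below shows only that expectation, not the LDD test; the function name and the value of r are hypothetical.

```python
def ld_remaining(initial_ld, recomb_rate, generations):
    """Expected LD after a number of generations: D_t = D_0 * (1 - r) ** t."""
    return initial_ld * (1.0 - recomb_rate) ** generations

# Assumed recombination rate between the selected site and a nearby marker.
r = 0.001
for t in (0, 100, 1000, 5000):
    print(t, round(ld_remaining(1.0, r, t), 4))
```

With these numbers most of the original association survives for a hundred generations and almost none of it survives for five thousand, which is why extended LD around a common allele points to a recent rise in frequency.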
Wang et al. (2006) used the prediction that the LD should decrease over time to establish a test of recent selection. They surveyed the linkage among nearby SNPs to determine whether a variant has increased rapidly in frequency during the recent past. The sensitivity of this test depends on the SNP coverage of the genome. At present, SNP coverage is very good for variants with moderate to high frequencies, so although low-frequency selected variants (those with less than a 5 - 10 percent global frequency) were missed by the current survey, it has found a huge number of selected loci.
In conclusion, we have introduced a simple probabilistic method to detect unusual genetic architectures associated with recent selection that does not require haplotype information. It is, therefore, suitable for large chromosomal scans with large population samples. Homo sapiens have undoubtedly undergone strong recent selection for many different phenotypes, including but certainly not limited to the general categories we have defined in this work (Fig. 5). Such inferred selective events are not rare (Fig. 3). The numbers obtained, however, are similar to estimated numbers obtained for artificial selection (by humans) on the maize genome (45). Given that most of these selective events likely occurred in the last 10,000-40,000 years, a time of major population expansion out of Africa followed by regional shifts from hunter-gatherer to agrarian societies, it is tempting to speculate that gene-culture interactions directly or indirectly shaped our genomic architecture (46, 47). As such, we suggest that such recently selected alleles may provide useful "markers" for investigating the evolutionary migrations of our species, as an adjunct to studies using neutral markers. We also propose that many of these alleles, because of their high prevalence and recent selection, should be considered likely "functional candidates" for association with human variability and the common disorders afflicting humankind.
They also assign the loci with evidence of recent selection to different functional categories. Pathogen-host interaction loci have a high representation in the recently selected genes, as do genes related to protein and gene metabolism. And this:
One of the more intriguing categories overrepresented in inferred selective events is neuronal function. We define this category to include a diverse assortment of genes, including the serotonin transporter (SLC6A4), glutamate and glycine receptors (GRM3, GRM1, and GLRA2), olfactory receptors (OR4C13 and OR2B6), synapse-associated proteins (RAPSN), and a number of brain-expressed genes with largely unknown function (ASPM, RNT1; see Fig. 4).
It would be hard for me to overstate how important this paper is. Even if it weren't central to my own current research (about which you will just have to wait for more), it brings home the vast importance of adaptive change during the most recent parts of human evolution.
References:
Wang ET, Kodama G, Baldi P, Moyzis RK. 2006. Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc Nat Acad Sci USA 103:135-140. Abstract
Evaluating selection and demography in human evolution
Williamson et al. (2005) present a new mathematical method for deriving information about population size change and selection from the allele frequency spectrum of variation taken at multiple genetic loci. Their method depends on separating sites that are selected from those that are neutral, and thereby isolating the effects of demography from those of selection. They then apply their technique to human genetic data to derive estimates of the average selection on selected sites, and the timing and magnitude of population size change from nonselected sites.
Oh, if it were really so easy.
Selection
To be fair, the paper states a major concern with accurately identifying selection in the context of species like humans and Drosophila that have experienced recent population growth. In other words, the main interest is not in deriving evidence about human prehistory, but in making sure that estimates of selection are not biased by population growth.
With respect to selection, their major conclusion is as follows:
We find evidence that negative selection on nonsynonymous mutations is widespread, which implies that deleterious mutations make up a significant proportion of standing nonsynonymous variation. Exactly how this genetic variation contributes to phenotypic variation is a matter of considerable debate, especially for medically interesting phenotypes such as multifactorial genetic disease. Because deleterious mutations, by definition, have phenotypic effects, and because of the widespread nature of negative selection on nonsynonymous mutations, it seems likely that negatively selected, generally rare nonsynonymous SNPs have some negative impact on human health. If there is a general relationship between nonsynonymous polymorphism and human genetic disease, then our genomic estimates of the fitness effects of different types of mutations contain prior information about the likelihood that a mutation contributes to disease. It may be possible to use this information to aid in identifying SNPs that cause disease. Other studies have suggested this approach (e.g., Livingston et al. 2004), but it was unclear which of the many measures of exchangeability to use. We feel that the relative fitness of different amino acid changes is the best way to evaluate exchangeability, and we have done that here by using a model that includes demography and selection (Williamson et al. 2005:7887).
Readers may note that other studies have found evidence for a very high proportion of positive selection across the human genome (discussed in this post). The test applied in the current paper is not well suited to detecting evidence of positive selection, particularly if it is widespread, because it depends on the difference in frequency spectra between "selected" and "neutral" sites. Why the scare quotes? Because although noncoding sites or synonymous SNPs may well be neutral in the literal functional sense of not being targets of selection, it is impossible to verify that they are unlinked to selected sites. For the purposes of detecting negative (purifying) selection, this is not such a problem, because linkage will affect nearby sites only weakly (although this weak effect, called "background selection," may well influence the average level of variation in Drosophila).
In any event, even if positive selection has been very common across the genome, most sites that have been subject to positive selection should have been fixed long ago. Only a few should still be under selection now, and these are predominantly very recent mutations.
Consider the following scenario. The study considered 301 human genes. According to common knowledge, repeated here, positive selection leads to a relative excess of high-frequency alleles, compared to the predictions of neutrality (which predicts that there should be very few high-frequency alleles). But these high-frequency variants represent only a small proportion of the total number of genes currently under positive selection, since an allele being driven to fixation passes through every intermediate frequency, not merely the high ones. To detect evidence for positive selection, this study would have to find dozens of high-frequency variants in excess of neutral theory, representing scores of selected genes. But suppose instead that only one positively selected gene actually was in the sample. If so, then out of the human genome of approximately 20,000 genes, we might expect to find 60 or 70 genes currently under positive selection. In our fictive scenario it would be rash to extrapolate from a sample of 1, but in fact there are good reasons to think the true number is much higher. One such gene might take 1000 generations to transit from its appearance to fixation. There have been 100,000 generations in the 2 million years since the origin of our genus, and at least 300,000 since our divergence from chimpanzees. In other words, the complete transformation of the human genome by positive selection, altering thousands of genes -- or even all of them, multiple times -- would be far from detectable by this test.
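The arithmetic behind that scenario can be laid out explicitly. The figures are the ones given above. The one added assumption is a steady state, in which each batch of concurrent sweeps is replaced by a new batch once it fixes; the variable names are just for illustration.

```python
genes_sampled = 301            # genes in the study
genome_genes = 20_000          # approximate human gene count used above
selected_in_sample = 1         # the hypothetical single selected gene

# Extrapolate from the sample to the whole genome.
concurrent_sweeps = selected_in_sample * genome_genes / genes_sampled

generations_per_sweep = 1_000        # appearance to fixation
generations_since_genus = 100_000    # about 2 million years
generations_since_chimps = 300_000   # divergence from chimpanzees

# Steady-state assumption: a new batch of sweeps starts as the old one fixes.
fixed_since_genus = concurrent_sweeps * generations_since_genus / generations_per_sweep
fixed_since_chimps = concurrent_sweeps * generations_since_chimps / generations_per_sweep

print(round(concurrent_sweeps))     # genes under selection at any one time
print(round(fixed_since_genus))     # genes altered since the origin of Homo
print(round(fixed_since_chimps))    # genes altered since the chimpanzee split
```

On these assumptions a single selected gene in the sample corresponds to several thousand gene-level fixations since the origin of the genus, and to a number close to the whole gene count since the divergence from chimpanzees.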
But remarkably, this test does find evidence for positive selection -- in noncoding substitutions! The authors put it less sensationally: "Interestingly, we find marginal evidence for weak positive selection on noncoding indel polymorphisms" (7885). I have no explanation for it. But if there actually is a statistically detectable excess of high-frequency variants for these polymorphisms, it may reflect selection at linked sites, or issues with the composition of the sample. If the level of positive selection is detectable, it is further strong evidence of the power of such selection over the long timespan of human evolution.
In contrast to positive selection, even very strong purifying selection may leave low-frequency variants within the population for a long time. These variants are picked up within samples in large numbers. Low frequency variants are predicted to make up most genetic variation under neutrality, so the proportion of such variants is always a substantial part of the sample. High numbers make for powerful tests. For the human data examined in this study, the nonsynonymous coding sites have a higher proportion of low-frequency variants than do the noncoding, synonymous sites. Thus, they provide strong evidence of negative (purifying) selection.
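The claim that low-frequency variants dominate under neutrality comes from a standard result: in a constant-size neutral population, the expected number of segregating sites at which the derived allele appears i times in a sample is proportional to 1/i. Here is a minimal sketch of that expectation; the sample size of 20 is arbitrary and the function name is mine.

```python
def neutral_sfs_proportions(n):
    """Expected share of segregating sites with derived-allele count
    i = 1 .. n - 1 under the standard neutral model (weights of 1/i)."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

props = neutral_sfs_proportions(20)
print(round(props[0], 3))             # share of sites that are singletons
print(round(props[0] + props[1], 3))  # share seen on one or two chromosomes
print(round(sum(props[9:]), 3))       # share with the derived allele on 10 or more
```

Purifying selection pushes the observed spectrum even further toward the rare end than this baseline, which is the comparison the study makes between nonsynonymous sites and the putatively neutral ones.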
Human demography
So the results of the method applied to selection are mixed. It detects the weak force of purifying selection strongly; it detects the strong force of positive selection weakly. But as the authors perceptively note, the inference of demographic history and the inference of selection are not independent of each other. Therefore, the inferences about demography are in part subject to the weaknesses in detecting the effects of past selection. This study shares this problem with all previous work that has attempted to estimate past human population size from genetic evidence.
How can selection affect interpretations of demography? Here's one way: Positive selection occurs rapidly relative to the rate of recombination between sites. This means that a selective sweep may affect a relatively large section of a chromosome, including many "neutral" sites. This is the principle behind John Gillespie's (2002) pseudohitchhiking, or "genetic draft" model of neutral evolution. In a nutshell, if positive selection has been common, there is no reason to think that genetic variation at noncoding sites provides any indication of demographic parameters. The current study (by Williamson et al. 2005) assumes that positive selection has not had such an effect, nor has any other force significantly affected the variation of neutral sites.
These are the kinds of influences that have been suggested to result in the large difference between census population sizes (the number of individuals within living species) and estimates of effective population sizes (measures of the rate of genetic drift) in nature. In humans and in most other animal species, genetic drift on neutral sites appears to have been much stronger than the census population sizes of those species would predict. This is a systematic difference that leads species to have much lower genetic variation than would be expected if they evolved under genetic drift alone. At present, the relative importance of selection and demographic factors in leading to this systematic difference is unknown. I suspect that selection has been very important in this difference; others argue that demographic factors have been the most important.
In most previous genetic work, the estimated human effective population size (denoted as Ne) is around 10,000 individuals. Some scientists have suggested that the human population actually was once that small -- that only a few tens of thousands of people once made up the entirety of humanity. If this were true, then the human population must have expanded in size massively sometime in the recent past. The evidence for a recent change in the mitochondrial DNA molecule was once suggested to be evidence for this change in population size, which was inferred to have occurred during the Late Pleistocene, perhaps 50,000 years ago. From these estimates comes the scenario of an expansion from a single small African population beginning after 100,000 years ago, reaching Europe and the Far East by 30,000 - 50,000 years ago.
Recently, it has become clear that a single massive expansion of a global human population cannot explain the pattern of genetic variation in living people. Simply put, the pattern of the 16,000 base pairs of the mtDNA molecule is not replicated by the 3 billion base pairs of the nuclear genome.
To be sure, some genes do show a pattern of recent ancestry and apparent expansion. The FoxP2 gene, for example, has a recent common ancestor for living people (within the past 200,000 years), and shows strong signs that it has not evolved neutrally. If all other genes looked like this, it would be strong evidence of massive population growth.
But most genes emphatically do not look like this. This has been understood for several years, following reviews by Molly Przeworski and colleagues (2000), Jeff Wall (2000), and even my own dissertation (Hawks 1999). Many genes show no excess of rare variants, and most show only a slight excess. The average gene shows no sign whatsoever of a massive population expansion during the Late Pleistocene. This has been concluded most powerfully by recent genome-wide studies of SNP variation by Marth and colleagues (2003; 2004; reviewed previously in this post).
Where does the current paper (Williamson et al. 2005) come in? Summarizing evidence from over 300 genes, this study does find evidence of a population expansion. Yes indeed -- a population expansion that happened 18,000 years ago! This expansion took the human population from a previous size of around 8000 individuals to a current size of around 50,000 individuals.
Of course these estimates are far from realistic in anthropological terms. If anything, 18,000 years ago much of the human population should have been contracting rather than expanding. The idea that the human population could have been as small as 8000 individuals (or very generously 100,000 individuals) during the Last Glacial Maximum (LGM) is simply ridiculous. By that time, the certain ancestors of living people were present from the western tip of Iberia to the edge of (or possibly well into) Beringia. If a genetic estimate cannot gauge a population that must have numbered several millions of people, it is time to stop talking about genetic estimates.
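The scale of the mismatch is easy to check with a little arithmetic. The Python sketch below uses the two effective sizes reported in the paper (8000 before, 50,000 after). The figure of 3,000,000 is only a hypothetical stand-in for "several millions" and is not an estimate from any source.

```python
import math


def fold_change(before, after):
    """How many times larger the later size is than the earlier one."""
    return after / before


def orders_of_magnitude(before, after):
    """The same change expressed as a base-10 logarithm."""
    return math.log10(after / before)


if __name__ == "__main__":
    estimated_before = 8_000   # pre-expansion effective size from the paper
    estimated_after = 50_000   # post-expansion effective size from the paper
    # Hypothetical placeholder for "several millions"; an assumption only.
    assumed_census = 3_000_000

    print(fold_change(estimated_before, estimated_after))
    print(orders_of_magnitude(estimated_before, estimated_after))
    print(orders_of_magnitude(estimated_before, assumed_census))
```

On these numbers the inferred expansion is a 6.25-fold increase, less than a single order of magnitude, while the gap between the pre-expansion estimate and the placeholder census figure is more than two orders of magnitude.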
To be fair, the demographic conclusions of the paper are phrased cautiously:
Therefore, although we find it striking that the time of population growth (18,200 years B.P.) roughly corresponds with events in human history that may have induced population growth, such as the end of the last ice age and the origin of agriculture, we feel that our demographic inferences should be interpreted cautiously until the full range of plausible demographic models has been explored in one coherent framework (7887).
At the same time, this apparently cautious discussion raises the more critical problem of a complete lack of communication or citation from any anthropologist. Hmm, I guess the last glacial maximum does roughly correspond to the "end of the last ice age and the origin of agriculture," in the usual manner of genetics confidence intervals. That is to say, it is only twice as old as either, so it might as well be the same.
Speaking of confidence intervals, again in this paper there are none. No confidence intervals on the demographic estimates, no confidence intervals in the supporting text, no figure showing the likelihood surface, none, nothing, nada.
What remains?
In a sense, the bone I am picking is different from that pursued by Williamson et al. (2005). What I care about is evidence for ancient demography. What they care about is better quantifying selection. I think that their paper is incomplete on their own terms, because of the problem quantifying positive selection, but that it is a credible theoretical effort. In particular, the insights about the frequency of genetic disorders based on their findings are a likely contribution to the future study of genetic variation in coding gene regions.
But the inclusion of demography in this study confuses much more than it clarifies from the perspective of the anthropologist. Its estimates of demographic changes are clearly false, and the lack of detail about confidence intervals makes them impossible to evaluate. In the face of this fatal problem, it is fair to wonder whether the apparent insights about purifying selection have any value.
The main importance of the data from these many genes is what they do not show. They do not show an expansion of many orders of magnitude. They do not show a current effective size that is anywhere near the current human population size (or a size sufficient to settle any large part of the world). They do not show evidence for expansion coincident with an "out of Africa" movement of people, over 50,000 years ago.
Instead, the conclusion is concordant with the discussion of Eswaran and colleagues (2005:3):
Thus, the nuclear data do not consistently signal expansion, and when they do, the signal is of a mild expansion, perhaps reflecting only post-Pleistocene population growth associated with the spread of agriculture.
The summary of current work is that we can completely exclude the hypothesis that "neutral" genetic variation in humans is explained entirely by past human population size. It simply cannot be true, because if it were, there should be strong signs of expansion that we do not in fact observe.
On the other hand, perhaps genetic data may tell us something about past human population size, even if population size is not the only explanation for genetic variation. We might expect that some demographic changes may have influenced genetic variation in distinctive ways that could be separated from the effects of selection. If so, then the results of the current paper may be relevant. If natural selection -- especially purifying selection -- explains most rare alleles at nonsynonymous coding sites, then perhaps the residue of rare alleles at synonymous or noncoding sites is a sign of recent changes in demographic patterns?
This possibility is suggestive, but it appears at present to be fairly far from the data. If unknown factors (which may include selection) have altered "neutral" genetic variation by an order of magnitude or more from their neutral predictions, then it is hard to believe that a relatively small change in population size will be accurately measured by any genetic observations.
References:
Eswaran V, Harpending HC, and Rogers AR. 2005. Genomics refutes an exclusively African origin of humans. J Hum Evol Online advance before print.
Gillespie JH. 2000. Genetic drift in an infinite population: the pseudohitchhiking model. Genetics 155:909-919.
Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, and Bustamante CD. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Nat Acad Sci USA 102:7882-7887.
John Hawks Department of Anthropology
University of Wisconsin—Madison
Copyright © 2007 John Hawks