population genetics

I can't believe the amount of attention the paper by Martin Nowak, Corina Tarnita and Edward O. Wilson [1] has gotten. It was in last week's Nature. The basic idea was that the evolution of eusociality in insects could be explained in a different way than the usual explanation, which involves calculating the relatedness of worker insects to their reproductive siblings. Eusociality has been one of the most visible applications of inclusive fitness theory -- that is, the observation that the fitness of a gene that alters behavior may be calculated in terms of its effects on the reproduction and survival of relatives. The paper notes that some aspects of eusociality are not well explained in terms of relatedness, and derives an alternative explanation.

The weird part of the paper is the way it describes inclusive fitness as some kind of theoretical afterthought, useful only as an ad hoc explanation for eusocial insects. It contrasts the inclusive fitness concept with "standard natural selection" as if it were possible for organisms to erase the fact that they're related to each other! And the authors imply that they have fatally damaged the concept of kin selection.

It's so contrary to evolutionary theory that I thought maybe I was missing something. But I've been spending time on another problem this week and haven't had time to follow it up.

Fortunately, Jerry Coyne and Richard Dawkins have both given the paper some attention, and written notes and reactions to it. First Coyne ("A misguided attack on kin selection") reminds us of why kin selection has been such a successful part of "standard" evolutionary theory for the past fifty years.

Sex ratio theory, in which mothers produce different proportions of males and females, has been a particularly fruitful area for applying inclusive fitness theory. So has “altruism”—suicidal honeybees are just one example. And so are parental care and aspects thereof, especially parent-offspring conflict, a field brought to life by Bob Trivers using inclusive fitness theory. How else can you explain weaning conflict except by a conflict between the mother’s genetic welfare and that of her offspring?

I’m baffled not only by Nowak et al.’s apparent and willful ignorance of the literature, but by statements that are just wrong. They flatly assert, for instance, that “inclusive fitness theory” is something different from “standard natural selection theory.” But it’s not: it’s simply a natural extension of population genetics to the situation in which one’s behavior affects related individuals.

Richard Dawkins has also posted notes about the paper:

Kin selection is not a subset of group selection, it is a logical consequence of gene selection. And gene selection is (everything that Nowak et al ought to mean by) 'standard natural selection' theory: has been ever since the neo-Darwinian synthesis of the 1930s. Inclusive fitness theory is not some kind of supernumerary excrescence, to be 'resorted to' only if 'standard natural selection theory' is found wanting (Misunderstanding One). On the contrary, inclusive fitness theory is one way of expressing what was logically inherent in the synthesis ever since Fisher and Haldane, but had been largely overlooked because people (with the exception of those two geniuses) didn't think about collateral kin.

Yes, unless they're going to repeal the Price equation, they'll have to rely on relatedness to explain those phenotypes that never occur in reproductive individuals. As Dawkins puts it, "You have to talk about shared genes in individuals, with conditional phenotypic expression."


References

Mailbag: mtDNA "out of whack"

Re: "Time to revise the mtDNA timescale?":

You said "The timescale of mtDNA divergence is already out of whack with the rest of the genome."

What's the time scale for the rest of the genome? It seems to me it should be expected to be at least twice as much as that for mtDNA since at least half the instances of mtDNA - those in males - dead end each generation. With perfect mixing and replacement, 50% of the mtDNA instances pass from one generation to the next, while 75% of the autosomal instances do. Imperfect mixing and replacement would make both numbers lower, but the mtDNA number would still remain much lower than the autosomal number, so the coalescence time should still be expected to be much lower.

Thanks for noticing that, it's leading to something but I haven't yet described the problem. My apologies for being less than clear.

What you're describing (you probably already know) is commonly described as the "four-times rule" -- the uniparental inheritance and single copy number give mtDNA one fourth the effective size, on expectation, as an autosomal locus.
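The counting behind the four-times rule can be sketched in a few lines. This is only an illustration under simple assumptions: an equal sex ratio, every individual carrying two autosomal copies, and mtDNA behaving as a single copy transmitted only by females. The function name and the population numbers are made up for the example.

```python
from fractions import Fraction

def mtdna_to_autosomal_ratio(n_females, n_males):
    """Ratio of transmitting mtDNA copies to autosomal copies.

    Assumes two autosomal copies per individual and one effective
    mtDNA copy per female (males do not transmit mtDNA).
    """
    autosomal_copies = 2 * (n_females + n_males)
    mtdna_copies = n_females
    return Fraction(mtdna_copies, autosomal_copies)

# Hypothetical population with an equal sex ratio
ratio = mtdna_to_autosomal_ratio(n_females=5000, n_males=5000)
print(ratio)  # 1/4
```

Changing the sex ratio in the call shows how quickly the one-fourth expectation moves when the assumptions are relaxed.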

That's in a constant-sized population. Which of course we haven't been. For around the past 100,000 years, African populations were big enough that genetic drift didn't decrease their genetic diversity markedly. The mtDNA coalesces around 100,000 years before that, compared to more than 700,000 years for the typical autosomal locus -- it's 7 times instead of four. That discrepancy is probably not significant given the huge intrinsic variance of the coalescent. But I don't think it's been seriously investigated.

The real problem is that the out-of-Africa timescale for mtDNA is now very short -- less than 65,000 years -- while the nuclear timescale looks long -- maybe up to 140,000 years. Maybe these can also be reconciled; it's not yet clear. But it's a problem.

Migration thinking

Murray Cox and Michael Hammer have a short commentary piece in the current BMC Biology, titled, "A question of scale: Human migrations writ large and small" [1]. They review a few recent papers concerning human migration and intermixture -- including the Neandertal genome draft [2], the paper by Chuanxiang Li and colleagues showing Bronze Age admixture in the Tarim Basin [3], and their own work quantifying historical gene flow inside and outside Africa [4].

It's a short review, but I thought their conclusion deserves some thought -- they discuss some of the theoretical complexity of estimating ancient rates of gene flow. The simple model assumes constant rates, but human populations aren't simple.

We expand on just one of these points for illustration (Figure 3). Even when gene flow is inferred explicitly, existing methods invariably assume that it has remained constant through time. However, it seems more reasonable that two diverging populations might share more migrants initially (due to shared geography or existing social relationships), with gene flow subsequently decreasing exponentially as the two populations move apart (Figure 3a). Or gene flow might increase exponentially as two geographically separated populations begin to move closer together (Figure 3b). Alternatively, gene flow might suddenly resume between two long separated populations; for instance, where geographically disconnected populations came back into contact, either as hunter-gatherer groups during the late Pleistocene (Figure 3d), or as human mobility increased following the development of farming in the Holocene (Figure 3c). The important point is this: two populations can look very similar (FST = 0) or very different (FST = 0.3) even when they have exchanged the same number of migrants (that is, graph lines with the same color in figure 3). It is therefore insufficient to consider only how many migrants have moved between populations; we also need to know when these movements occurred.

I don't reproduce the figure, because it's complicated and I think the text is sufficient to establish the point. Averages aren't very meaningful. I'll point out that there is some hope of testing these hypotheses, if we consider selected genes -- which have a time that they originated.
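To make the timing point concrete, here is a minimal sketch. It is not Cox and Hammer's method. It uses a textbook island-model recursion for F, in which migrants are assumed to come from a large unrelated pool, and all the numbers (effective size, number of generations, migration rates) are hypothetical. Three schedules exchange the same total migration but place it early, late, or evenly.

```python
def fst_after(schedule, n_eff):
    """Iterate F' = (1 - m)^2 * (1/(2N) + (1 - 1/(2N)) * F) over a
    per-generation migration schedule, starting from F = 0."""
    f = 0.0
    drift = 1.0 / (2 * n_eff)
    for m in schedule:
        f = (1 - m) ** 2 * (drift + (1 - drift) * f)
    return f

# Hypothetical parameters chosen only for illustration
GENS, BURST, M, N_EFF = 2000, 200, 0.01, 1000

early = [M] * BURST + [0.0] * (GENS - BURST)   # contact, then isolation
late = [0.0] * (GENS - BURST) + [M] * BURST    # isolation, then renewed contact
constant = [M * BURST / GENS] * GENS           # same total, spread evenly

for name, sched in [("early", early), ("constant", constant), ("late", late)]:
    print(name, round(sum(sched), 6), round(fst_after(sched, N_EFF), 3))
```

Under these assumptions the early schedule ends with much higher differentiation than the late one, even though the summed migration is identical.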


References

Time to revise the mtDNA timescale?

Krzysztof Cyran and Marek Kimmel (2010) have presented a revised set of estimates of the human mtDNA most recent common ancestor (MRCA). It's an interesting theoretical paper, written for the purpose of developing a method that doesn't rely on the same assumptions as the usual coalescent models.

Their new method gives an estimate of 174,000 years ago for the human MRCA. They report an upper/lower range of 96,000 to 449,000 years ago. That range does not represent a confidence interval on the estimate; it's an upper/lower based on extreme assumptions about human/Neandertal genetic distance and the human/Neandertal MRCA.

The Neandertal mtDNA has really affected the way we estimate human MRCA, at least for the mitochondrial genome. Chimpanzees are just too distant. When we compare human and chimpanzee mtDNA genomes, there has been a lot of parallelism and reversal on both lineages, because mutations have hit the same place multiple times. Multiple hits and purifying selection make a mess out of rate estimation -- generally, they make the human MRCA seem a lot older than it truly was. The Neandertals are closer, and are therefore less of a problem.

But the Neandertal-human MRCA itself was poorly known, as long as we had only chimpanzees to calibrate the mutation rate....

Lag times of biological invasions

A biological invasion occurs when a species rapidly colonizes a new geographical area. The new area is often very far from the regions considered to be part of the species' native range.

Well-known examples include the invasion of the southern states of the U.S. by fire ants (originally South American), zebra mussels (originally eastern European) in the Great Lakes, the dispersal of cane toads (originally South American) in Australia, and grey squirrels (originally North American) in England. I've written about invasive species before, focusing on the example of fire ants.

Many invasions are not instantly successful, and don't really get going until quite a long time after a species is first introduced to a new geographical area. This is called a lag. This phenomenon may seem mysterious. Some alien species seem to cling by their fingernails at low numbers for years, before suddenly exploding into invasiveness.

Passing on your fertility to your kids

From the NY Times earlier this spring, a profile of a New York woman with an exceptional legacy:

WHEN Yitta Schwartz died last month at 93, she left behind 15 children, more than 200 grandchildren and so many great- and great-great-grandchildren that, by her family’s count, she could claim perhaps 2,000 living descendants.

The story talks about her history and how she came to have such a large family. By itself, having 15 children would be unremarkable except that the children and grandchildren themselves all went on to have large families ("Like many Hasidim, Mrs. Schwartz considered bearing children as her tribute to God."). After a couple of generations, it adds up to a lot of descendants.

I don't think the story is all that unique. Within the United States there are many communities, like the Hutterites, Old Order Amish, and Hasidic Jews, where large family sizes are the norm. Probably hundreds of women on earth can claim more than a thousand living descendants, and thousands more have only to wait until they are old enough, while their children and grandchildren's families continue to grow.

You can get there by having 10 children, each of whom has 10, and each grandchild has 10 -- that adds up to 1110, giving some extra for different generation times and losses. Of course, it's a trick to live long enough to see the 1000 great-grandchildren, but the early ones should already have given you a fraction of your 10,000 great-great-grandchildren.
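The arithmetic is just a short geometric sum. As a quick sketch, assuming every person in every generation has exactly the same number of children and nobody is lost (the function name is made up for the example):

```python
def descendants_by_generation(children_per_person, generations):
    """Count descendants in each generation and in total, assuming a
    fixed number of children per person and no losses."""
    per_generation = [children_per_person ** g for g in range(1, generations + 1)]
    return per_generation, sum(per_generation)

per_gen, total = descendants_by_generation(10, 3)
print(per_gen, total)  # [10, 100, 1000] 1110
```

Asking for a fourth generation gives the 10,000 great-great-grandchildren.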

What's surprising here? Not the family sizes themselves -- big families are common in most human populations. The high offspring numbers are not as apparent in populations that have high juvenile and infant mortality, but many pregnancies were the norm prior to the industrial transition.

No, what's surprising about huge numbers of living descendants is the correlation between generations. In these cases, the correlation is driven by religion and various social proscriptions related to religious observance.

I often talk about models and real human population structures in my classes. One obviously unrealistic aspect of the Wright-Fisher population model is its reproductive variance. In the Wright-Fisher model, reproductive variance is binomial -- every gene in an offspring population is equally likely to descend from each gene in the parental generation. In the model, it is possible -- albeit extraordinarily unlikely -- for a single parent to give rise to the entire offspring generation. That just can't happen in a real population, certainly not in humans. The effect of that unrealistic assumption of the model is not great, however, because even in the model the chances of having more than 10 offspring, while possible in theory, are negligible. If anything, the Wright-Fisher model is too conservative about the variance of offspring number -- real human populations have a non-negligible fraction of women who have 10 or more live births.
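How negligible? A rough sketch, under assumptions of my choosing for the example: a constant-size diploid population of N individuals, in which each of the 2N gene copies in the next generation picks its parent uniformly at random. The number of copies descending from one individual is then Binomial(2N, 1/N) with a mean of 2, and I treat that count as a stand-in for offspring number. The value of N is hypothetical.

```python
from math import comb

def prob_more_than(k, n_draws, p):
    """P(X > k) for X ~ Binomial(n_draws, p), by summing the lower tail."""
    lower = sum(comb(n_draws, i) * p**i * (1 - p) ** (n_draws - i)
                for i in range(k + 1))
    return 1.0 - lower

N = 10_000  # hypothetical population size
tail = prob_more_than(10, n_draws=2 * N, p=1 / N)
print(f"P(more than 10) = {tail:.2e}")
```

The tail probability is tiny, which is the sense in which the model is conservative about large families.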

I get more concerned about other deficiencies of simple models, which are sometimes harder to deal with. One of those is the correlation of offspring number between generations. If there is even a slight correlation, women tending to have more children because they came from larger families, it has a major effect on the amount of inbreeding in the population.

You can think about it genealogically. Suppose you live in a small town with a few big families. The chance that you yourself were born into one of those big families is small. But if today's big families tended to come from yesterday's big families, with each generation we go back in time, it becomes more and more likely that one of your ancestors came from one of those big families. Still looking backward in time, your genealogy becomes captured by those big families, branch by branch. Since there are few big families in the town, once two or more lines of your ancestry trace to them, those lines will rapidly share a common ancestor. That's inbreeding, from the perspective of your genealogy.

In small towns, that process isn't inevitable because people move in from elsewhere. Most of the lines of your genealogy will probably come from other towns within a few generations. But if we consider the human species as a small town, well, there's nowhere else to move in from. If the population structure of our species has included a strong correlation of offspring number between generations, it will have massively reduced our genetic variation.

Since we have low genetic variation as a species, you can see why this is potentially interesting.

Masatoshi Nei and Motoi Murata back in 1966 worked out a relation between intergenerational correlation in offspring number and effective population size. That's before the days of computer models, for you simulation jocks out there. The "effective" size of a population, as I've noted here many times, is the one parameter of a Wright-Fisher model, as estimated from the genetic variation within a population. It's a statement about how inbred the population looks, assuming that its evolution followed a random-mating model throughout its history. Now, that model is wrong in pretty much every interesting case, and so there are various mathematical transformations that attempt to account for the effects of different mating structures.

In the case of intergenerational correlation of offspring number, Nei and Murata derived an expression to predict the reduction of effective size to be expected from this correlation, assuming a model in which the variance in offspring number is distributed in a certain way. The solution isn't general -- if offspring number were distributed in some other way, the effect of the same measured correlation may be quite different. And in their model, they were concerned with the case where the correlation of offspring number is influenced by genes that determine fitness -- in other words, genes under selection in the population. So it's not a complete answer, but it's a start.

Nei and Murata cited empirical data from several earlier studies that showed a correlation of 0.20 to 0.40 between generations of human offspring number. Under the assumption of their model, a correlation of 0.30 would cause a reduction of the effective size by roughly half.

That's a big effect. We already expect a reduction of effective size compared to the census count of a human population, because human populations include many non-reproductive individuals -- kids and postreproductive adults make up half to two-thirds of small-scale foragers. If big families have an additional effect of half, it means that the effective size of the population starts out at a fourth to a sixth the census count. So that an effective size of 10,000 really means 40,000 to 60,000 people on the ground.
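Spelled out as arithmetic, with the two reductions simply multiplied together. That multiplication is a simplifying assumption of this sketch, and the function name is made up:

```python
from fractions import Fraction

def census_from_effective(n_effective, reproductive_fraction, family_factor):
    """Census count implied by an effective size, assuming the two
    reductions act independently and multiply."""
    return n_effective / (reproductive_fraction * family_factor)

half = Fraction(1, 2)  # reduction attributed to inherited family size
low = census_from_effective(10_000, Fraction(1, 2), half)   # half are reproductive
high = census_from_effective(10_000, Fraction(1, 3), half)  # a third are reproductive
print(low, high)  # 40000 60000
```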

Still low, but as one factor among many it may be very important -- and possibly the distribution of variance caused a further decline. It's well worth investigating.

A correlation of offspring number between generations can be caused by many ecological or cultural factors. Nei and Murata (1966) considered the case where fitness itself is inherited, because of the presence of selected genes. But in humans, a more pervasive force is cultural inheritance. This factor was discussed in 1976 by the demographer Samuel Preston, attending to the importance of cultural preferences in contemporary populations:

Since children of each generation are drawn disproportionately from families of women with high fertility achievements in the past, it may be expected that a pronatalist selective bias operates each generation with respect to the transmission of "tastes" for children. It has also been suggested that personality traits which may affect fertility achievement, such as the ability to defer gratification, may be transferred to some extent between parent and child (Kantner and Potter, 1954). It is also reasonable to suggest that biological fecundability is partially inherited. The positive correlation between the social classes of parent and child implies that economic constraints impinging on the childbearing process tend to be similar for the two generations (Preston 1976:110).

In small-scale societies, these forces are somewhat different. But I wouldn't expect them to be less -- indeed, the social competition between families is probably more intense. The entire "Machiavellian intelligence" model of cognitive evolution implies that these kin-level effects were pervasive throughout human evolution over the past 2 million years or more. A strong cultural inheritance of fitness is really necessary for selection on genes that influence prosocial kin-related behaviors.

How intense? Seems like a good question to investigate, as it may have a lot of importance to understanding genetic variation in our ancestors -- including our common ancestors with the Neandertals, whose genetic variation was limited just as much as our own.

On the subject of effective population size, I'll be posting next week about chimpanzees and bonobos. More genetically variable than us? Well, some of them...

References:

Preston SH. 1976. Family sizes of children and family sizes of women. Demography 13:105-114.

Nei M, Murata M. 1966. Effective population size when fertility is inherited. Genet Res 8:257-260.

The Neandertal fraction

I've gotten the same question a few times, and have seen it elsewhere, so I thought it would be worth a short post to explain it. And for those readers who've also been asked this question, I thought that being able to provide a simple explanation might be a great help.

How can we say that today's non-Africans derive 1-4% of their genomes from Neandertals, when we are 99.86% genetically similar to Neandertals? Or 98% similar to chimpanzees? I mean, how do we have 4% to work with to make this estimate at all?

Let me explain with another example.

You are approximately 99.9% similar to any other random human today. You're just a bit closer to your relatives, because you got some of your DNA directly from them, or you both share DNA from an immediate ancestor.

Your great-grandmother on average gave you 1/8 of her genes, making up on average 12.5% of your genome.

You are more than 99.9% similar to your great-grandmother, on average, yet she contributed only 12.5% of your genome.

In other words, these percentages are different things -- the fraction of your total ancestry you can trace to her, versus the fraction of base pairs you actually have identical to hers. Your genome is much more identical to hers than can be explained solely by your descent from her -- this is because you share other ancestors in common with her, and because mutations don't happen very often.

Or, think about it the opposite way. Suppose that the 12.5 percent of your genome you inherited from your great-grandmother meant that you were only 12.5 percent genetically similar to her. Where did you get the rest of your genes from? A turnip? No, you got them from other people, all of whom are roughly 99.9 percent like your great-grandmother. You're 12.5 percent more like your great-grandmother than you are like randomly chosen people in the population.
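The two percentages can be put side by side in a small sketch. It assumes, purely for illustration, that the part of your genome inherited directly from an ancestor is identical to hers (ignoring new mutations), and that everything else matches her at the 99.9% background level. The function names are hypothetical.

```python
def ancestry_fraction(generations_back):
    """Expected fraction of your genome inherited from one ancestor
    a given number of generations back."""
    return 0.5 ** generations_back

def expected_similarity(generations_back, background=0.999):
    """Expected base-pair identity with that ancestor, assuming the
    inherited part is identical and the rest matches at background."""
    f = ancestry_fraction(generations_back)
    return f * 1.0 + (1.0 - f) * background

print(ancestry_fraction(3))              # 0.125 for a great-grandmother
print(round(expected_similarity(3), 6))  # 0.999125
```

The first number is ancestry; the second is similarity. Direct descent closes only 12.5 percent of the small gap that separates any two people.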

For the Neandertals, we have to separate these two kinds of similarity, sorting out the genes that we must have inherited from them, from the ones that we share because we share a more distant ancestry.

Now, suppose we don't know that this woman was your great-grandmother, that it's only a hypothesis. It's the kind of thing a forensic anthropologist might want to figure out, if your great-grandmother was Anastasia. We can answer the question in this way: Test the hypothesis that she's unrelated to you, by examining whether you are equally genetically similar to her as you are to the average, randomly chosen individual from your population.

This is a statistical test. In fact some of the people in the population share more genetic similarity with you than others, and our statistic has to account for that variation. We can put whatever level of statistical confidence on it we like. If your putative great-grandmother shares substantially more with you than all but some very small fraction of people, we may conclude that she is your relative.

We might do substantially better -- if the variation in the population doesn't work against us, we might even conclude that she is in fact a third-degree relative who contributed between 10 and 14 percent of your genome. Or even better.

Our conclusion has to depend on the structure of the population. If randomly chosen people tend to look like you, for some reason of population structure, we'll have to model that population structure directly. This is, of course, what was done in the case of the Neandertal genome -- a specific population model was significantly favored by the data, and alternatives that did not include population mixture were demonstrated to be so unlikely as to be essentially impossible.

And as I pointed out the other day, if Neandertals had not donated any genes to later populations, then the most recent common ancestors of human and Neandertal genes would all be earlier than the divergence of those populations, more than 250,000 years ago. It is the observation of chromosomal segments that are identical or very near some living human chromosomes that shows that, for some genes in some living people, the Neandertals are not different enough. We have to have some of their genes.

More X-Woman thoughts

I had a great session with my advanced students yesterday running through different evolutionary scenarios for the X-Woman. This and some later posts will follow up on my initial thoughts ("Hobbit version 2.0: the undiscovered hominin").

May I just say, "X-Woman" is one of the more dopey nicknames for an ancient piece of bone? I mean, it's better than "Twiggy", but jeesh. I can't be the only one who thinks of John Singer Sargent:


"Madame X", the once-shocking salon portrait by John Singer Sargent. Fulfilling my lifelong dream of bringing Sargent together with Neandertals.

Meanwhile, I have some great e-mails about Madame X, some of which I can share. First, an exchange on the topic of incomplete lineage sorting:

I'm confused by your suggestion of an ancient divergence among Neanderthals. Wouldn't that lead to a tree with the Siberian DNA and other Neanderthal DNA samples forming their own clade, to the exclusion of human DNA? As things stand, the Neanderthals are closer to humans than to the Siberian DNA.

Not at all; it could be either way.

Consider humans today. Africans have mtDNA lineages (the L clades) that are deeper in the human tree than any outside of Africa and basically absent elsewhere except for recent migration. But Africa also contains many of the mtDNA lineages that are present in Europe, India and West Asia.

Now imagine that the human population divides into two species, Africans and non-Africans, and those species persist for 100,000 years. Assuming no huge bottlenecks in either of these species, they both ought to retain the major clades present today. If we sample their genes at that time, 100,000 years in the future, we'll discover that Africans will be more genetically diverse than non-Africans. And the Africans will have L clades that are outgroups to the clades (M and others) that include *some* Africans and *all* non-Africans.

Subsequent population bottlenecks or selection could eliminate those ancient clades, but they will hang around unless they are eliminated. That's also the explanation for why humans and gorillas are more genetically similar at some loci than either is to chimpanzees, even though humans and chimpanzees speciated more recently. The variation in that ancestral H-C-G population was retained in the ancestral H-C population, to some extent, and lineage sorting sometimes gave humans the more gorilla-like clade.

An interesting question is whether the rest of the Neandertal sample would be so relatively invariant, if some part of their population included this quite divergent mtDNA haplotype.

It's quite hard to answer that question given the small sample of Neandertal mtDNA -- fewer than twenty individuals, sampled from a range of times. A "lopsided" tree, with a lot of similar sequences and a few divergent ones, is not an unlikely genealogy in a small sample. The variance in the lengths of the deep branches in a genealogy is intrinsically high, even in the simple Wright-Fisher model with no population structure or selection. A "lopsided" tree is just one possibility on a continuum, in which the deepest coalescence time in the sample is high relative to the next deepest -- not an unlikely event at all.

For those who would like to explore this process, I put together a Mathematica demonstration ("Coalescent Gene Genealogies") that generates random gene trees under the neutral Wright-Fisher model. Strange-looking trees are normal, in the sense that they occur often enough that they are not statistically unlikely for a single gene locus.
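For readers without Mathematica, the same point can be sketched in a few lines of Python. This is not the demonstration itself, only a minimal stand-in under the standard neutral coalescent: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, in units of 2N generations. The sample size and number of replicates are arbitrary choices for the example.

```python
import random

def coalescent_intervals(n, rng):
    """Waiting times while k = n, ..., 2 lineages remain, keyed by k."""
    return {k: rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)}

def deepest_branch_shares(n, reps, seed=1):
    """Share of total tree height taken up by the final two-lineage interval."""
    rng = random.Random(seed)
    shares = []
    for _ in range(reps):
        times = coalescent_intervals(n, rng)
        shares.append(times[2] / sum(times.values()))
    return shares

shares = deepest_branch_shares(n=20, reps=20_000)
lopsided = sum(s > 0.5 for s in shares) / len(shares)
print(f"trees where the deepest split exceeds half the height: {lopsided:.2f}")
```

In a large fraction of simulated trees, the last two lineages take longer to coalesce than all the earlier coalescences combined, which is what a "lopsided" tree looks like.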

Obviously what you'd want to do is compare multiple gene loci -- in this case, to get nuclear genomic sequence. Since the Max Planck group is actively pursuing further sequencing (and already has had some success, according to their press conference), I expect they're already making progress toward testing the neutral hypothesis.

If mtDNA proves to be unusual compared to other loci, then it's either intrinsic coalescent variability, or selection. Testing those two alternatives would require a larger sample of Neandertal mtDNA.

If, on the other hand, the nuclear genetic diversity is also substantially not shared with Neandertals (or living people), then the hypothesis of population structure in Late Pleistocene-age Eurasia would be strongly supported. It's a bit more complicated to test whether a speciation had occurred, but with whole genomes such a test can almost certainly be done.

An insertion into deep history

A couple of weeks ago I noted a new article by Chad Huff and colleagues in PNAS. It wasn't available yet when I wrote, but I've had the chance to study it now.

The paper presents a tremendously clever way of using contemporary genetics to look at different time slices in Pleistocene human evolution. If you can imagine traveling to different parts of the human genome and looking at different times in the past, that's more or less what they are doing.

We have the genomes of several people now -- the paper focuses on Venter's sequence versus the official HGP draft sequence, but there are others. A whole genome is limited in its utility to look at genetic variation, but it has some very interesting sampling properties. Much of population genetics theory is based on a simple question: what happens if you sample two individuals at random? How similar are they? What will be the distribution of genetic differences between them? How long ago did each of their genes descend from a single common ancestor? Sampling a diploid genome yields precisely the data for which these questions were designed.

Huff and colleagues dredge up a relatively obscure point of theory. Suppose you take a particular kind of rare event -- they consider mobile element insertions, including Alu and LINE insertions. Even though these elements make up a large fraction of the human genome, the events that give rise to them are rare, occurring only once in a whole genome every 20 births or more. Now, look around the genome and partition it into two kinds of regions. One kind of region will include the rare events (insertions in this case) and the area immediately flanking them. The other will include everywhere else in the genome. Now, the partitioning creates a bias. The areas that include these rare events will, on average, represent more diverse parts of the genome, with deeper genealogies. This is because the intrinsically rare event is more likely to have happened in the long time span represented by such areas than in the relatively shorter times represented by the remainder of the genome. In fact, the average depth of these areas including the insertions should be precisely double the average depth of the areas that lack them.
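The doubling is easy to check by simulation. The sketch below is not the authors' analysis. It assumes a constant-size population, so that pairwise coalescence times are exponentially distributed with mean 1 in scaled units, and a hypothetical small insertion rate, so that the chance a locus carries an insertion grows with the depth of its genealogy.

```python
import math
import random

def mean_depths(n_loci=400_000, rate=0.02, seed=7):
    """Mean pairwise coalescence time for loci with and without an insertion."""
    rng = random.Random(seed)
    with_ins, without = [], []
    for _ in range(n_loci):
        t = rng.expovariate(1.0)  # depth of this locus's genealogy
        # Chance of at least one insertion rises with depth
        hit = rng.random() < 1.0 - math.exp(-rate * t)
        (with_ins if hit else without).append(t)
    return sum(with_ins) / len(with_ins), sum(without) / len(without)

deep, shallow = mean_depths()
print(f"with insertion: {deep:.2f}, without: {shallow:.2f}, ratio: {deep / shallow:.2f}")
```

With a rare event the ratio comes out close to 2, which is the size bias the method exploits. A changing population size would alter the distribution of depths and move the ratio, and that is what gives the comparison its power.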

In other words, looking at these rare events is sort of like opening the box on Schroedinger's cat. There's something that we shouldn't be able to find out a priori -- how old is the genealogy of a part of the genome? By sifting through the genome and picking out all the parts that have these insertions, we know something about them: We know that they represent a time interval double that of the rest of the genome. Our looking at these insertions has collapsed the likelihood function that relates genetic location to age. When we look at the variation around insertions, we can then ignore some of the events that changed the population's diversity in the last couple of hundred thousand years. And by comparing these sites with the rest of the genome, we have another way to test hypotheses about whether the population was once a lot bigger or smaller than it has been over the last few hundred thousand years.

The analysis shows that the population in that early part of the genealogy -- corresponding more or less to dates over 1.2 million years ago -- was consistent with an effective population size of 18,000 individuals, give or take. As I pointed out in my earlier post, that value itself isn't surprising -- it's a bit higher than the average genome-wide. The best-fit model, including both areas near insertions and the rest of the genome, was one in which the effective population size actually declined from 18,500 to 8,500 individuals at 1.2 million years ago. They explain that the recent value should be depressed by the separation of present human populations -- Venter and the human reference sequence both being primarily derived from Europe, they undersample human variation.

Now, it's easy to see some of the limitations on the analysis. The authors considered only a two-epoch model of population history. That is to say, once upon a time the population was x individuals, then at some time t, the population became y individuals. Two epochs of population size, separated by one time. Clearly the actual history of human populations was more complicated than this, but does it matter? Recent history will not greatly influence nucleotide diversity, and in particular the insertions -- because they are intrinsically rare -- are likely to reflect much more ancient events that have survived any subsequent vicissitudes of population.
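To make the two-epoch idea concrete, here is a small sketch of the expected coalescence time for a pair of lineages under such a model. It is an illustration under standard coalescent assumptions, not the likelihood calculation from the paper. The function name is hypothetical, and the 25-year generation length used in the example is an assumption.

```python
import math

def expected_pairwise_tmrca(n_recent, n_ancient, t_change):
    """Expected coalescence time, in generations, for two lineages.

    n_recent:  diploid effective size from the present back to t_change
    n_ancient: diploid effective size earlier than t_change
    t_change:  time of the size change, in generations before present
    """
    # Probability that the pair has not yet coalesced at the time of the size change.
    p_survive = math.exp(-t_change / (2 * n_recent))
    # Integrate the survival function over the recent epoch, then the ancient one.
    return 2 * n_recent * (1 - p_survive) + 2 * n_ancient * p_survive

# Sizes from the best-fit model; 25 years per generation is an assumption.
generations = 1_200_000 / 25
print(round(expected_pairwise_tmrca(8_500, 18_500, generations)))
```

With these inputs the expectation is a little over 18,000 generations. Only about 6 percent of pairs are still uncoalesced at the size change, so the ancient epoch contributes relatively little to the genome-wide average. That is why enriching for deep genealogies adds information about the ancient size.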

But, I suspect that the distribution of insertions with relation to recent selection will make an appreciable difference to the nearby SNP diversity. The geographic distribution of variation will also make some difference, although we won't know how much until we look at non-European genomes.

Meanwhile, if I were looking to the archaeological record to identify times that made a difference to the human population, 1.2 million years ago would really not register. It certainly would not strike me as a time of substantial reduction of the human population.

The lack of any archaeological referent is typical of such studies -- after all, they're not trying to match numbers from archaeology, they're trying to establish internally consistent genetic tests of population history. But if these values are real, they must match what we know from the fossil and archaeological record. There is some text in the paper about the small effective size and its relevance to humans as a sign of repeated bottlenecks or other events. As I pointed out earlier, I think 18,000 is pretty significantly large compared to most other estimates of human effective population size. When we get an estimate of human effective size so near those of other apes, we are looking at a value consistent with habitation of a large, certainly continent-wide range by large populations. So now I have to think what the pertinent comparison from the archaeological record should be.

One archaeological comparison is of special interest to me: a real-life comparison that will be immediately relevant. This study should be giving us information about the population ancestral to Neandertals and humans. In that sense, it duplicates the information that we ought to be able to derive from the comparison of human and Neandertal genomes.

Interestingly, the effective size estimates published so far for the human-Neandertal ancestral population are much lower than the 18,500 estimated in this study. Green and colleagues (2006) made a point estimate of 3000 effective individuals at the time of Neandertal-human divergence. That estimate is likely to be supplanted by the Neandertal genome release, because the Green et al. (2006) estimate was influenced by some fraction of contaminating sequence from humans. And the error bars on that estimate are large. But there's a lot of space between them -- we're talking about at least a sixfold difference.

Something doesn't add up. The human-Neandertal ancestral population must have contained all these polymorphic insertions that supposedly occurred before 800,000 years ago. The effective size of the population may have been lower, but if so we should look for some explanation for that substantial loss of variation.

UPDATE (2010-02-10): A couple of people have asked about effective population size. Here's a helpful post that explains why a small effective size may not mean a small population size, and some of the current hypotheses that try to explain the human value.

References:

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444:330-336. doi:10.1038/nature05336

Huff CD, Xing J, Rogers AR, Witherspoon D, Jorde LB. 2010. Mobile elements reveal small population size in the ancient ancestors of Homo sapiens. Proc Nat Acad Sci USA (early online) doi:10.1073/pnas.0909000107

R. A. Fisher's model of adaptation

Chapter 2 of R. A. Fisher's Genetical Theory of Natural Selection is remarkable for many reasons. In it, he presents a model of selection in an age-structured population, the concept of reproductive value, and the Fundamental Theorem. Toward the end of the chapter, he discusses "The Nature of Adaptation," presenting a geometric model to justify the assertion that the probability of favorable genetic changes declines as the effect size of those changes increases.

Sergey Gavrilets on the two fitness landscapes

Sewall Wright's metaphor of the "fitness landscape" is fundamental in the way many biologists think about adaptation. The idea of a population "climbing" toward "adaptive peaks" is a visually compelling image for the increase in mean fitness that results from selection on many genes.

However, the correspondence between this metaphor and the mathematics of population genetics leaves several ambiguities that tend to confuse people. One of the main sources of ambiguity concerns the meaning of the spatial dimensions in the fitness landscape. Do the dimensions represent the frequencies of alleles in the population? Or do they represent particular genotypes that individuals may have? Wright used mathematics that implied both approaches in different places. For purposes of metaphorical visualization, the difference between these perspectives may not matter. But if we want to guide our thinking about the evolutionary process, it's helpful to know where real-life cases are supposed to fit.

Sergey Gavrilets' book, Fitness Landscapes and the Origin of Species, takes on this problem in chapter 2. This post comes from my notes about the book, which I read some time ago. So although I've brushed them up, many holes remain -- think of it as a synopsis of points I found worth noting. What I don't have is a thesis -- in case you're wondering why you should care.

For me (and many others), the most important aspect of Gavrilets' work is the demonstration that a "rugged" landscape does not exist if we consider a sufficiently high number of interacting genotypes. The genomes of organisms, from E. coli to humans, don't have that many genes, but the number of combinations among only 1000 biallelic genes is so large that Wright's "rugged landscape" analogy may never apply to them. Never mind our 20,000 multiallelic genes. I'll return to that issue another time, because this question of genomic searches has shaped my thinking about mutation-limited evolution and recent selection.

R. A. Fisher and Sewall Wright introduced diffusion approximation methods into genetics; Fisher (1937) was the first to consider spatial dispersal using a reaction-diffusion model. I found this quote a useful expression of his acknowledgment of the limits of the model:

The use of the analogy of physical diffusion will only be satisfactory when the distances of dispersion in a single generation are small compared with the length of the wave. In reality diffusion is a complex process, compounded often of the diffusion of gametes, and that of larvae, in addition to adult forms; a more exact treatment than that supplied by a simple coefficient would involve the interaction of these components, and the stages at which the selective advantage was enjoyed. So far as it is applicable, the analogy of physical diffusion, therefore, greatly simplifies the problem (355-356).

The paper has no references.

A new printing of a classic population genetics text has been issued this year: An Introduction to Population Genetics Theory, by James Crow and Motoo Kimura.

I discovered it by accident on Amazon last week, and ordered my copy right away. Now with it safely in hand, I can tell the world!

Crow and Kimura's telling starts with demography, mirroring Fisher's (1930) presentation but with more clarity of description. From the demographic background of genetic change, they are able to pursue genetic drift and selection as stochastic and deterministic realizations of similar processes.

The fact is, not much has changed since the book's first publication in 1970. I think you could teach a great seminar using Crow and Kimura by itself. But if you need a more up-to-date mathematical presentation, I highly recommend Mathematical Population Genetics, by Warren Ewens. The books bear a closer comparison; where Crow and Kimura built their presentation from a demographic perspective, Ewens begins with quantitative genetics, relating the Wright-Fisher population model to phenotypes.

Molecular systematics and species trees

I'd like to point readers to a recent essay in Evolution, by Scott V. Edwards, titled, "Is a new and general theory of molecular systematics emerging?"

Edwards covers some of the recent progress and problems encountered when using molecular evidence to test phylogenetic hypotheses. A sampling of the issues: How do we combine information from different sets of molecular data? Can we just compile sequences from many gene loci together into one analysis ("concatenation"), or do we need to make allowances for genealogical diversity among loci? How do prior assumptions affect the outcomes of analyses, like the presence or absence of polytomies (branching points where three or more species emerge simultaneously)?

I try to think of things that students should read as they get up to speed with evolutionary genetics. Edwards' essay raises many important points, and as I read through it, I reflected on the ways that paleoanthropologists increasingly need to be aware of the inner workings of molecular studies of phylogeny.

If we're interested in the phylogeny of species, we need to know how the "tree" of relationships of species may be manifested in the genealogical relationships among genes. Discordances between genes result from the fact that gene trees are not species trees. Species are genetically variable, and the living descendants of an ancient species may have inherited different parts of the variation of ancient species. Depending on the demography of that ancient population, gene trees representing the evolution of two distinct genetic loci may have different topological properties.

From Edwards:

John Avise encapsulated the relationship between gene and species trees well in 1994: “Gene trees and species trees are equally “real” phenomena, merely reflecting different aspects of the same phylogenetic process. Thus, occasional discrepancies between the two need not be viewed with consternation as sources of “error” in phylogeny estimation. When a species tree is of primary interest, gene trees can assist in understanding the population demographies underlying the speciation process” (pp. 133 and 138 in Avise 1994). This essay is in part meant to reemphasize Avise' perspective and to remind readers that species trees are in fact the “primary interest” of systematics.

Genealogies involve some unknown parameters. Applying the fossil and archaeological record may let us constrain those parameters, just as applying molecular biology and pedigree comparisons may let us constrain the parameters describing the mutational process.

To my mind, this is where paleoanthropologists need to be most attentive: Molecular methods are not in conflict with fossil approaches, they implicitly depend upon them. Yet, communication between the two fields rarely involves actual numbers, so a frequent occurrence is that a "bottleneck" in paleoanthropology with a 10 percent reduction in population becomes a "bottleneck" in genetics with a 1000-fold reduction in population.

Testing of demographic hypotheses moved on to genome-wide polymorphism data several years ago. The logical equivalent for species divergences is lineage sorting -- a model that's been applied since the mid-1990s. The hominoids are extremely well studied from the standpoint of molecular systematics, and remain the central example in most theoretical papers incorporating multiple loci. This year I have noticed several interesting implementations of whole-genome polymorphism comparisons among species embedded in phylogenetic trees. The higher mutation rate of CpG sites has long been known, but we now know that a 50-bp or longer flanking region may influence local mutation rate. As we move from genes to gene networks, our comparisons will not be of the same nucleotide, but of classes of mutations across classes of genes.

This is another of those cases where the future lies in better algorithms. Edwards seems a man after my own heart -- the computer programs lend a superficial veneer of rigor, when the underlying assumptions are in need of challenge:

Producing phylogenies directly from gene sequences essentially in one step, without additional transformations, is now the dominant mode of phylogenetic analysis and indeed it has advanced the field enormously. Nonetheless, I suggest that the very success of this paradigm and the ease with which phylogenies could be produced directly from DNA matrices led to a comfort zone in phylogenetics. If we can imagine systematic methods themselves as a likelihood surface, I suggest that the current paradigm is a local optimum in that surface, an optimum that is useful but ultimately incomplete in so far as it has failed to model the potential for gene tree/species tree discordance even cursorily (Fig. 3) (Edwards 2009:6).

His theme is an old one -- how do we use "total evidence" methods in phylogenetics? Variance among loci gives the problem a newish twist, one that may add information that other techniques have left on the table. But we have to wring it out of the data.

References:

Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1-19. doi:10.1111/j.1558-5646.2008.00549.x

An (old) interview with Warren Ewens

I ran across an interview between Anya Plutynski and population geneticist Warren Ewens.

I cannot say enough about Ewens' book, Mathematical Population Genetics. If you can work through it, you can do population genetics. It doesn't cover every au courant topic, but those will change next week anyway. And it's on Kindle now. Which I suppose probably looks pretty good on the DX, assuming the math displays well -- the book's format is just the right size for it.

Anyway, this interview from 2004 was probably conducted around the time the book was released. It covers pretty much the gamut of his career. I have to select some part to quote for you, so I'll select the passage that would be most likely to come out of my own mouth in my genetics class:

WE: Of course there is a strong possibility that the neutral theory is assumed not because it is appropriate but because the math of that theory is so very simple compared to the math applying for any selective theory.

AP: Can I follow that up? Do you think that that has lead to models of phylogenetic change that is not very well supported by the evidence?

WE: I think that that is quite possible. However, here we enter into another question. In mathematical population genetics theory you know from the very start that you are making big simplifying assumptions. You are in a very different position from a physicist, who might believe that his mathematical models describe reality exactly. No sensible population geneticist would make any claim along those lines. He or she is forced to simplify, because reality is so complicated that you don’t know it in any detail, and even if you did know it and used math describing it faithfully, the analysis would be impossible to carry through. So simplification is unavoidable. I do not know whether the use of the neutral theory is too much of a simplification and has lead us to incorrect and distorted views about the true evolutionary tree, it’s shape and dimensions, but I suspect that there has been quite a significant distortion.

There is much more at the link, some history of association testing, genetic draft, a lot on Ewens sampling theory, and a touch about his work here in Madison.

People often complain that R. A. Fisher wrote in a hard-to-read style, unnecessarily verbose and indirect. Either I don't tend to mind, or I find that the style makes me read with greater care. In either case, there are select passages from his writings that stand out as very clear to me. His description of epistasis and dominance as deviations from additivity, in his famous 1918 paper (p. 404), is one of them:

The steps from recessive to heterozygote and from heterozygote to dominant are genetically identical, and may change from one to the other in passing from father to son. Somatically the steps are of different importance, and the soma to some extent disguises the true genetic nature. There is in dominance a certain latency. We may say that the somatic effects of identical genetic changes are not additive, and for this reason the genetic similarity of relations is partly obscured in the statistical aggregate. A similar deviation from the addition of superimposed effects may occur between different Mendelian factors. We may use the term Epistacy to describe such deviation, which although potentially more complicated, has similar statistical effects to dominance. If the two sexes are considered as Mendelian alternatives, the fact that other Mendelian factors affect them to different extents may be regarded as an example of epistacy.

The terms we use today are familiar by use. A biologist doesn't necessarily consider how idiosyncratic the genetic use of the term "additive" is. When I read a passage like this, it brings to mind a long-ago time when the select group of people using a term all had read the same papers. I wonder how many geneticists still read Fisher during their training. I can tell you this: the bound volume of the Transactions of the Royal Society of Edinburgh in our library didn't look like it had been picked up for 30 years. I mean, serious dust on the cover.

I wrote last month about how Fisher invented "variance", and noted the very useful property that the variance is a sum of contributions from different causes. It seems remarkable that Fisher could arrive at a statistical framework for identifying the interactions of multiple genes on a trait, at a time when only a relative handful of "Mendelian factors" had yet been found.

Now that we are able to find Mendelian factors in whole-genome association studies, it's remarkable that Fisher's framework is so often forgotten!

References:

Fisher RA. 1918. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinburgh 52:399-433.

Phenotypic variance

I've intermittently been reading through William Provine's The Origins of Theoretical Population Genetics. It's related to a project simmering on my back burner.

Meanwhile, last week I was talking with some students about the recent papers at the AAPA meetings about natural selection as assessed by quantitative traits. The students thought that some of these papers had omitted some basic details that seemed obvious from the point of view of quantitative genetics. Also, George Armelagos had mentioned Raymond Pearl, so I figured as long as I'm reading about Pearl, William Castle, R. A. Fisher and their attitudes toward quantitative genetics, I might as well note a few passages from Provine's account.

Provine:

Fisher's express purpose in the paper was to interpret the well-established results of biometry in terms of Mendelian inheritance by ascertaining the biometrical properties of a Mendelian population.

I'll just pause to note that Fisher's formulation begins almost all textbooks in quantitative genetics and many in population genetics. The model that relates quantitative variation and genotypic variation is essential to all genetic analysis.

In particular, he wanted to show that Pearson was mistaken in concluding that the correlations between relatives in man contradicted the Mendelian scheme of inheritance. He began by defining a measure of the variability of a character in a population.

This is an essential step for any introduction to genetics also. I spend some time in all my courses talking about the relationship between genetic and phenotypic variation, using the measures of each as ways to talk about the ways they differ. We can analogize genetic variation to a digital readout -- you have a genotype, or a set of genotypes, and the population's variation has to do with the frequencies of those genotypes or the alleles that comprise them. So the variation is something that emerges from counting genes. You have heterozygosity (expected frequency of heterozygous genotypes), or number of alleles. At the sequence level, you count both alleles and the number of mutations that separate them -- average pairwise difference, number of segregating sites.
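The counting measures in that paragraph are simple enough to sketch in a few lines. This is a toy illustration with a made-up sample of four aligned sequences, and the function names are my own, not from any package.

```python
from collections import Counter
from itertools import combinations

def expected_heterozygosity(alleles):
    """Expected frequency of heterozygotes under random mating: 1 - sum of p_i squared."""
    n = len(alleles)
    return 1.0 - sum((count / n) ** 2 for count in Counter(alleles).values())

def segregating_sites(seqs):
    """Number of aligned positions at which the sample is variable."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_difference(seqs):
    """Average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)

# Hypothetical sample: four aligned sequences, each treated as one allele.
sample = ["ACGTAC", "ACGTAT", "ACCTAC", "ACGTAC"]

print(len(set(sample)))                  # number of distinct alleles
print(expected_heterozygosity(sample))
print(segregating_sites(sample))
print(mean_pairwise_difference(sample))
```

For this sample there are 3 distinct alleles, an expected heterozygosity of 0.625, 2 segregating sites, and a mean pairwise difference of 1.0. Every one of these numbers comes from counting, which is the sense in which genetic variation is like a digital readout.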

Back to Provine:

Often the standard deviation σ was used for this purpose. But Fisher noted that

Now Provine gives a direct quote from Fisher 1918:

when there are two independent sources of variability capable of producing in an otherwise uniform population distributions with standard deviations σ₁ and σ₂, it is found that the distribution, when both causes act together, has a standard deviation √(σ₁² + σ₂²). It is therefore desirable in analysing the causes of variability to deal with the square of the standard deviation as the measure of variability. We shall term this quantity the Variance of the normal population to which it refers, and we may now ascribe to the constituent causes fractions or percentages of the total variance which they together produce (Fisher 1918:399).
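Fisher's point is easy to verify by simulation: for independent causes, standard deviations combine as the square root of the sum of squares, while variances simply add. Here is a quick sketch with two made-up sources of variation, labeled "genetic" and "environment" purely for illustration.

```python
import random
import statistics

rng = random.Random(42)
n = 100_000

# Two independent, hypothetical sources of variation.
genetic = [rng.gauss(0.0, 3.0) for _ in range(n)]      # sd 3, variance 9
environment = [rng.gauss(0.0, 4.0) for _ in range(n)]  # sd 4, variance 16
phenotype = [g + e for g, e in zip(genetic, environment)]

var_g = statistics.pvariance(genetic)
var_e = statistics.pvariance(environment)
var_p = statistics.pvariance(phenotype)
sd_p = var_p ** 0.5

print(round(var_g + var_e, 2), round(var_p, 2))  # the two values nearly agree
print(round(sd_p, 2))                            # near 5, not 3 + 4 = 7
```

The combined variance comes out near 25, and the share due to each cause (about 36 and 64 percent here) can be read off directly. Standard deviations do not divide up that way.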

I have always thought that this was a work of magic by Fisher. The additive quality of variance is such a useful characteristic for a measure of variation, it's hard to imagine using anything else. Fisher continues:

For stature the coefficient of correlation between brothers is about .54, which we may interpret by saying that 54 per cent of their variance is accounted for by ancestry alone, and that 46 per cent must have some other explanation.

It is not sufficient to ascribe this last residue to the effects of environment. Numerous investigations by Galton and Pearson have shown that all measurable environment has much less effect on such measurements as stature. Further, the facts collected by Galton respecting identical twins show that in this case, where the essential nature is the same, the variance is far less. The simplest hypothesis, and the one which we shall examine, is that such features as stature are determined by a large number of Mendelian factors, and that the large variance among children of the same parents is due to the segregation of those factors in respect to which the parents are heterozygous. Upon this hypothesis we will attempt to determine how much more of the variance, in different measurable features, beyond that which is indicated by the fraternal correlation, is due to innate and heritable factors (Fisher 1918:400).

And that, in a nutshell, is why the correlation between relatives is not a measure of heritability. Fisher attempted to show that the segregation of Mendelian factors could account for a large fraction of the variance of stature, and substantially succeeded in showing that the environment had much less impact than had been assumed from the correlation between relatives.

Provine's discussion continues along a different line, but he includes the characteristic line:

Fisher's 1918 paper was well received by the few geneticists who could understand his mathematics (147).

Could genetic drift really break your heart?

Are these people crazy?

The combination of such a large risk with such a high frequency is, fortunately, unique. "How can such a harmful mutation be so common?" asks Chris Tyler-Smith from The Wellcome Trust Sanger Institute, Hinxton, UK. "We might expect such a deleterious change to have 'died out'.

"We think that the mutation arose around 30,000 years ago in India, and has been able to spread because its effects usually develop only after people have had their children. A case of chance genetic drift: simply terribly bad luck for the carriers."

This is a 25-bp deletion in a muscle protein gene, MYBPC3. The current allele frequency in India is estimated to be 4 percent; it is estimated to be carried by 60 million people. The paper suggests that it originated 30,000 years ago. Carriers of the deletion have a massive increase in their chance of cardiomyopathy.

Here's the relevant passage from the paper:

The presence of a disease-associated variant at substantial frequency raises an evolutionary question: if it is disadvantageous, how did it become so common? In principle, it could be evolutionarily neutral, manifesting its disadvantages only late in life; alternatively, its disadvantages could be outweighed by advantages early in life, or in a different environment, so that it could have been positively selected. To address this question, we examined the haplotype structure surrounding the deletion. Using five short tandem repeat (STR) markers, spanning ca. 3.4 Mb surrounding the deletion in 287 heterozygous individuals, we found similar high degrees of variation in the inferred haplotypes from chromosomes with and without the deletion (Supplementary Fig. 7 and Supplementary Table 6 online). We then used allele-specific amplification to resequence ca. 10-kb haplotypes centered on the 25-bp deletion from nine heterozygous individuals (Supplementary Tables 7 and 8 online). The chromosomes carrying the 25-bp deletion showed five closely related haplotypes (Supplementary Fig. 8 online). After excluding variants likely to have arisen by recombination, we estimated a time to most recent common ancestry (TMRCA) of ca. 33 ± 23 thousand years for the deletion haplotypes (Supplementary Methods). This time slightly postdates the initial peopling of the subcontinent 30,000–50,000 years ago and together with its restricted geographical distribution suggests that the deletion did not arrive with the first modern human settlers from Africa [more than] 50,000 years ago, but arose subsequently within the subcontinent. Its occurrence in two populations from Southeast Asia can be explained by recent gene flow from India (Supplementary Note online). Collectively, these observations provide no evidence for rapid spread of a recent founder haplotype or any departure from neutral evolution (Dhandapany et al. 2009:4).

The issue is not really whether a gene could go from 1 copy to 4 percent in 1200 generations by chance. That wouldn't be so terribly unlikely in Pleistocene humans -- in fact, the mean time for a mutation to go from 1 copy to 4 percent by drift in a population of effective size 10,000 individuals is not 30,000 years, but only around 20,000 years. On the other hand, mtDNA variation today suggests that South Asia experienced early and rapid population growth -- so we're not likely talking about a population of 10,000, but more like a minimum of 100,000 effective individuals through the past 30,000 years at least. It would take genetic drift at least 10 times longer to accomplish the requisite frequency change given that demographic history. Still, a single allele at a single gene locus might be exceptional.

But that scenario, however unlikely, is simply not the situation we have here. Here we have a deletion that must have some disadvantage, because it gives people a fatal disease. This disadvantage is apparently dominant in effect, based on the case-control study. Yet the deletion has managed to persist within the large South Asian populations of the last 10,000 years so that today it is still around 4 percent.

People mainly die of cardiac problems after age 40. But human reproductive lives aren't over until they're done investing in their children. Further, a weakened heart may reduce work potential or health even if it kills slowly. The fitness cost of this deletion is smaller than if it gave people a chance at a fatal disease when they are 17, but a smaller fitness cost is still a fitness cost. In a large population, that small fitness cost is going to whittle away the frequency of the allele over time.

A thousand generations is a lot of potential whittling. Using some quick calculations, it looks like selection against the deletion as low as 0.001 to 0.0015 in heterozygotes should have been enough to cut the frequency down to around 1 percent, from an initial value of 4 percent. So even if drift increased the deletion early after its origin, it ought to be much rarer today. Meanwhile, drift looks even more unlikely, since the chances of a mutation growing from 1 copy to 4 percent against such selection are nil.
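The quick calculation can be reproduced with a deterministic recursion. The sketch below is an assumption-laden illustration rather than a fitted model: it assumes an infinite, randomly mating population, a fitness of 1 - s in heterozygotes, and 1 - 2s in the rare deletion homozygotes. The homozygote value is my assumption, and it barely matters at these frequencies.

```python
def next_frequency(q, s):
    """Deletion frequency after one generation of selection.

    Assumed fitnesses: 1 (no deletion), 1 - s (heterozygote), 1 - 2s (homozygote).
    """
    p = 1.0 - q
    mean_fitness = p * p + 2 * p * q * (1 - s) + q * q * (1 - 2 * s)
    # Deletion alleles come from half of the surviving heterozygotes
    # and all of the surviving homozygotes.
    return (p * q * (1 - s) + q * q * (1 - 2 * s)) / mean_fitness

def frequency_after(q0, s, generations):
    """Iterate the recursion from starting frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s)
    return q

for s in (0.001, 0.0015):
    print(s, round(frequency_after(0.04, s, 1000), 4))
```

Starting from 4 percent, a thousand generations of selection at s = 0.001 leave the deletion at about 1.5 percent, and s = 0.0015 leaves it at about 0.9 percent. For a rare allele the decline is close to exponential, roughly q0 times exp(-s t).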

Did this deletion have a fitness cost as high as one in a thousand? It increases cardiomyopathy by 5-fold or more compared to the wild type. So it seems very plausible. But really, we don't have any good estimates of the fitness costs of chronic diseases in pre-industrial populations.

If the deletion was favored by some selection, that would probably be antagonistic, that is, acting against the fitness cost of the deletion late in life. The authors briefly investigated this hypothesis, as described above. They found no evidence for a recent expansion of a single haplotype around the deletion. That means that if there was strong selection favoring this deletion, it must have happened early after its origin and then petered out. If the expansion had been late in South Asian history, it would show more LD around it, and most of the deletion-carrying chromosomes would share a single long-range haplotype. So this deletion has not been increasing rapidly in the past few thousand years.

I would hypothesize that the disadvantages of the deletion have actually increased over time. The average lifespan increased into the Upper Paleolithic and probably later as well. Meanwhile, as the population grew, larger completed family sizes became more important to fitness. As people became more sedentary, the accumulation and inheritance of possessions and land became an important means of investing in children. The increasing importance of later survival and investment in children should have raised the fitness cost of chronic disease. That would explain a pattern of evolution in which this deletion increased in frequency early in its history, but later remained static or declined.

So, I don't suppose I can say people are crazy for thinking genetic drift could explain this deletion's current high frequency. But considering the powerful effect of weak selection over the many generations involved here, and the very large size of the South Asian population during most of that time, genetic drift seems pretty unlikely.

References:

Dhandapany PS and 23 others. 2009. A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia. Nat Genet (online early) doi:10.1038/ng.309

Cultural impedance, demographic growth, effective population size

This is a complicated story with many interlocking parts. Telling the whole story may well take me fifty posts. There's a lot of new science hiding in here waiting to get out.

I'm starting now because of the new paper by Luke Premo and Jean-Jacques Hublin, titled "Culture, population structure, and low genetic diversity in Pleistocene hominins." This paper is not the final word on its topic, nor is it the first word. But it is very much worth reading.

It makes an excellent point of departure to explain what we know and don't know about the genetics of prehistoric humans. Premo and Hublin propose an interesting model with interaction between culture and natural selection, as an explanation for a 35-year-old problem in human evolution: Our low level of genetic variation.

Their model may be right. I certainly think there's a kernel of truth in it, shared with a number of other models, as I'll describe below. And it's testable -- a project to which we'll be returning in the next few months.
