population genetics

An insertion into deep history

A couple of weeks ago I noted a new article by Chad Huff and colleagues in PNAS. It wasn't available yet when I wrote, but I've had the chance to study it now.

The paper presents a tremendously clever way of using contemporary genetics to look at different time slices in Pleistocene human evolution. If you can imagine traveling to different parts of the human genome and looking at different times in the past, that's more or less what they are doing.

We have the genomes of several people now -- the paper focuses on Venter's sequence versus the official HGP draft sequence, but there are others. A whole genome is limited in its utility to look at genetic variation, but it has some very interesting sampling properties. Much of population genetics theory is based on a simple question: what happens if you sample two individuals at random? How similar are they? What will be the distribution of genetic differences between them? How long ago did each of their genes descend from a single common ancestor? Sampling a diploid genome yields precisely the data for which these questions were designed.

Huff and colleagues dredge up a relatively obscure point of theory. Suppose you take a particular kind of rare event -- they consider mobile element insertions, including Alu and LINE insertions. Even though these elements make up a large fraction of the human genome, the events that give rise to them are rare, occurring only once in a whole genome every 20 births or more. Now, look around the genome and partition it into two kinds of regions. One kind of region will include the rare events (insertions in this case) and the area immediately flanking them. The other will include everywhere else in the genome. Now, the partitioning creates a bias. The areas that include these rare events will, on average, represent more diverse parts of the genome, with deeper genealogies. This is because the intrinsically rare event is more likely to have happened in the long time span represented by such areas than in the relatively shorter times represented by the remainder of the genome. In fact, the average depth of these areas including the insertions should be precisely double the average depth of the areas that lack them.
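The doubling is a consequence of length-biased sampling, and a small simulation can make it concrete. The sketch below is not the method used by Huff and colleagues. It assumes a constant-size population in which the depth of each region's genealogy is exponentially distributed (scaled to a mean of 1), and a hypothetical insertion rate `mu` that is small enough for insertions to stay rare.

```python
import math
import random

def mean_depth_by_insertion(n_regions=200_000, mu=0.01, seed=1):
    """Compare mean genealogy depth for regions with and without an insertion.

    Illustrative assumptions: depth T ~ Exponential(mean 1); insertions arrive
    as a Poisson process with rate mu per unit of depth.
    """
    rng = random.Random(seed)
    with_ins, without_ins = [], []
    for _ in range(n_regions):
        t = rng.expovariate(1.0)               # depth of this region's genealogy
        p_insertion = 1.0 - math.exp(-mu * t)  # chance of at least one insertion
        (with_ins if rng.random() < p_insertion else without_ins).append(t)
    return sum(with_ins) / len(with_ins), sum(without_ins) / len(without_ins)

deep, shallow = mean_depth_by_insertion()
ratio = deep / shallow
print(f"mean depth with insertion:    {deep:.3f}")
print(f"mean depth without insertion: {shallow:.3f}")
print(f"ratio: {ratio:.2f}")  # should land close to 2 under these assumptions
```

Under a different demographic history the depths would not be exponential and the ratio would depart from 2, which is what gives the comparison its power to test hypotheses about ancient population size.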

In other words, looking at these rare events is sort of like opening the box on Schroedinger's cat. There's something that we shouldn't be able to find out a priori -- how old is the genealogy of a part of the genome? By sifting through the genome and picking out all the parts that have these insertions, we know something about them: We know that they represent a time interval double that of the rest of the genome. Our looking at these insertions has collapsed the likelihood function that relates genetic location to age. When we look at the variation around insertions, we can then ignore some of the events that changed the population's diversity in the last couple of hundred thousand years. And by comparing these sites with the rest of the genome, we have another way to test hypotheses about whether the population was once a lot bigger or smaller than it has been over the last few hundred thousand years.

The analysis shows that the population in that early part of the genealogy -- corresponding more or less to dates over 1.2 million years ago -- was consistent with an effective population size of 18,000 individuals, give or take. As I pointed out in my earlier post, that value itself isn't surprising -- it's a bit higher than the genome-wide average. The best-fit model, including both areas near insertions and the rest of the genome, was one in which the effective population size declined from 18,500 to 8,500 individuals at 1.2 million years ago. The authors explain that the recent value should be depressed by the separation of present human populations -- because Venter and the human reference sequence are both primarily derived from Europe, they undersample human variation.

Now, it's easy to see some of the limitations of the analysis. The authors considered only a two-epoch model of population history. That is to say, once upon a time the population was x individuals, then at some time t, the population became y individuals. Two epochs of population size, separated by one time. Clearly the actual history of human populations was more complicated than this, but does it matter? Recent history will not greatly influence nucleotide diversity, and in particular the insertions -- because they are intrinsically rare -- are likely to reflect much more ancient events that have survived any subsequent vicissitudes of population.

But, I suspect that the distribution of insertions with relation to recent selection will make an appreciable difference to the nearby SNP diversity. The geographic distribution of variation will also make some difference, although we won't know how much until we look at non-European genomes.

Meanwhile, if I were looking to the archaeological record to identify times that made a difference to the human population, 1.2 million years ago would really not register. It certainly would not strike me as a time of substantial reduction of the human population.

The lack of any archaeological referent is typical of such studies -- after all, they're not trying to match numbers from archaeology, they're trying to establish internally consistent genetic tests of population history. But if these values are real, they must match what we know from the fossil and archaeological record. There is some text in the paper about the small effective size and its relevance to humans as a sign of repeated bottlenecks or other events. As I pointed out earlier, I think 18,000 is pretty significantly large compared to most other estimates of human effective population size. When we get an estimate of human effective size so near those of other apes, we are looking at a value consistent with habitation of a large, certainly continent-wide range by large populations. So now I have to think what the pertinent comparison from the archaeological record should be.

One archaeological comparison is of special interest to me: a real-life comparison that will be immediately relevant. This study should be giving us information about the population ancestral to Neandertals and humans. In that sense, it duplicates the information that we ought to be able to derive from the comparison of human and Neandertal genomes.

Interestingly, the effective size estimates published so far for the human-Neandertal ancestral population are much lower than the 18,500 estimated in this study. Green and colleagues (2006) made a point estimate of 3000 effective individuals at the time of Neandertal-human divergence. That estimate is likely to be supplanted by the Neandertal genome release, because the Green et al. (2006) estimate was influenced by some fraction of contaminating sequence from humans. And the error bars on that estimate are large. But there's a lot of space between them -- we're talking about at least a sixfold difference.

Something doesn't add up. The human-Neandertal ancestral population must have contained all these polymorphic insertions that supposedly occurred before 800,000 years ago. The effective size of the population may have been lower, but if so we should look for some explanation for that substantial loss of variation.

UPDATE (2010-02-10): A couple of people have asked about effective population size. Here's a helpful post that explains why a small effective size may not mean a small population size, and some of the current hypotheses that try to explain the human value.

References:

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444:330-336. doi:10.1038/nature05336

Huff CD, Xing J, Rogers AR, Witherspoon D, Jorde LB. 2010. Mobile elements reveal small population size in the ancient ancestors of Homo sapiens. Proc Nat Acad Sci USA (early online) doi:10.1073/pnas.0909000107

R. A. Fisher's model of adaptation

Chapter 2 of R. A. Fisher's Genetical Theory of Natural Selection is remarkable for many reasons. In it, he presents a model of selection in an age-structured population, the concept of reproductive value, and the Fundamental Theorem. Toward the end of the chapter, he discusses "The Nature of Adaptation," presenting a geometric model to justify the assertion that the probability of favorable genetic changes declines as the effect size of those changes increases.
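A quick Monte Carlo sketch can illustrate the geometric argument. It assumes a population whose phenotype sits at distance `z` from an optimum in an `n`-dimensional trait space, and a change of size `r` in a uniformly random direction; the change counts as favorable if it ends up closer to the optimum. The parameter values are hypothetical, and the comparison column uses the large-`n` approximation commonly attributed to Fisher, 1 - Φ(r√n / 2z).

```python
import math
import random
from statistics import NormalDist

def prob_favorable(r, n=50, z=1.0, trials=5000, seed=1):
    """Monte Carlo estimate of the chance that a random change of size r is favorable."""
    rng = random.Random(seed)
    start = [z] + [0.0] * (n - 1)          # current phenotype; optimum is the origin
    favorable = 0
    for _ in range(trials):
        step = [rng.gauss(0.0, 1.0) for _ in range(n)]   # random direction
        norm = math.sqrt(sum(v * v for v in step))
        new_dist_sq = sum((s + r * v / norm) ** 2 for s, v in zip(start, step))
        favorable += new_dist_sq < z * z   # closer to the optimum than before
    return favorable / trials

def fisher_approx(r, n=50, z=1.0):
    """Large-n approximation: 1 - Phi(r * sqrt(n) / (2 z))."""
    return 1.0 - NormalDist().cdf(r * math.sqrt(n) / (2.0 * z))

sizes = [0.01, 0.1, 0.3, 0.6]
simulated = [prob_favorable(r) for r in sizes]
for r, p in zip(sizes, simulated):
    print(f"r = {r:<4}  simulated = {p:.3f}  approximation = {fisher_approx(r):.3f}")
```

Very small changes are favorable nearly half the time, and the chance falls away as the changes get larger, which is the pattern Fisher's assertion describes.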

Sergey Gavrilets on the two fitness landscapes

Sewall Wright's metaphor of the "fitness landscape" is fundamental in the way many biologists think about adaptation. The idea of a population "climbing" toward "adaptive peaks" is a visually compelling image for the increase in mean fitness that results from selection on many genes.

However, the correspondence between this metaphor and the mathematics of population genetics leaves several ambiguities that tend to confuse people. One of the main sources of ambiguity concerns the meaning of the spatial dimensions in the fitness landscape. Do the dimensions represent the frequencies of alleles in the population? Or do they represent particular genotypes that individuals may have? Wright used mathematics that implied both approaches in different places. For purposes of metaphorical visualization, the difference between these perspectives may not matter. But if we want to guide our thinking about the evolutionary process, it's helpful to know where real-life cases are supposed to fit.

Sergey Gavrilets' book, Fitness Landscapes and the Origin of Species takes on this problem in chapter 2. This post comes from my notes about the book, which I read some time ago. So although I've brushed them up, many holes remain -- think of it as a synopsis of points I found worth noting. What I don't have is a thesis -- in case you're wondering why you should care.

For me (and many others), the most important aspect of Gavrilets' work is the demonstration that a "rugged" landscape does not exist if we consider a sufficiently high number of interacting genotypes. The genomes of organisms, from E. coli to humans, don't have that many genes, but the number of combinations among only 1000 biallelic genes is so large that Wright's "rugged landscape" analogy may never apply to them. Never mind our 20,000 multiallelic genes. I'll return to that issue another time, because this question of genomic searches has shaped my thinking about mutation-limited evolution and recent selection.
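To get a feel for the size of that number, here is a bit of arithmetic. It assumes two alleles per gene, and counts three genotypes per gene for diploids while ignoring phase; both are simplifications of mine, not Gavrilets' calculation.

```python
n_loci = 1000  # number of biallelic genes, as in the text

haploid_combinations = 2 ** n_loci   # two alleles per gene
diploid_genotypes = 3 ** n_loci      # three genotypes per gene, ignoring phase (assumption)

print(f"2^{n_loci} has {len(str(haploid_combinations))} digits")
print(f"3^{n_loci} has {len(str(diploid_genotypes))} digits")
```

Either count is a number hundreds of digits long, so no real population could ever sample more than a vanishing fraction of the possible combinations.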

R. A. Fisher and Sewall Wright introduced diffusion approximation methods into genetics; Fisher (1937) was the first to consider spatial dispersal using a reaction-diffusion model. I found this quote a useful expression of his acknowledgment of the limits of the model:

The use of the analogy of physical diffusion will only be satisfactory when the distances of dispersion in a single generation are small compared with the length of the wave. In reality diffusion is a complex process, compounded often of the diffusion of gametes, and that of larvae, in addition to adult forms; a more exact treatment than that supplied by a simple coefficient would involve the interaction of these components, and the stages at which the selective advantage was enjoyed. So far as it is applicable, the analogy of physical diffusion, therefore, greatly simplifies the problem (355-356).

The paper has no references.

A new printing of a classic population genetics text has been issued this year: An Introduction to Population Genetics Theory, by James Crow and Motoo Kimura.

I discovered it by accident on Amazon last week, and ordered my copy right away. Now with it safely in hand, I can tell the world!

Crow and Kimura's telling starts with demography, mirroring Fisher's (1930) presentation but with more clarity of description. From the demographic background of genetic change, they are able to pursue genetic drift and selection as stochastic and deterministic realizations of similar processes.

The fact is, not much has changed since the book's first publication in 1970. I think you could teach a great seminar using Crow and Kimura by itself. But if you need a more up-to-date mathematical presentation, I highly recommend Mathematical Population Genetics, by Warren Ewens. The books bear a closer comparison; where Crow and Kimura built their presentation from a demographic perspective, Ewens begins with quantitative genetics, relating the Wright-Fisher population model to phenotypes.

Molecular systematics and species trees

I'd like to point readers to a recent essay in Evolution, by Scott V. Edwards, titled, "Is a new and general theory of molecular systematics emerging?"

Edwards covers some of the recent progress and problems encountered when using molecular evidence to test phylogenetic hypotheses. A sampling of the issues: How do we combine information from different sets of molecular data? Can we just compile sequences from many gene loci together into one analysis ("concatenation"), or do we need to make allowances for genealogical diversity among loci? How do prior assumptions affect the outcomes of analyses, like the presence or absence of polytomies (branching points where three or more species emerge simultaneously)?

I try to think of things that students should read as they get up to speed with evolutionary genetics. Edwards' essay raises many important points, and as I read through it, I reflected on the ways that paleoanthropologists increasingly need to be aware of the inner workings of molecular studies of phylogeny.

If we're interested in the phylogeny of species, we need to know how the "tree" of relationships of species may be manifested in the genealogical relationships among genes. Discordances between genes result from the fact that gene trees are not species trees. Species are genetically variable, and the living descendants of an ancient species may have inherited different parts of the variation of ancient species. Depending on the demography of that ancient population, gene trees representing the evolution of two distinct genetic loci may have different topological properties.
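The dependence on ancestral demography can be made concrete for the simplest case of three species. The sketch below uses the standard neutral coalescent result that a gene tree disagrees with the species tree with probability (2/3)·exp(-T / 2Ne), where T is the length of the internal branch in generations and Ne is the effective size of the ancestral population along that branch. The function name and the numbers are hypothetical, chosen only to show the trend.

```python
import math

def discordance_probability(internal_branch_generations, ancestral_ne):
    """Chance that a gene tree disagrees with a three-species tree.

    Standard neutral coalescent result: 2/3 * exp(-T / (2 Ne)), with T the
    internal branch length in generations and Ne the ancestral effective size.
    """
    t = internal_branch_generations / (2.0 * ancestral_ne)  # coalescent units
    return (2.0 / 3.0) * math.exp(-t)

# Hypothetical values chosen only to show the trend.
for ne in (10_000, 50_000, 100_000):
    p = discordance_probability(internal_branch_generations=100_000, ancestral_ne=ne)
    print(f"Ne = {ne:>7}: P(discordant gene tree) = {p:.3f}")
```

A larger ancestral population, or a shorter interval between speciation events, leaves more loci whose genealogies contradict the species tree.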

From Edwards:

John Avise encapsulated the relationship between gene and species trees well in 1994: “Gene trees and species trees are equally “real” phenomena, merely reflecting different aspects of the same phylogenetic process. Thus, occasional discrepancies between the two need not be viewed with consternation as sources of “error” in phylogeny estimation. When a species tree is of primary interest, gene trees can assist in understanding the population demographies underlying the speciation process” (pp. 133 and 138 in Avise 1994). This essay is in part meant to reemphasize Avise' perspective and to remind readers that species trees are in fact the “primary interest” of systematics.

Genealogies involve some unknown parameters. Applying the fossil and archaeological record may let us constrain those parameters, just as applying molecular biology and pedigree comparisons may let us constrain the parameters describing the mutational process.

To my mind, this is where paleoanthropologists need to be most attentive: Molecular methods are not in conflict with fossil approaches, they implicitly depend upon them. Yet, communication between the two fields rarely involves actual numbers, so a frequent occurrence is that a "bottleneck" in paleoanthropology with a 10 percent reduction in population becomes a "bottleneck" in genetics with a 1000-fold reduction in population.

Testing of demographic hypotheses moved on to genome-wide polymorphism data several years ago. The logical equivalent for species divergences is lineage sorting -- a model that's been applied since the mid-1990s. The hominoids are extremely well studied from the standpoint of molecular systematics, and remain the central example in most theoretical papers incorporating multiple loci. This year I have noticed several interesting implementations of whole-genome polymorphism comparisons among species embedded in phylogenetic trees. The higher mutation rate of CpG sites has long been known, but we now know that a 50-bp or longer flanking region may influence local mutation rate. As we move from genes to gene networks, our comparisons will involve not the same nucleotide, but classes of mutations across classes of genes.

This is another of those cases where the future lies in better algorithms. Edwards seems a man after my own heart -- the computer programs lend a superficial veneer of rigor, when the underlying assumptions are in need of challenge:

Producing phylogenies directly from gene sequences essentially in one step, without additional transformations, is now the dominant mode of phylogenetic analysis and indeed it has advanced the field enormously. Nonetheless, I suggest that the very success of this paradigm and the ease with which phylogenies could be produced directly from DNA matrices led to a comfort zone in phylogenetics. If we can imagine systematic methods themselves as a likelihood surface, I suggest that the current paradigm is a local optimum in that surface, an optimum that is useful but ultimately incomplete in so far as it has failed to model the potential for gene tree/species tree discordance even cursorily (Fig. 3) (Edwards 2009:6).

His theme is an old one -- how do we use "total evidence" methods in phylogenetics? Variance among loci gives the problem a newish twist, one that may add information that other techniques have left on the table. But we have to wring it out of the data.

References:

Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1-19. doi:10.1111/j.1558-5646.2008.00549.x

An (old) interview with Warren Ewens

I ran across an interview between Anya Plutynski and population geneticist Warren Ewens.

I cannot say enough about Ewens' book, Mathematical Population Genetics. If you can work through it, you can do population genetics. It doesn't cover every au courant topic, but those will change next week anyway. And it's on Kindle now. Which I suppose probably looks pretty good on the DX, assuming the math displays well -- the book's format is just the right size for it.

Anyway, this interview from 2004 was probably conducted around the time the book was released. It covers pretty much the gamut of his career. I have to select some part to quote for you, so I'll select the passage that would be most likely to come out of my own mouth in my genetics class:

WE: Of course there is a strong possibility that the neutral theory is assumed not because it is appropriate but because the math of that theory is so very simple compared to the math applying for any selective theory.

AP: Can I follow that up? Do you think that that has lead to models of phylogenetic change that is not very well supported by the evidence?

WE: I think that that is quite possible. However, here we enter into another question. In mathematical population genetics theory you know from the very start that you are making big simplifying assumptions. You are in a very different position from a physicist, who might believe that his mathematical models describe reality exactly. No sensible population geneticist would make any claim along those lines. He or she is forced to simplify, because reality is so complicated that you don’t know it in any detail, and even if you did know it and used math describing it faithfully, the analysis would be impossible to carry through. So simplification is unavoidable. I do not know whether the use of the neutral theory is too much of a simplification and has lead us to incorrect and distorted views about the true evolutionary tree, it’s shape and dimensions, but I suspect that there has been quite a significant distortion.

There is much more at the link: some history of association testing, genetic draft, a lot on Ewens sampling theory, and a touch about his work here in Madison.

People often complain that R. A. Fisher wrote in a hard-to-read style, unnecessarily verbose and indirect. Either I don't tend to mind, or I find that the style makes me read with greater care. In either case, there are select passages from his writings that stand out as very clear to me. His description of epistasis and dominance as deviations from additivity, in his famous 1918 paper (p. 404), is one of them:

The steps from recessive to heterozygote and from heterozygote to dominant are genetically identical, and may change from one to the other in passing from father to son. Somatically the steps are of different importance, and the soma to some extent disguises the true genetic nature. There is in dominance a certain latency. We may say that the somatic effects of identical genetic changes are not additive, and for this reason the genetic similarity of relations is partly obscured in the statistical aggregate. A similar deviation from the addition of superimposed effects may occur between different Mendelian factors. We may use the term Epistacy to describe such deviation, which although potentially more complicated, has similar statistical effects to dominance. If the two sexes are considered as Mendelian alternatives, the fact that other Mendelian factors affect them to different extents may be regarded as an example of epistacy.

The terms we use today are familiar by use. A biologist doesn't necessarily consider how idiosyncratic the genetic use of the term "additive" is. When I read a passage like this, it brings to mind a long-ago time when the select group of people using a term all had read the same papers. I wonder how many geneticists still read Fisher during their training. I can tell you this: the bound volume of the Proceedings of the Royal Society of Edinburgh in our library didn't look like it had been picked up for 30 years. I mean, serious dust on the cover.

I wrote last month about how Fisher invented "variance", and noted the very useful property that the variance is a sum of contributions from different causes. It seems remarkable that Fisher could arrive at a statistical framework for identifying the interactions of multiple genes on a trait, at a time when only a relative handful of "Mendelian factors" had yet been found.

Now that we are able to find Mendelian factors in whole-genome association studies, it's remarkable that Fisher's framework is so often forgotten!

References:

Fisher RA. 1918. The correlation between relatives on the supposition of Mendelian inheritance. Proc R Soc Edinburgh 52:399-433.

Phenotypic variance

I've intermittently been reading through William Provine's The Origins of Theoretical Population Genetics. It's related to a project simmering on my back burner.

Meanwhile, last week I was talking with some students about the recent papers at the AAPA meetings about natural selection as assessed by quantitative traits. The students thought that some of these papers had omitted some basic details that seemed obvious from the point of view of quantitative genetics. Also, George Armelagos had mentioned Raymond Pearl, so I figured as long as I'm reading about Pearl, William Castle, R. A. Fisher and their attitudes toward quantitative genetics, I might as well note a few passages from Provine's account.

Provine:

Fisher's express purpose in the paper was to interpret the well-established results of biometry in terms of Mendelian inheritance by ascertaining the biometrical properties of a Mendelian population.

I'll just pause to note that Fisher's formulation begins almost all textbooks in quantitative genetics and many in population genetics. The model that relates quantitative variation and genotypic variation is essential to all genetic analysis.

In particular, he wanted to show that Pearson was mistaken in concluding that the correlations between relatives in man contradicted the Mendelian scheme of inheritance. He began by defining a measure of the variability of a character in a population.

This is an essential step for any introduction to genetics also. I spend some time in all my courses talking about the relationship between genetic and phenotypic variation, using the measures of each as ways to talk about the ways they differ. We can analogize genetic variation to a digital readout -- you have a genotype, or a set of genotypes, and the population's variation has to do with the frequencies of those genotypes or the alleles that comprise them. So the variation is something that emerges from counting genes. You have heterozygosity (expected frequency of heterozygous genotypes), or number of alleles. At the sequence level, you count both alleles and the number of mutations that separate them -- average pairwise difference, number of segregating sites.
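Since these measures all come from counting, they are easy to sketch in a few lines. The function names and the toy alignment below are made up for illustration; real data would need handling for missing sites and unequal sequence lengths.

```python
from itertools import combinations

def expected_heterozygosity(allele_freqs):
    """Expected frequency of heterozygotes: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)

def pairwise_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def average_pairwise_difference(seqs):
    """Mean number of differences over all pairs of sequences in the sample."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)

def segregating_sites(seqs):
    """Number of aligned sites at which the sample is variable."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))

# Made-up toy alignment of three sequences.
sample = ["ACGT", "ACGA", "ATGA"]
print(expected_heterozygosity([0.9, 0.1]))
print(average_pairwise_difference(sample))
print(segregating_sites(sample))
```

Every one of these is a tally of discrete states, which is the sense in which genetic variation resembles a digital readout.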

Back to Provine:

Often the standard deviation σ was used for this purpose. But Fisher noted that

Now Provine gives a direct quote from Fisher 1918:

when there are two independent sources of variability capable of producing in an otherwise uniform population distributions with standard deviations $\sigma_1$ and $\sigma_2$, it is found that the distribution, when both causes act together, has a standard deviation $\sqrt{\sigma_1^2 + \sigma_2^2}$. It is therefore desirable in analysing the causes of variability to deal with the square of the standard deviation as the measure of variability. We shall term this quantity the Variance of the normal population to which it refers, and we may now ascribe to the constituent causes fractions or percentages of the total variance which they together produce (Fisher 1918:399).
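Fisher's point is easy to check numerically: for independent causes the standard deviations do not add, but their squares do. The sketch below draws two independent normal causes with hypothetical standard deviations of 3 and 4 and looks at their sum; the values and the helper function are mine, not Fisher's.

```python
import random

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(1)
n = 100_000
sigma1, sigma2 = 3.0, 4.0   # hypothetical standard deviations of the two causes

cause1 = [rng.gauss(0.0, sigma1) for _ in range(n)]
cause2 = [rng.gauss(0.0, sigma2) for _ in range(n)]
combined = [a + b for a, b in zip(cause1, cause2)]   # both causes acting together

combined_sd = variance(combined) ** 0.5
share1 = variance(cause1) / variance(combined)
print(f"sd of combined trait: {combined_sd:.2f}")        # near sqrt(3^2 + 4^2) = 5, not 7
print(f"share of variance from cause 1: {share1:.2f}")   # near 9 / 25 = 0.36
```

Because the variances add, it makes sense to speak of the fraction of the total that each cause contributes, which is the step that makes the rest of Fisher's analysis possible.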

I have always thought that this was a work of magic by Fisher. The additive quality of variance is such a useful characteristic for a measure of variation, it's hard to imagine using anything else. Fisher continues:

For stature the coefficient of correlation between brothers is about .54, which we may interpret by saying that 54 per cent of their variance is accounted for by ancestry alone, and that 46 per cent must have some other explanation.

It is not sufficient to ascribe this last residue to the effects of environment. Numerous investigations by Galton and Pearson have shown that all measurable environment has much less effect on such measurements as stature. Further, the facts collected by Galton respecting identical twins show that in this case, where the essential nature is the same, the variance is far less. The simplest hypothesis, and the one which we shall examine, is that such features as stature are determined by a large number of Mendelian factors, and that the large variance among children of the same parents is due to the segregation of those factors in respect to which the parents are heterozygous. Upon this hypothesis we will attempt to determine how much more of the variance, in different measurable features, beyond that which is indicated by the fraternal correlation, is due to innate and heritable factors (Fisher 1918:400).

And that, in a nutshell, is why the correlation between relatives is not a measure of heritability. Fisher attempted to show that the segregation of Mendelian factors could account for a large fraction of the variance of stature, and substantially succeeded in showing that the environment had much less impact than had been assumed from the correlation between relatives.

Provine's discussion continues along a different line, but he includes the characteristic line:

Fisher's 1918 paper was well received by the few geneticists who could understand his mathematics (147).

Could genetic drift really break your heart?

Are these people crazy?

The combination of such a large risk with such a high frequency is, fortunately, unique. "How can such a harmful mutation be so common?" asks Chris Tyler-Smith from The Wellcome Trust Sanger Institute, Hinxton, UK. "We might expect such a deleterious change to have 'died out'.

"We think that the mutation arose around 30,000 years ago in India, and has been able to spread because its effects usually develop only after people have had their children. A case of chance genetic drift: simply terribly bad luck for the carriers."

This is a 25-bp deletion in a muscle protein gene, MYBPC3. The current allele frequency in India is estimated to be 4 percent; it is estimated to be carried by 60 million people. The paper suggests that it originated 30,000 years ago. Carriers of the gene have a massive increase in their chance of cardiomyopathy.

Here's the relevant passage from the paper:

The presence of a disease-associated variant at substantial frequency raises an evolutionary question: if it is disadvantageous, how did it become so common? In principle, it could be evolutionarily neutral, manifesting its disadvantages only late in life; alternatively, its disadvantages could be outweighed by advantages early in life, or in a different environment, so that it could have been positively selected. To address this question, we examined the haplotype structure surrounding the deletion. Using five short tandem repeat (STR) markers, spanning ca. 3.4 Mb surrounding the deletion in 287 heterozygous individuals, we found similar high degrees of variation in the inferred haplotypes from chromosomes with and without the deletion (Supplementary Fig. 7 and Supplementary Table 6 online). We then used allele-specific amplification to resequence ca. 10-kb haplotypes centered on the 25-bp deletion from nine heterozygous individuals (Supplementary Tables 7 and 8 online). The chromosomes carrying the 25-bp deletion showed five closely related haplotypes (Supplementary Fig. 8 online). After excluding variants likely to have arisen by recombination, we estimated a time to most recent common ancestry (TMRCA) of ca. 33 ± 23 thousand years for the deletion haplotypes (Supplementary Methods). This time slightly postdates the initial peopling of the subcontinent 30,000–50,000 years ago and together with its restricted geographical distribution suggests that the deletion did not arrive with the first modern human settlers from Africa [more than] 50,000 years ago, but arose subsequently within the subcontinent. Its occurrence in two populations from Southeast Asia can be explained by recent gene flow from India (Supplementary Note online). Collectively, these observations provide no evidence for rapid spread of a recent founder haplotype or any departure from neutral evolution (Dhandapany et al. 2009:4).

The issue is not really whether a gene could go from 1 copy to 4 percent in 1200 generations by chance. That wouldn't be so terribly unlikely in Pleistocene humans -- in fact, the mean time for a mutation to go from 1 copy to 4 percent by drift in a population of effective size 10,000 individuals is not 30,000 years, but only around 20,000 years. On the other hand, mtDNA variation today suggests that South Asia experienced early and rapid population growth -- so we're not likely talking about a population of 10,000, but more like a minimum of 100,000 effective individuals through the past 30,000 years at least. It would take genetic drift at least 10 times longer to accomplish the requisite frequency change given that demographic history. Still, a single allele at a single gene locus might be exceptional.

But that scenario, however unlikely, is simply not the situation we have here. Here we have a deletion that must have some disadvantage, because it gives people a fatal disease. This disadvantage is apparently dominant in effect, based on the case-control study. Yet the deletion has managed to persist within the large South Asian populations of the last 10,000 years so that today it is still around 4 percent.

People mainly die of cardiac problems after age 40. But human reproductive lives aren't over until they're done investing in their children. Further, a weakened heart may reduce work potential or health even if it kills slowly. The fitness cost of this deletion is smaller than if it gave people a chance at a fatal disease when they are 17, but a smaller fitness cost is still a fitness cost. In a large population, that small fitness cost is going to whittle away the frequency of the allele over time.

A thousand generations is a lot of potential whittling. Using some quick calculations, it looks like a selection coefficient against the deletion as low as 0.001 to 0.0015 in heterozygotes should have been enough to cut the frequency down to around 1 percent, from an initial value of 4 percent. So even if drift increased the deletion early after its origin, it ought to be much rarer today. Meanwhile, drift alone looks even more unlikely, since the chances of a mutation growing from 1 copy to 4 percent against such selection are nil.
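For anyone who wants to check this kind of back-of-the-envelope arithmetic, here is a minimal Python sketch. It is not the calculation behind the numbers above, just one way to reproduce their flavor. It assumes an additive cost (relative fitnesses 1, 1 - s, 1 - 2s), random mating, and the standard diffusion approximation for the chance that a new mutation reaches a target frequency before being lost. The function names and the choice of 100,000 for effective size are illustrative assumptions.

```python
import math

def freq_after_selection(p0, s, generations):
    """Deterministic frequency of a deleterious allele after some generations.

    Assumed fitnesses: 1 (no copies), 1 - s (one copy), 1 - 2s (two copies),
    with random mating and no drift or mutation.
    """
    p = p0
    for _ in range(generations):
        # One-generation recursion for additive selection against the allele.
        p = p * (1.0 - s * (1.0 + p)) / (1.0 - 2.0 * s * p)
    return p

def prob_reach(target, n_e, s):
    """Diffusion approximation to the chance that a single new copy reaches
    frequency `target` before being lost, given heterozygote cost s.

    Assumes census and effective size are both n_e, so the starting
    frequency is 1 / (2 * n_e). Not guarded against overflow for very
    large 4 * n_e * s * target.
    """
    p0 = 1.0 / (2.0 * n_e)
    if s == 0:
        return p0 / target  # neutral case
    k = 4.0 * n_e * s
    return math.expm1(k * p0) / math.expm1(k * target)

if __name__ == "__main__":
    for s in (0.001, 0.0015):
        print(s, freq_after_selection(0.04, s, 1000))
    print("neutral:", prob_reach(0.04, 100_000, 0.0))
    print("s = 0.001:", prob_reach(0.04, 100_000, 0.001))
```

Under these assumptions, a thousand generations of selection at 0.001 to 0.0015 takes a 4 percent allele down to the neighborhood of 1 percent. In a population of 100,000 effective individuals, the neutral chance of a single copy ever reaching 4 percent is about 1 in 8,000, and with a cost of 0.001 it is smaller by several orders of magnitude.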

Did this deletion have a fitness cost as high as one in a thousand? It increases the risk of cardiomyopathy 5-fold or more compared to the wild type. So it seems very plausible. But really, we don't have any good estimates of the fitness costs of chronic diseases in pre-industrial populations.

If the deletion was favored by some selection, that selection would probably be antagonistic, that is, acting against the fitness cost of the deletion late in life. The authors briefly investigated this hypothesis, as described above. They found no evidence for a recent expansion of a single haplotype around the deletion. That means that if there was strong selection favoring this deletion, it must have happened early after its origin and then petered out. If the expansion had been late in South Asian history, the deletion would show more LD around it, and most of the deletion-carrying chromosomes would share a single long-range haplotype. So this deletion has not been increasing rapidly in the past few thousand years.

I would hypothesize that the disadvantages of the deletion have actually increased over time. The average lifespan increased into the Upper Paleolithic and probably later as well. Meanwhile, as the population grew, larger completed family sizes became more important to fitness. As people became more sedentary, the accumulation and inheritance of possessions and land became an important means of investing in children. The increasing importance of later survival and investment in children should have raised the fitness cost of chronic disease. That would explain a pattern of evolution in which this deletion increased in frequency early in its history, but later remained static or declined.

So, I don't suppose I can say people are crazy for thinking genetic drift could explain this deletion's current high frequency. But considering the powerful effect of weak selection over the many generations involved here, and the very large size of the South Asian population during most of that time, genetic drift seems pretty unlikely.

References:

Dhandapany PS and 23 others. 2009. A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia. Nat Genet (online early) doi:10.1038/ng.309

Cultural impedance, demographic growth, effective population size

This is a complicated story with many interlocking parts. Telling the whole story may well take me fifty posts. There's a lot of new science hiding in here waiting to get out.

I'm starting now because of the new paper by Luke Premo and Jean-Jacques Hublin, titled "Culture, population structure, and low genetic diversity in Pleistocene hominins." This paper is not the final word on its topic, nor is it the first word. But it is very much worth reading.

It makes an excellent point of departure to explain what we know and don't know about the genetics of prehistoric humans. Premo and Hublin propose an interesting model with interaction between culture and natural selection, as an explanation for a 35-year-old problem in human evolution: Our low level of genetic variation.

Their model may be right. I certainly think there's a kernel of truth in it, shared with a number of other models, as I'll describe below. And it's testable -- a project to which we'll be returning in the next few months.
