effective population size

An insertion into deep history

A couple of weeks ago I noted a new article by Chad Huff and colleagues in PNAS. It wasn't available yet when I wrote, but I've had the chance to study it now.

The paper presents a tremendously clever way of using contemporary genetics to look at different time slices in Pleistocene human evolution. If you can imagine traveling to different parts of the human genome and looking at different times in the past, that's more or less what they are doing.

We have the genomes of several people now -- the paper focuses on Venter's sequence versus the official HGP draft sequence, but there are others. A whole genome is limited in its utility to look at genetic variation, but it has some very interesting sampling properties. Much of population genetics theory is based on a simple question: what happens if you sample two individuals at random? How similar are they? What will be the distribution of genetic differences between them? How long ago did each of their genes descend from a single common ancestor? Sampling a diploid genome yields precisely the data for which these questions were designed.

Huff and colleagues dredge up a relatively obscure point of theory. Suppose you take a particular kind of rare event -- they consider mobile element insertions, including Alu and LINE insertions. Even though these elements make up a large fraction of the human genome, the events that give rise to them are rare, occurring only once in a whole genome every 20 births or more. Now, look around the genome and partition it into two kinds of regions. One kind of region will include the rare events (insertions in this case) and the area immediately flanking them. The other will include everywhere else in the genome. Now, the partitioning creates a bias. The areas that include these rare events will, on average, represent more diverse parts of the genome, with deeper genealogies. This is because the intrinsically rare event is more likely to have happened in the long time span represented by such areas than in the relatively shorter times represented by the remainder of the genome. In fact, the average depth of these areas including the insertions should be precisely double the average depth of the areas that lack them.
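The doubling claim is an instance of size-biased sampling, and a quick simulation can check it. The sketch below is only an illustration, not the method Huff and colleagues used. It assumes a constant-size population, so that the coalescence time of the two sampled copies at each locus is exponentially distributed (scaled here to mean 1), and it uses an arbitrary, hypothetical insertion rate per unit of branch length. The function name and its parameters are invented for this example.

```python
import math
import random


def simulate_depth_bias(n_loci=200_000, insertion_rate=0.01, seed=1):
    """Compare mean genealogy depth at loci with and without a rare insertion.

    Assumptions (illustrative only): constant population size, so the
    coalescence time T of two sampled copies is exponential with mean 1
    (in scaled units); insertions fall on the two branches as a Poisson
    process, so a locus carries at least one with probability
    1 - exp(-2 * insertion_rate * T).
    """
    rng = random.Random(seed)
    with_insertion = []
    without_insertion = []
    for _ in range(n_loci):
        t = rng.expovariate(1.0)  # depth of this locus's genealogy
        p_hit = 1.0 - math.exp(-2.0 * insertion_rate * t)
        if rng.random() < p_hit:
            with_insertion.append(t)
        else:
            without_insertion.append(t)
    mean_with = sum(with_insertion) / len(with_insertion)
    mean_without = sum(without_insertion) / len(without_insertion)
    return mean_with, mean_without


mean_with, mean_without = simulate_depth_bias()
depth_ratio = mean_with / mean_without
print(f"mean depth with insertion:    {mean_with:.3f}")
print(f"mean depth without insertion: {mean_without:.3f}")
print(f"ratio: {depth_ratio:.2f}")
```

Under these assumptions the ratio should come out close to 2, because the chance that a locus catches an insertion is proportional to the length of its genealogy.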

In other words, looking at these rare events is sort of like opening the box on Schroedinger's cat. There's something that we shouldn't be able to find out a priori -- how old is the genealogy of a part of the genome? By sifting through the genome and picking out all the parts that have these insertions, we know something about them: We know that they represent a time interval double that of the rest of the genome. Our looking at these insertions has collapsed the likelihood function that relates genetic location to age. When we look at the variation around insertions, we can then ignore some of the events that changed the population's diversity in the last couple of hundred thousand years. And by comparing these sites with the rest of the genome, we have another way to test hypotheses about whether the population was once a lot bigger or smaller than it has been over the last few hundred thousand years.

The analysis shows that the population in that early part of the genealogy -- corresponding more or less to dates over 1.2 million years ago -- was consistent with an effective population size of 18,000 individuals, give or take. As I pointed out in my earlier post, that value itself isn't surprising -- it's a bit higher than the genome-wide average. The best-fit model, including both areas near insertions and the rest of the genome, was one in which the effective population size actually declined from 18,500 to 8,500 individuals at 1.2 million years ago. They explain that the recent value should be depressed by the separation of present human populations -- Venter and the human reference sequence both being primarily derived from Europe, they undersample human variation.

Now, it's easy to see some of the limitations on the analysis. The authors considered only a two-epoch model of population history. That is to say, once upon a time the population was x individuals, then at some time t, the population became y individuals. Two epochs of population size, separated by one time. Clearly the actual history of human populations was more complicated than this, but does it matter? Recent history will not greatly influence nucleotide diversity, and in particular the insertions -- because they are intrinsically rare -- are likely to reflect much more ancient events that have survived any subsequent vicissitudes of population.
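To make the two-epoch model concrete, the expected coalescence time for a pair of sampled copies can be written in closed form. The sketch below is not the likelihood machinery of the paper. It assumes a diploid population in which two lineages coalesce at rate 1/(2N) per generation, plugs in the best-fit sizes quoted above (8,500 recent, 18,500 ancient, with the change at 1.2 million years), and uses a generation time of 25 years, which is my assumption and not a value from the paper. The function name is hypothetical.

```python
import math


def expected_pairwise_depth(n_recent, n_ancient, t_change):
    """Expected coalescence time (generations) of two copies, two-epoch model.

    The population has effective size n_recent from the present back to
    t_change generations ago, and n_ancient before that. Two lineages
    coalesce at rate 1 / (2N) per generation (diploid assumption).
    """
    # Probability the two lineages have NOT coalesced within the recent epoch.
    p_escape = math.exp(-t_change / (2.0 * n_recent))
    return 2.0 * n_recent * (1.0 - p_escape) + 2.0 * n_ancient * p_escape


YEARS_PER_GENERATION = 25  # assumed, not from the paper
t_change = 1_200_000 / YEARS_PER_GENERATION  # 48,000 generations

two_epoch = expected_pairwise_depth(8_500, 18_500, t_change)
constant = expected_pairwise_depth(8_500, 8_500, t_change)
print(f"two-epoch expected depth:     {two_epoch:,.0f} generations")
print(f"constant-size expected depth: {constant:,.0f} generations")
```

With these inputs only a small minority of pairs escape coalescence in the recent epoch, so the ancient size shifts the genome-wide average only modestly. That is one way to see why regions selected for deeper genealogies carry most of the information about the earlier epoch.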

But, I suspect that the distribution of insertions with relation to recent selection will make an appreciable difference to the nearby SNP diversity. The geographic distribution of variation will also make some difference, although we won't know how much until we look at non-European genomes.

Meanwhile, if I were looking to the archaeological record to identify times that made a difference to the human population, 1.2 million years ago would really not register. It certainly would not strike me as a time of substantial reduction of the human population.

The lack of any archaeological referent is typical of such studies -- after all, they're not trying to match numbers from archaeology, they're trying to establish internally consistent genetic tests of population history. But if these values are real, they must match what we know from the fossil and archaeological record. There is some text in the paper about the small effective size and its relevance to humans as a sign of repeated bottlenecks or other events. As I pointed out earlier, I think 18,000 is pretty large compared to most other estimates of human effective population size. When we get an estimate of human effective size so near those of other apes, we are looking at a value consistent with habitation of a large, certainly continent-wide range by large populations. So now I have to think what the pertinent comparison from the archaeological record should be.

One archaeological comparison is of special interest to me: a real-life comparison that will be immediately relevant. This study should be giving us information about the population ancestral to Neandertals and humans. In that sense, it duplicates the information that we ought to be able to derive from the comparison of human and Neandertal genomes.

Interestingly, the effective size estimates published so far for the human-Neandertal ancestral population are much lower than the 18,500 estimated in this study. Green and colleagues (2006) made a point estimate of 3000 effective individuals at the time of Neandertal-human divergence. That estimate is likely to be supplanted by the Neandertal genome release, because the Green et al. (2006) estimate was influenced by some fraction of contaminating sequence from humans. And the error bars on that estimate are large. But there's a lot of space between the two estimates -- we're talking about at least a sixfold difference.

Something doesn't add up. The human-Neandertal ancestral population must have contained all these polymorphic insertions that supposedly occurred before 800,000 years ago. The effective size of the population may have been lower, but if so we should look for some explanation for that substantial loss of variation.

UPDATE (2010-02-10): A couple of people have asked about effective population size. Here's a helpful post that explains why a small effective size may not mean a small population size, and some of the current hypotheses that try to explain the human value.

References:

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444:330-336. doi:10.1038/nature05336

Huff CD, Xing J, Rogers AR, Witherspoon D, Jorde LB. 2010. Mobile elements reveal small population size in the ancient ancestors of Homo sapiens. Proc Nat Acad Sci USA (early online) doi:10.1073/pnas.0909000107

High Pleistocene human effective population size

Nicholas Wade is reporting on an upcoming paper by Chad Huff and Lynn Jorde: "Genome Study Provides a Census of Early Humans".

The Utah team based its estimate on the genetic variation present in two complete human genomes, one prepared by the government’s human genome project and the other by J. Craig Venter, the genome sequencing pioneer. The government decoded a single copy of a mosaic genome derived from a medley of people, apparently of European and Asian origin. Dr. Venter decoded both copies of his own genome, the one inherited from his father and the one from his mother.

The Utah team thus had three genomes to work with and looked at ancient elements known as Alu insertions, the youngest class of which appeared in the human genome around a million years ago. The amount of variation seen in the DNA immediately surrounding the Alu insertions gave a measure of the size of human population at that time.

Their estimate agrees almost exactly with an earlier one, also based on Alu insertions but with sparser data. The insertions tag ancient regions of the genome that are unaffected by the recent growth in population, Dr. Huff said.

I'll probably write some more notes on this when I can get a copy.

At the moment I think it's worth pointing out that the lede of Wade's story is exactly backward. The story is all about how the effective size estimate, 18,500 effective people, is very low. But in reality that's a high estimate compared to what most human geneticists have assumed, only 10,000 individuals.

Neither estimate is really news. Observations in the early 1970s established that 10,000 was around the right order of magnitude for human effective population size. Around 10 years ago, some gene systems, including Alu insertions, appeared to support a higher estimate of effective size up around 18,000 individuals. That still seemed pretty small in evolutionary terms, and didn't change anybody's ideas about ancient population bottlenecks.

The differences between these estimates have never really been resolved. As more and more genes got sequenced, human geneticists seem to have just standardized on the small estimate of 10,000 effective individuals -- even as they started to apply more and more complicated computer models to try to derive estimates of expansion and bottleneck times. (I wrote about the problem of effective population size last year, "Cultural impedance, demographic growth, effective population size".)

A few years ago we started to get good effective size estimates for other primates. As Wade's article points out, the genetic variation of chimpanzees and gorillas leads to estimates of effective size on the order of 25,000 or so individuals. Geneticists noted that these species are therefore much more diverse than humans, with our puny effective size of around 10,000 individuals. Only bonobos seem to be close to the low human value.

Well, if Huff and Jorde are right, human variation is a lot like the amount of variation in chimpanzees and gorillas. Those other apes have lived in geographically structured subspecies spanning tropical Africa for several hundred thousand years.

Or have they? Maybe there were massive bottlenecks and population replacements among chimpanzee subspecies. Maybe there was a recent "out of Congo" migration that accounts for the low genetic variation of bonobos. Maybe chimps themselves derive recently from some part of their current range.

Or, maybe the human effective population size isn't so probative.

In any event, the genomes here are all Eurasian. I wonder how much African genomes will increase the diversity? Could it be that we're even more diverse than chimpanzees?

Mutual information between strings of loci

Fourth in a series on mutual information and genetic linkage. If you’re happening upon it for the first time, you can find the entire series or the first post, “Information theory: a short introduction”.

After the last post, you might wonder what the big deal is about these information theoretic measures of linkage. After all, we’ve got lots of other measures of linkage to choose in population genetics, with many years of theory behind them. The basic conclusion about genetic drift was that it adds mutual information to samples over short regions, but that recombination over longer areas washes it out. If the net effect is no linkage, why would we bother to come up with some non-standard linkage measure?

One answer: If the existing linkage measures were so great for testing neutrality, then we might expect some of the recent genome-wide selection scans to have used them. But they didn’t – instead we have several partially incompatible methods, all of which eschew the usual measures of linkage.
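For readers new to the measure, mutual information between a pair of loci can be computed directly from a sample of haplotypes. The sketch below is a minimal illustration, not code from this series. The function names and the toy data are invented. The scaling G = 2n × MI (with MI in nats) is the standard G-test statistic, which is approximately chi-square distributed when the loci are independent, with one degree of freedom for two biallelic sites.

```python
import math
from collections import Counter


def mutual_information(haplotypes):
    """Mutual information (in nats) between two loci.

    `haplotypes` is a list of (allele_at_locus_a, allele_at_locus_b) pairs,
    one per sampled chromosome.
    """
    n = len(haplotypes)
    joint = Counter(haplotypes)
    count_a = Counter(a for a, _ in haplotypes)
    count_b = Counter(b for _, b in haplotypes)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p_ab * log(p_ab / (p_a * p_b)), written in terms of counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def g_statistic(haplotypes):
    """G = 2 * n * MI (nats); approximately chi-square under independence."""
    return 2.0 * len(haplotypes) * mutual_information(haplotypes)


# Toy samples of 20 chromosomes each (made-up data for illustration).
linked = [("A", "T")] * 10 + [("G", "C")] * 10  # complete association
unlinked = [("A", "T"), ("A", "C"), ("G", "T"), ("G", "C")] * 5  # none
mi_linked = mutual_information(linked)
mi_unlinked = mutual_information(unlinked)
print(f"MI, complete association: {mi_linked:.3f} nats")
print(f"MI, no association:       {mi_unlinked:.3f} nats")
print(f"G, complete association:  {g_statistic(linked):.2f}")
```

Complete association between two evenly split biallelic loci gives ln 2 nats (one bit), and a perfectly balanced sample with no association gives zero.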

When genetic drift reduces entropy

This is the third in a series on information theory and tests for recent selection. The first post, “Information theory: a short introduction”, covered some of the basics of entropy. The second post, “Information theory and mutual information between genetic loci”, showed that mutual information between independent sites will be distributed as a χ².

We tend to think of genetic drift as a random process. Random processes operating repeatedly over time are called “stochastic,” and changes in gene frequency under genetic drift are certainly that.

Since entropy is a measure of uncertainty, it might seem natural to think that stochastic changes in gene frequency would increase the entropy in a population. After all, the gene frequency in a population under genetic drift will be more and more uncertain over time. So, considering the frequency of a single allele as the system, genetic drift appears to increase entropy over time.

But even this simple system isn’t quite so simple as it might appear. Sure, if you start out knowing the allele frequency, then genetic drift will increase your uncertainty over time. You will become less and less able to say that it lies in any given interval. But what if you don’t start out knowing? What if all you know is that the locus has been subjected to t generations of genetic drift?

As t increases, the probability of fixation of the locus also increases. The net effect is to reduce the entropy in the system – going from uncertainty about the allele frequency to more and more certainty that it will be either one or zero. The only thing that will stop this process is some other evolutionary force – mutation, migration from other populations, balancing selection. Each of these will have its own distinctive effects on the entropy of the single-locus system.
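The loss of entropy can be seen in an exact calculation on a small Wright-Fisher population. The sketch below is an illustration under stated assumptions, not an analysis from this series: it takes a population of 20 gene copies, starts from complete ignorance about the allele count (a uniform distribution over 0 to 20 copies), and applies pure drift with no mutation, migration, or selection. The function names and parameter values are invented for this example.

```python
import math


def wright_fisher_matrix(n_copies):
    """Transition matrix P[i][j]: i copies now -> j copies next generation."""
    matrix = []
    for i in range(n_copies + 1):
        p = i / n_copies
        row = [
            math.comb(n_copies, j) * p**j * (1.0 - p) ** (n_copies - j)
            for j in range(n_copies + 1)
        ]
        matrix.append(row)
    return matrix


def entropy_bits(dist):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(q * math.log2(q) for q in dist if q > 0.0)


def drift_entropy(n_copies=20, generations=200):
    """Entropy of the allele-count distribution after each generation of
    drift, starting from a uniform distribution over 0..n_copies."""
    matrix = wright_fisher_matrix(n_copies)
    dist = [1.0 / (n_copies + 1)] * (n_copies + 1)
    entropies = [entropy_bits(dist)]
    for _ in range(generations):
        # One generation of drift: push the distribution through the matrix.
        dist = [
            sum(dist[i] * matrix[i][j] for i in range(n_copies + 1))
            for j in range(n_copies + 1)
        ]
        entropies.append(entropy_bits(dist))
    return entropies


entropies = drift_entropy()
print(f"t = 0:   {entropies[0]:.3f} bits")
print(f"t = 10:  {entropies[10]:.3f} bits")
print(f"t = 200: {entropies[200]:.3f} bits")
```

The entropy starts at log2(21), about 4.39 bits, and falls toward 1 bit as the probability mass piles up on loss and fixation. The one remaining bit is the uncertainty about which of the two alleles fixed.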

Cultural impedance, demographic growth, effective population size

This is a complicated story with many interlocking parts. Telling the whole story may well take me fifty posts. There's a lot of new science hiding in here waiting to get out.

I'm starting now because of the new paper by Luke Premo and Jean-Jacques Hublin, titled "Culture, population structure, and low genetic diversity in Pleistocene hominins." This paper is not the final word on its topic, nor is it the first word. But it is very much worth reading.

It makes an excellent point of departure to explain what we know and don't know about the genetics of prehistoric humans. Premo and Hublin propose an interesting model with interaction between culture and natural selection, as an explanation for a 35-year-old problem in human evolution: Our low level of genetic variation.

Their model may be right. I certainly think there's a kernel of truth in it, shared with a number of other models, as I'll describe below. And it's testable -- a project to which we'll be returning in the next few months.
