john hawks weblog

paleoanthropology, genetics and evolution

models

  • Passing on your fertility to your kids

    Fri, 2010-05-14 10:37 -- John Hawks

    From the NY Times earlier this spring, a profile of a New York woman with an exceptional legacy:

    WHEN Yitta Schwartz died last month at 93, she left behind 15 children, more than 200 grandchildren and so many great- and great-great-grandchildren that, by her family’s count, she could claim perhaps 2,000 living descendants.

    The story talks about her history and how she came to have such a large family. By itself, having 15 children would be unremarkable except that the children and grandchildren themselves all went on to have large families ("Like many Hasidim, Mrs. Schwartz considered bearing children as her tribute to God."). After a couple of generations, it adds up to a lot of descendants.

    I don't think the story is all that unique. Within the United States there are many communities, like the Hutterites, Old Order Amish, and Hasidic Jews, where large family sizes are the norm. Probably hundreds of women on earth can claim more than a thousand living descendants, and thousands more have only to wait until they are old enough, while their children and grandchildren's families continue to grow.

    You can get there by having 10 children, each of whom has 10, with each grandchild also having 10 -- that adds up to 1110 (10 + 100 + 1000), giving some extra for different generation times and losses. Of course, it's a trick to live long enough to see the 1000 great-grandchildren, but the early ones should already have given you a fraction of your 10000 great-great-grandchildren.
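
    As a rough sketch of that arithmetic, assuming every descendant has exactly the same number of surviving children (the function name here is just illustrative):

```python
def descendants_by_generation(children_per_person, generations):
    """Descendants in each generation, assuming everyone has exactly
    `children_per_person` surviving children (an idealization)."""
    return [children_per_person ** g for g in range(1, generations + 1)]

counts = descendants_by_generation(10, 3)
print(counts)       # [10, 100, 1000]
print(sum(counts))  # 1110
print(descendants_by_generation(10, 4)[-1])  # 10000
```

    Real families vary in size and overlap in time, so the living total at any moment will differ from this tidy sum.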

    What's surprising here? Not the family sizes themselves -- big families are common in most human populations. The high offspring numbers are not as apparent in populations that have high juvenile and infant mortality, but many pregnancies were the norm prior to the industrial transition.

    No, what's surprising about huge numbers of living descendants is the correlation between generations. In these cases, the correlation is driven by religion and various social proscriptions related to religious observance.

    I often talk about models and real human population structures in my classes. One obviously unrealistic aspect of the Wright-Fisher population model is its reproductive variance. In the Wright-Fisher model, reproductive variance is binomial -- every gene in an offspring population is equally likely to descend from each gene in the parental generation. In the model, it is possible -- albeit extraordinarily unlikely -- for a single parent to give rise to the entire offspring generation. That just can't happen in a real population, certainly not in humans. The effect of that unrealistic assumption of the model is not great, however, because even in the model the chances of having more than 10 offspring, while possible in theory, are negligible. If anything, the Wright-Fisher model is too conservative about the variance of offspring number -- real human populations have a non-negligible fraction of women who have 10 or more live births.
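
    To put a number on "negligible," here is a minimal sketch of the binomial tail. The population of 10,000 gene copies is an arbitrary assumption, and `prob_more_than` is a hypothetical helper, not part of any library:

```python
from math import comb

def prob_more_than(k, n_draws, p):
    """P(X > k) for X ~ Binomial(n_draws, p), as 1 minus the lower tail."""
    lower_tail = sum(
        comb(n_draws, i) * p**i * (1 - p) ** (n_draws - i) for i in range(k + 1)
    )
    return 1.0 - lower_tail

n_genes = 10_000  # hypothetical number of gene copies per generation
# Copies left by a single parental gene: Binomial(n_genes, 1/n_genes), mean 1.
per_gene = prob_more_than(10, n_genes, 1 / n_genes)
# Copies left by a diploid parent carrying two genes: mean 2.
per_parent = prob_more_than(10, n_genes, 2 / n_genes)
print(per_gene, per_parent)
```

    Under these assumptions both probabilities come out far below one in ten thousand, which is the sense in which the model understates the share of very large families.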

    I get more concerned about other deficiencies of simple models, which are sometimes harder to deal with. One of those is the correlation of offspring number between generations. If there is even a slight correlation, women tending to have more children because they came from larger families, it has a major effect on the amount of inbreeding in the population.

    You can think about it genealogically. Suppose you live in a small town with a few big families. The chance that you yourself were born into one of those big families is small. But if today's big families tended to come from yesterday's big families, with each generation we go back in time, it becomes more and more likely that one of your ancestors came from one of those big families. Still looking backward in time, your genealogy becomes captured by those big families, branch by branch. Since there are few big families in the town, once two or more lines of your ancestry trace to them, those lines will rapidly share a common ancestor. That's inbreeding, from the perspective of your genealogy.

    In small towns, that process isn't inevitable because people move in from elsewhere. Most of the lines of your genealogy will probably come from other towns within a few generations. But if we consider the human species as a small town, well, there's nowhere else to move in from. If the population structure of our species has included a strong correlation of offspring number between generations, it will have massively reduced our genetic variation.

    Since we have low genetic variation as a species, you can see why this is potentially interesting.

    Masatoshi Nei and Motoi Murata back in 1966 worked out a relation between intergenerational correlation in offspring number and effective population size. That's before the days of computer models, for you simulation jocks out there. The "effective" size of a population, as I've noted here many times, is the one parameter of a Wright-Fisher model, as estimated from the genetic variation within a population. It's a statement about how inbred the population looks, assuming that its evolution followed a random-mating model throughout its history. Now, that model is wrong in pretty much every interesting case, and so there are various mathematical transformations that attempt to account for the effects of different mating structures.

    In the case of intergenerational correlation of offspring number, Nei and Murata derived an expression to predict the reduction of effective size to be expected from this correlation, assuming a model in which the variance in offspring number is distributed in a certain way. The solution isn't general -- if offspring number were distributed in some other way, the effect of the same measured correlation may be quite different. And in their model, they were concerned with the case where the correlation of offspring number is influenced by genes that determine fitness -- in other words, genes under selection in the population. So it's not a complete answer, but it's a start.

    Nei and Murata cited empirical data from several earlier studies that showed a correlation of 0.20 to 0.40 between generations of human offspring number. Under the assumption of their model, a correlation of 0.30 would cause a reduction of the effective size by roughly half.

    That's a big effect. We already expect a reduction of effective size compared to the census count of a human population, because human populations include many non-reproductive individuals -- kids and postreproductive adults make up half to two-thirds of small-scale foragers. If big families have an additional effect of half, it means that the effective size of the population starts out at a fourth to a sixth the census count. So that an effective size of 10,000 really means 40,000 to 60,000 people on the ground.
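
    The arithmetic in that paragraph can be sketched as follows, assuming the two reductions simply multiply (the function and variable names are illustrative only):

```python
from fractions import Fraction

def census_size(effective_size, reproductive_fraction, fertility_factor):
    """Census count implied by an effective size under two multiplicative reductions."""
    return effective_size / (reproductive_fraction * fertility_factor)

halving = Fraction(1, 2)  # reduction from inherited fertility, per the text
low = census_size(10_000, Fraction(1, 2), halving)   # half the population is non-reproductive
high = census_size(10_000, Fraction(1, 3), halving)  # two-thirds are non-reproductive
print(low, high)  # 40000 60000
```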

    Still low, but as one factor among many it may be very important -- and possibly the distribution of variance caused a further decline. It's well worth investigating.

    A correlation of offspring number between generations can be caused by many ecological or cultural factors. Nei and Murata (1966) had considered the case where fitness itself is inherited, because of the presence of selected genes. But in humans, a more pervasive force is cultural inheritance. This factor was discussed in 1976 by the demographer Samuel Preston, attending to the importance of cultural preferences in contemporary populations:

    Since children of each generation are drawn disproportionately from families of women with high fertility achievements in the past, it may be expected that a pronatalist selective bias operates each generation with respect to the transmission of "tastes" for children. It has also been suggested that personality traits which may affect fertility achievement, such as the ability to defer gratification, may be transferred to some extent between parent and child (Kantner and Potter, 1954). It is also reasonable to suggest that biological fecundability is partially inherited. The positive correlation between the social classes of parent and child implies that economic constraints impinging on the childbearing process tend to be similar for the two generations (Preston 1976:110).

    In small-scale societies, these forces are somewhat different. But I wouldn't expect them to be less -- indeed, the social competition between families is probably more intense. The entire "Machiavellian intelligence" model of cognitive evolution implies that these kin-level effects were pervasive throughout human evolution over the past 2 million years or more. A strong cultural inheritance of fitness is really necessary for selection on genes that influence prosocial kin-related behaviors.

    How intense? Seems like a good question to investigate, as it may have a lot of importance to understanding genetic variation in our ancestors -- including our common ancestors with the Neandertals, whose genetic variation was limited just as much as our own.

    On the subject of effective population size, I'll be posting next week about chimpanzees and bonobos. More genetically variable than us? Well, some of them...

    References:

    Preston SH. 1976. Family sizes of children and family sizes of women. Demography 13:105-114.

    Nei M, Murata M. 1966. Effective population size when fertility is inherited. Genet Res 8:257-260.

  • The Neandertal fraction

    Tue, 2010-05-11 11:43 -- John Hawks

    I've gotten the same question a few times, and have seen it elsewhere, so I thought it would be worth a short post to explain it. And for those readers who've also been asked this question, I thought that being able to provide a simple explanation might be a great help.

    How can we say that today's non-Africans derive 1-4% of their genomes from Neandertals, when we are 99.86% genetically similar to Neandertals? Or 98% similar to chimpanzees? I mean, how do we have 4% to work with to make this estimate at all?

    Let me explain with another example.

    You are approximately 99.9% similar to any other random human today. You're just a bit closer to your relatives, because you got some of your DNA directly from them, or you both share DNA from an immediate ancestor.

    Your great-grandmother on average gave you 1/8 of her genes, making up on average 12.5% of your genome.

    You are more than 99.9% similar to your great-grandmother, on average, yet she contributed only 12.5% of your genome.

    In other words, these percentages are different things -- the fraction of your total ancestry you can trace to her, versus the fraction of base pairs you actually have identical to hers. Your genome is much more identical to hers than can be explained solely by your descent from her -- this is because you share other ancestors in common with her, and because mutations don't happen very often.

    Or, think about it the opposite way. Suppose that the 12.5 percent of your genome you inherited from your great-grandmother meant that you were only 12.5 percent genetically similar to her. Where did you get the rest of your genes from? A turnip? No, you got them from other people, all of whom are roughly 99.9 percent like your great-grandmother. You're 12.5 percent more like your great-grandmother than you are like randomly chosen people in the population.
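
    A small sketch may help keep the two percentages apart. It assumes, as a simplification not stated in the text, that the DNA inherited directly from an ancestor is identical to hers and that everything else matches her at the population background rate; new mutations and shared inbreeding are ignored:

```python
def expected_identity(ancestry_fraction, background_identity):
    """Expected fraction of base pairs identical to an ancestor's.

    Assumes DNA inherited directly from her is identical to hers, and the
    rest matches her at the population background rate (a simplification).
    """
    return ancestry_fraction * 1.0 + (1.0 - ancestry_fraction) * background_identity

background = 0.999  # similarity to a random person, from the text
fraction = 1 / 8    # ancestry share from a great-grandmother
identity = expected_identity(fraction, background)
fewer_differences = 1 - (1 - identity) / (1 - background)
print(f"{identity:.6f}")           # 0.999125
print(f"{fewer_differences:.3f}")  # 0.125
```

    The ancestry fraction shows up as a 12.5 percent cut in the number of differences, not as the overall similarity, which stays above 99.9 percent.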

    For the Neandertals, we have to separate these two kinds of similarity, sorting out the genes that we must have inherited from them, from the ones that we share because we share a more distant ancestry.

    Now, suppose we don't know that this woman was your great-grandmother, that it's only a hypothesis. It's the kind of thing a forensic anthropologist might want to figure out, if your great-grandmother was Anastasia. We can answer the question in this way: Test the hypothesis that she's unrelated to you, by examining whether you are equally genetically similar to her as you are to the average, randomly chosen individual from your population.

    This is a statistical test. In fact some of the people in the population share more genetic similarity with you than others, and our statistic has to account for that variation. We can put whatever level of statistical confidence on it we like. If your putative great-grandmother shares substantially more with you than all but some very small fraction of people, we may conclude that she is your relative.
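
    One simple way to frame such a test is an empirical p-value against a reference sample. The sketch below is only an illustration: the similarity values are simulated from an assumed normal distribution, the candidate's value is made up, and a real analysis would have to model population structure as well:

```python
import random

def empirical_p_value(candidate, reference):
    """Share of the reference sample at least as similar to you as the candidate.

    The +1 terms keep the estimate from ever being exactly zero.
    """
    at_least = sum(1 for s in reference if s >= candidate)
    return (at_least + 1) / (len(reference) + 1)

rng = random.Random(0)
# Hypothetical similarities between you and 1,000 random people in the population.
reference = [rng.gauss(0.999, 0.00002) for _ in range(1000)]
candidate = 0.999125  # hypothetical similarity to the putative great-grandmother
p = empirical_p_value(candidate, reference)
print(p < 0.01)
```

    With a candidate this far out in the tail of the assumed distribution, the hypothesis that she is unrelated would be rejected at any conventional level.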

    We might do substantially better -- if the variation in the population doesn't work against us, we might even conclude that she is in fact a third-degree relative who contributed between 10 and 14 percent of your genome. Or even better.

    Our conclusion has to depend on the structure of the population. If randomly chosen people tend to look like you, for some reason of population structure, we'll have to model that population structure directly. This is, of course, what was done in the case of the Neandertal genome -- a specific population model was significantly favored by the data, and alternatives that did not include population mixture were demonstrated to be so unlikely as to be essentially impossible.

    And as I pointed out the other day, if Neandertals had not donated any genes to later populations, then the most recent common ancestors of human and Neandertal genes would all be earlier than the divergence of those populations, more than 250,000 years ago. It is the observation of chromosomal segments that are identical or very near some living human chromosomes that shows that, for some genes in some living people, the Neandertals are not different enough. We have to have some of their genes.

  • The problems of computer-aided biologists, 1

    Wed, 2010-03-17 18:53 -- John Hawks

    On the subject of modeling in genetics, John Timmer of Ars Technica has been running an excellent series on the challenges of computer models in biology. I'll devote a few words to some of these articles in the next several days.

    An article from earlier this winter, "Keeping computers from ending science's reproducibility," discusses the problems with replicability. Data from genomes and genotyping platforms go through frequent revisions, so that the same methods may lead to different results depending on the version of the dataset. Not replicable, in other words, and it may be very hard to track down exactly why slight differences in results persist. It's also hard to verify that the methods are working the same way when the same results aren't found -- it's not like the problem of significant digits in measurement, in other words.

    That problem is compounded when it comes to analytical methods:

    An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.

    "Getting it to work" is too often the major goal in human genetics, where in-house development of population history models is the norm. Rigorous validation of these models is beyond any single lab's purview; to be published, it is enough to cite prior art.

    The end of the article includes some reporting on possible solutions, including this:

    Even if we solve the legal and computational portions of the problem, however, we're going to run into issues with the fact that many of the people who use computational tools understand what they do, but don't feel compelled to learn the math behind them. That's where a paper in the latest edition of Science comes in. Its author, Jill Mesirov of the Broad Institute, describes how many biologists aren't well versed in computational analysis, but are increasingly reliant on tools created by those who are; she then goes on to describe one type of solution, called GenePattern, that she and her colleagues put together with the help of Microsoft Research.

    The idea is to "embed" the actual bioinformatic research methods into the paper, as one would embed a spreadsheet into a Word document. That way, anyone who reads the paper could just run an active version of the methods, to verify the results were accurate, and (potentially) play with the parameters.

    Not a bad idea for the toy example, but for simulations that take days or more to run, it isn't going to be practical. What we need is people to learn the math, not people to dumbly click buttons in a paper.

    The specific idea of an interactive workflow is implemented fairly well in the Galaxy bioinformatics platform. There are definite strengths to that approach -- most importantly, for simple operations it can be incredibly useful to have a running record of what you've done, so that you can get it again yourself. But an equivalent record can fairly easily be accomplished using Python, Perl or any other scripting language. A risk of an online system is that it runs into the versioning problem very quickly -- interactive downloads may bring inconsistent datasets that use different genome draft assemblies, for example.

    In any event, much pain can be circumvented with a little math, in many cases. We should make it a priority to get students a common-sense understanding of how genetic parameters relate to each other.

    UPDATE (2010-03-18): Another section of the article is worth discussion. Along the lines of my post from earlier this year regarding the importance of code sharing and transparency ("The bugs will out"), Timmer wrote:

    "You need the code to see what was done," [Victoria Stodden] told Ars. "The myriad computational steps taken to achieve the results are essentially unguessable—parameter settings, function invocation sequences—so the standard for revealing it needs to be raised to that of when the science was, say, lab-based experiment." This sort of openness is also in keeping with the scientific standards for sharing of more traditional materials and results. "It adheres to the scientific norm of transparency but also to the core practice of building on each other's work in scientific research," she said. But the same worries that apply to more traditional data sharing—researchers may have a competitor use that data to publish first—also apply here. In the slides from her talk, she notes that a survey she conducted of computational scientists indicates that many are concerned about attribution and the potential loss of publications in addition to legal issues. (The biggest worry is the effort involved to clean up and document existing code.)

    A lot of the code we use is really rather simple. The coalescent can be implemented in a few lines, and most common alterations of it can be handled with 10-line subroutines. A forward-time simulation can be done in a single line of Python, and again the common alterations don't take too much to implement.
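
    For readers who have not seen how little code this takes, here is a minimal sketch of both ideas: a neutral coalescent for the time to the common ancestor of a sample, and a haploid Wright-Fisher generation whose body is a single line. The function names and the population of 20 labeled gene copies are assumptions for illustration:

```python
import random

def coalescent_tmrca(n_lineages, rng):
    """Time back to the common ancestor of a sample, in coalescent units.

    While k lineages remain, the wait to the next merger is exponential
    with rate k(k-1)/2 (the standard neutral coalescent).
    """
    return sum(rng.expovariate(k * (k - 1) / 2) for k in range(n_lineages, 1, -1))

def wright_fisher_generation(population, rng):
    """One forward-time generation: each offspring copies a random parent."""
    return [rng.choice(population) for _ in population]

rng = random.Random(42)
population = list(range(20))  # 20 hypothetical gene copies, each labeled uniquely
generations = 0
while len(set(population)) > 1:  # run forward until one label has fixed
    population = wright_fisher_generation(population, rng)
    generations += 1
print(generations, coalescent_tmrca(10, rng))
```

    Keeping the random number generator, the parameters and the model functions separate, as here, is one small way to avoid hard-wiring settings into the code.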

    There are rather radically more complicated models in use, and we should direct more attention to making these human-readable, separating modular elements apart so that they can be run with different simulation engines, and making clear distinctions between functional code, parameters, and data. I've been doing this long enough to know how simple it can be to hard-wire your parameters into the code, undocumented, so that nobody can figure out what is going on but the author. That's not where you want to be.

  • Simulations bubbling like a stew

    Thu, 2009-04-16 11:40 -- John Hawks

    Peter Turchin writes very effectively about quantitative modeling and analytical methods in biology. So every so often I like to post an illuminative quote. Here's his description of maximum likelihood estimation, from Quantitative Analysis of Movement:

    Simpler, more direct analyses may make unwarranted assumptions, but they are better at revealing important patterns in the data, and their results can suggest what variables and functional forms to use in the modeling of data. Eventually, however, direct methods of analysis get beyond the bounds of their competence. The general approach discussed in this section can in principle estimate parameters of any model, given infinite amounts of informative data and infinite computer power.

    The basic approach is to construct a detailed simulation model (better even, a series of models) and fit it to the data using nonlinear estimation techniques. Jon Schnute colorfully describes a detailed simulation as a "stew" of calculations from which observable quantities (to be compared with the actual data) bubble up to the surface (quoted from Hilborn and Mangel 1997). Nonlinear estimation is the process of adjusting the parameters of the stew (adding more or less salt, increasing or decreasing temperature, etc.) until the stuff that bubbles up resembles the actual data the best. The crudest approach is to change parameters in the simulation by the method of trail [sic] and error and to compare the simulation results to data by eye. A more refined approach is to use some quantitative measure of goodness of fit and a nonlinear minimization routine to search for the best fit automatically (Turchin 1998:295).

    The quote has some relevance to yesterday's discussion of the Neandertal population structure paper. I'm philosophically reluctant to turn to simulations until I exhaust my analytical options. This is a matter of trusting myself -- if I really had a lot of confidence in my ability to choose the right assumptions to underlie my simulations, I might turn to them first. But assumptions are tricky. Analytical models have their own assumptions, but those have the advantage of transparency -- I didn't pick them, they are fundamental to the models.

    Still, in some cases it doesn't take long to exhaust the analytical options. So we let the observable quantities "bubble to the surface" of simulations.
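
    A toy version of the "stew" might look like the sketch below. It is not Turchin's method or any published pipeline: it uses a plain grid search rather than a nonlinear minimizer, a haploid neutral drift model, and an observed heterozygosity of 0.41 that is invented for the example:

```python
import random

def simulate_heterozygosity(n_copies, generations, rng, p0=0.5):
    """One forward-time run of neutral drift; returns 2p(1-p) at the end."""
    p = p0
    for _ in range(generations):
        # Binomial sampling of the next generation's allele count.
        count = sum(rng.random() < p for _ in range(n_copies))
        p = count / n_copies
    return 2 * p * (1 - p)

def fit_population_size(observed, candidates, generations, n_reps, rng):
    """Pick the candidate size whose mean simulated value is closest to observed."""
    misfit = {}
    for n_copies in candidates:
        runs = [simulate_heterozygosity(n_copies, generations, rng) for _ in range(n_reps)]
        misfit[n_copies] = (sum(runs) / n_reps - observed) ** 2
    best = min(misfit, key=misfit.get)
    return best, misfit

rng = random.Random(1)
best, misfit = fit_population_size(
    observed=0.41, candidates=[50, 100, 200], generations=20, n_reps=500, rng=rng
)
print(best, misfit)
```

    The squared difference plays the role of the goodness-of-fit measure, and the candidate list stands in for the parameter adjustments. Every choice in it (the starting frequency, the number of generations, the candidate sizes) is an assumption of exactly the kind that is easy to get wrong.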

  • Neandertal races?

    Wed, 2009-04-15 23:41 -- John Hawks

    There's a new paper in PLoS ONE by Virginie Fabre, Silvana Condemi and Anna Degioanni, titled "Genetic evidence of geographical groups among Neanderthals." I think this is an ambitious paper -- it uses 12 mtDNA sequences recovered from Neandertal fossils to compare different phylogeographic scenarios for Neandertal populations.

    The authors applied several different models to the data, attempting to find a population history that matches the geographic distribution of mtDNA diversity in Neandertals. They found that a model in which Neandertals had been part of three long-standing geographic populations was a better fit than others. Here's the relevant part of the abstract:

    In this paper we used a new methodology derived from different bioinformatic models based on data from genetics, demography and paleoanthropology. The adequacy of each model was measured by comparisons between simulated results (obtained by BayesianSSC software) and those estimated from nucleotide sequences (obtained by DNAsp4 software). The conclusions of this study are consistent with existing paleoanthropological research and show that Neanderthals can be divided into at least three groups: one in western Europe, a second in the Southern area and a third in western Asia. Moreover, it seems from our results that the size of the Neanderthal population was not constant and that some migration occurred among the demes.

    I like the study, and I have no strong objections to the conclusion. It has always seemed sort of likely on morphological grounds that Neandertals may have had modest geographic differentiation. Amud 1 doesn't look like a French Neandertal; nor does Teshik Tash. So I'm inclined to think the results are not too surprising.

    Still, the data have some big weaknesses. Phylogeography is a tall order when we only have 12 sequences.

    Many have pointed out, going back to McCown and Keith (1939), that time is another possible cause of morphological differentiation of Neandertals. The mtDNA sequences cover a wide range of times -- the Scladina sequence comes from roughly 100,000 years ago, the others cover the span from 50,000 down to 29,000 years ago. Why not test temporal groups instead of geographic groups? Temporal clusters might reflect interglacial colonizations, differential gene flow, or natural selection. There is a good precedent -- last year a report of complete mtDNA sequences from woolly mammoths found evidence for geographic structure among mtDNA lineages, one of which apparently replaced the other (Gilbert et al. 2008).

    Time is just one example of an alternative model for variation. But I think it helps to clarify the basic problem of the a priori models -- you have to draw boundaries between the specimens somewhere. In the current paper, Fabre and colleagues divided the samples into one, two, or three groups. The one-group model amounts to a simulation of panmixia. The other models are a little like the setup of a Bayesian STRUCTURE analysis -- how well does the sample fit a model in which the the latter as the more likely null hypothesis.

    But unlike STRUCTURE, in this case, each specimen had to be assigned deliberately to one group or another. That's why the authors generated three different versions of their three-group model -- in each version, the boundaries between groups were drawn in slightly different places.

    That's not a criticism of the paper; it's just an inherent property of the method. There's no better way to come up with boundaries of the groups, and I've done similar things in earlier work. It's rational to have the groups contiguous with respect to geography, but without clear isolating barriers, no special reason why the groups should be bounded along any particular line.

    In this case, one of the three-group models provided a substantially better fit between simulated data and the observed mtDNA sequences. So the paper concludes that three groups are supported.

    Thinking about it, I would probably use the data to test a slightly different set of hypotheses.

    I would start with an analytical approach, explicitly testing the hypothesis of panmixia; then explicitly testing isolation-by-distance. Panmixia should be easy to refute -- if you can't refute it, then phylogeography is a non-starter. I see isolation-by-distance as the appropriate null hypothesis, and while the time-dispersion of the samples makes life a little more complicated, a simple test of IBD would be straightforward. I don't expect you need simulations for either of these tests, although you could use simulations to explicitly include the ages of the specimens in the test.

    One reason to start with IBD is that the specimens are heterogeneously distributed through space. Since there are some large gaps in the geographic distribution, the observed sequences may tend to clump into groups even if no real boundaries between groups existed. The best-supported model in the paper divides the sample with a clump of Italian and Croatian specimens, a large gap between Ukraine and Uzbekistan, and most of the specimens in one large group. That looks like a pattern that might be consistent with IBD, complicated by the actual ages of the specimens.
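
    One common way to run a simple IBD test is a Mantel test: correlate pairwise genetic distance with pairwise geographic distance, then permute specimen labels to see how often chance alone does as well. The post does not name a specific test, so this choice is an assumption, and the six specimen positions and idealized genetic distances below are invented. Specimen ages are ignored:

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def mantel_test(genetic, geographic, n_perm, rng):
    """Correlation of two distance matrices, with a one-sided permutation p-value."""
    n = len(genetic)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    geo = [geographic[i][j] for i, j in pairs]
    observed = pearson([genetic[i][j] for i, j in pairs], geo)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel specimens in the genetic matrix only
        permuted = [genetic[order[i]][order[j]] for i, j in pairs]
        if pearson(permuted, geo) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical specimens along a transect (arbitrary units).
positions = [0, 1, 3, 7, 12, 20]
geographic = [[abs(a - b) for b in positions] for a in positions]
# Idealized case: genetic distance exactly proportional to geographic distance.
genetic = [[0.5 * d for d in row] for row in geographic]
r, p_value = mantel_test(genetic, geographic, n_perm=999, rng=random.Random(3))
print(round(r, 3), p_value)
```

    Real data would be far noisier than this idealized case, and with only 12 sequences the test would have limited power, but no a priori group boundaries need to be drawn.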

    Archaeology

    At any rate, the conclusion in the paper should make one set of people nervous: those who think that Paleolithic archaeological industries reflect populations. I can't see any obvious alignment between these three "groups" of Neandertals and well-known cultural units at any time interval. There are some localized and relatively long-lasting industries or variants within the boundaries of some groups, and there are others that span the boundaries.

    Now, if these groups really reflect long-standing population boundaries -- spanning some 100,000 years in the model -- then we might expect it would have been hard to exchange information across them.

    The same should have been true of Africa, which I've mentioned shows evidence for population differentiation going far back into the Late Pleistocene if not earlier. In Africa, the MSA shows both long-standing variations in different regions and relatively rapid temporal fluctuations within regions. The same general picture holds for the Mousterian, although one may argue whether the correspondence is exact. In any event, the African regional variants show no obvious correspondence to the genetic differentiation of African populations today. Maybe that's because of subsequent changes in the African population -- today's differences don't necessarily reflect those of MSA populations.

    But suppose we take the Neandertal model seriously. Information transfer in living people occurs on a much more rapid timescale than genetic exchange. That cannot always have been true in human evolution -- it isn't generally true of other primates, where long-distance information transfer basically depends on the transfer of individuals from their natal groups. What should an intermediate stage look like, in which the amount of information transfer may be less than in recent human groups (with writing, accounting and vastly more people), but the pace of transfer may have been comparable? I doubt they would correspond well to genetic populations over much longer timescales, although they may be limited by them to some extent.

    Are these Neandertal races?

    I raised the question in my class today. If these really are groups of Neandertals, occupying different geographic ranges for a hundred thousand years, what do we call them? I thought it was a good lead-in to talk about species concepts in paleoanthropology, and of course it is.

    If these aren't species, why aren't they? Presumably because we think that genetic exchanges across this range would have been likely. You can test the hypothesis by comparison with living humans and other primates. Mitochondrial phylogeography of human populations includes some long-standing population structure going back more than 60,000 years. Within great apes, there are long-standing subspecies that go back much further, hundreds of thousands of years. In humans, we tend to call the resulting groups races or populations. Among great apes, we tend to call them subspecies.

    So are these subspecies of Neandertals? Races? Geographic populations? I wouldn't interpret further without really determining the nature of the boundaries here. As I mentioned earlier, I think the null hypothesis is isolation-by-distance. It's conceivable that the Neandertal population was patterned in a similar way to recent humans -- although considering our rapid recent evolution, I wouldn't be quick to assume that human differentiation is a good model.

    One other thing. Let's assume that the Neandertals really were differentiated from each other, and that the groups proposed by Fabre and colleagues are generally right. In that case, the Neandertal Genome Project has been concentrating on an individual from the Southern subpopulation, a subpopulation otherwise very far from the population interface between Neandertals and other humans before 45,000 years ago. Hence, that sequence may be a bad place to look for evidence of interactions between Neandertals and modern humans. Genetic exchanges are more likely to have happened across long-standing areas of contact -- which in Fabre and colleagues' best model, would likely involve the Western or Eastern subpopulations.

    That's entirely speculative on my part, but it does seem to be one implication of the model.

    References:

    Fabre V, Condemi S, Degioanni A. 2009. Genetic evidence of geographical groups among Neanderthals. PLoS ONE 4:e5151. doi:10.1371/journal.pone.0005151

    Gilbert MTP and 32 others. 2008. Intraspecific phylogenetic analysis of Siberian woolly mammoths using complete mitochondrial sequences. Proc Nat Acad Sci USA 105:8327-8332. doi:10.1073/pnas.0802315105

  • Perils of modeling

    Mon, 2009-02-23 22:35 -- John Hawks

    You're not coming here for economic analysis, but I found this Wired article on quants, risk, and the financial crisis useful:

    Bankers should have noted that very small changes in their underlying assumptions could result in very large changes in the correlation number. They also should have noticed that the results they were seeing were much less volatile than they should have been—which implied that the risk was being moved elsewhere. Where had the risk gone?

    They didn't know, or didn't ask. One reason was that the outputs came from "black box" computer models and were hard to subject to a commonsense smell test. Another was that the quants, who should have been more aware of the copula's weaknesses, weren't the ones making the big asset-allocation decisions. Their managers, who made the actual calls, lacked the math skills to understand what the models were doing or how they worked. They could, however, understand something as simple as a single correlation number. That was the problem.

    These models are not so different from genetic analyses, and in fact phenotype prediction on the basis of genome-wide SNP data or sequences will likely involve many of the same problems. In particular, the problem of testing for correlations with limited data is one that I run up against with phenotype evolution quite often.

    On the same topic, Nassim Taleb's essay "The Fourth Quadrant" is also useful in understanding the present difficulties.

  • Gene-culture models and reductionism

    Sat, 2008-11-08 19:40 -- John Hawks

    In the random corners of Google, I was led to a short 2004 letter in American Anthropologist by Daniel Wildcat, Irena Sumi and Vine Deloria, Jr. I found this paragraph thought-provoking:

    No doubt one can find high correlations between genes, languages, and kinship systems in many places. However, the definition of social structure that is associated with such analyses is terribly simplistic and reified. Involving population genetics is particularly misleading: Population genetics employs a mathematical model whose crucial dynamic variable is “mutation” readable in genetic markers such as mtDNA and the Y-chromosome. The frequency of these mutations is matter of speculation: It is deemed accurate or credible when the computed spans of time between mutations fit a preset, hypothetical scenario of a “demic expansion.” Exciting as such speculative science may be, it nevertheless yields bland, linear, and unimaginative speculation on humankind’s past. Moreover, it seems to exploit the fact that few social scientists familiarize themselves with modern genetics, and likewise geneticists seem largely ignorant of what social scientists know about the way humans build their communities and imagine the past, as well as how social scientists in turn represent these notions. This mutual ignorance seems to increasingly produce unquestioned mutual belief (Wildcat et al. 2004:641).

    This is at its root an anti-reductionist argument, and since I think there is substantial promise for reductionist approaches to culture history, I don't subscribe to the motivating spirit of the remark.

    But there is much here with which I do agree. Attempts at connecting genetic variation with linguistic or cultural history have been "bland, linear, and unimaginative." They follow essentially nineteenth-century models of culture transmission, in which both culture and genes are inherited in a vertical direction, and horizontal transfer (of either) has little importance. Because selection has been assumed to be unimportant or insignificant, the genetic models must rely on bottlenecks and isolation to explain genetic differences. Cultural diffusion -- and its extreme manifestation, language and subsistence shifts -- are unwanted noise that only obscures efforts to reconstruct "deep history." Correlations between genes and cultural traits are accidents of history.

    I can't say that the current pattern of biocultural research is wrong. People have developed clever ways to test the usual models, and those models will -- after all -- be correct in some cases. At worst, they will be rejected, and that increases our knowledge as well.

    But there is much of interest that remains to be explained, that must involve different patterns of culture-gene interactions. Not just vertical transmission, but codiffusion, true coevolution of genes and culture traits, and historical constraints on both cultural and genetic changes. As we extend our data across the genome, we can study the interactions not of one or two genetic loci with culture history, but of thousands. Natural selection has been very common, not rare. This gives us an incredible opportunity to test hypotheses about the historical causes of genetic change.

    These efforts may also be criticized as overly reductionist. But I find them very compelling because they make us focus on the role of individuals in culture-historical processes. In the usual models, individuals are passive repositories of genes and culture traits as they are passed forward through time. In models with genetic and cultural selection, individuals become agents, making decisions about adopting traits horizontally or vertically transmitted, and succeeding or failing based on both those decisions and combinations of selected genes.

    References

    Wildcat D, Sumi I, Deloria V, Jr. 2004. A response to Doug Jones. Am Anthropol 106:641.

  • The utility of theoretical models

    Tue, 2008-10-21 16:33 -- John Hawks

    I'm reading through Peter Turchin's 1998 book, Quantitative Analysis of Movement, for a project I'm working on. I found that his second chapter gives a very nice introduction to the reasons why biology depends on formal mathematical models. This is a topic I often review in my courses, so I'll quote some of his discussion.

    He lists six objectives for model-building on pp. 33-35, each with some explanatory text. This amounts to a paragraph or so for each reason; I'm only giving one or two sentences of each, with much omitted.

    1. Formal statement of the problem: ...The necessity of stating the assumptions of the model is another benefit. A mathematical description of a problem forces one to be very clear about what the different variables and parameters in the model are, and how they are interrelated.

    2. Identifying knowledge gaps: ...It may turn out that good quantitative data are available to estimate some functions and parameters but not others, immediately suggesting a focus for the empirical program. When there are many gaps, one has to decide which parameters need to be estimated precisely, and for which parameters "guesstimates" will do....

    3. Gaining theoretical insights: There is a large class of models that are never intended to be directly confronted with data.... The purpose of such models is to gain insights into possible causal interconnections between various factors and, in general, extend our intuition...

    4. Quantitative tests of theory: ...A qualitative prediction allows one to test the theory that generated it, but it does not provide a very strong test. Because there are only a few possible outcomes in a qualitative situation (e.g., factor X will either increase, stay the same, or decrease), the probability that the "correct" outcome will happen by chance is correspondingly high. A quantitative prediction, on the other hand, can be a much stronger test of the theory, because it will not only say that X will increase, but how much...

    5. Interpreting the data: Sometimes an investigator is motivated not by a desire to test general theory, but by the necessity of measuring some specific quantity [that would be impossible to measure directly]...

    6. Forecasting and prediction: ...Forecasting is weaker than prediction, and uses the knowledge of the past behavior of the system to forecast its future state. Forecasting does not necessarily require an in-depth understanding of the system's dynamics, and can be done at the phenomenological level. However, forecasting will most likely fail if the system's dynamics change. I use prediction in its strongest sense: that is, to predict a situation that was not encountered in the past. For example, it may be necessary to generate predictions about how a system's behavior will change as a result of a certain human intervention. Prediction, in general, requires a mechanistic understanding of the system....

    I especially appreciate the point about quantitative tests --- one that has eluded many paleontologists who are content with categorical statements that are essentially untestable, because they only assert that something should happen "regularly" or "more often" than something else.

    Also, the final point, about forecasting and prediction, is valuable -- although perhaps idiosyncratic, as I have not seen that distinction made elsewhere. Still, it applies far beyond theoretical biology and into historical science generally. If we consider our state of knowledge about climate change in response to human activity, clearly this is an example where the distinction between forecasting and prediction is relevant. We can have confidence in a prediction only if it entails a suitable understanding of the mechanisms of change in the system, whereas forecasting is accurate only to the extent that we can depend on a uniformitarian assumption -- that the conditions observed in the past followed the same mechanistic relations that will be relevant to the future.

    I tend to lecture about genetic models, for which there is a great value in simplicity (point 3), but which may require quite complicated extensions to handle reasonable biological populations (point 2). In that connection, some reasonable people go to extremes of interpretation -- sometimes claiming that the data necessitate some assumption on the basis of a very simplified model, and in other cases claiming that no model can apply to the complex history of the population. It is our task (my task) to determine which factors are important and conceivably affect results, and which will always be too weak to influence the interpretation of the data (point 1). And the end will often be to discover evidence for values in past human populations for which we have no direct means of estimating aside from genetic variation (point 5).

    References:

    Turchin P. 1998. Quantitative analysis of movement. Sinauer Associates, Sunderland MA.

  • Handling exponential growth in demographic models

    Fri, 2008-06-06 10:50 -- John Hawks

    Exponential growth is a feature of current human populations, and may represent how the human population behaved during some episodes of its demographic history. However, "exponential" can mean different things to different people, if you're not used to thinking mathematically about growth. So I need to lay out some definitions:

    1. Linear population growth: The same number of individuals is added in each successive time interval. Hence, population size is a linear function of time. Think of driving your car at a constant velocity. Or, you deposit your paycheck every month into a bank account, without interest.

    2. Geometric population growth: The same proportion of individuals is added in each discrete time interval -- for example, in each generation. Time is not measured continuously. Consider a bank account, compounded annually.

    3. Instantaneous population growth: At one discrete time, the population is considered to transition immediately, without any time passing, from a small to a large size. Suddenly, a benefactor makes a large deposit in your bank account.

    4. Exponential population growth: The population grows by a constant proportion per unit time, measured continuously. Consider a petri dish with a growing colony of E. coli, or a bank account compounded continuously.
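
    The four definitions above can be written as small functions. This is only a sketch in Python; the function names and the 5 percent figure in the comparison are my own illustrative choices, not part of any standard package.

```python
import math

def linear(n0, added_per_step, t):
    """Same number of individuals added in each time interval."""
    return n0 + added_per_step * t

def geometric(n0, proportion_per_step, steps):
    """Same proportion added in each discrete interval (e.g. per generation)."""
    return n0 * (1 + proportion_per_step) ** steps

def instantaneous(n_before, n_after, t_change, t):
    """Population jumps from one size to another at a single time point."""
    return n_before if t < t_change else n_after

def exponential(n0, rate, t):
    """Constant proportional growth per unit time, measured continuously."""
    return n0 * math.exp(rate * t)

# Illustrative comparison: 5 percent per year for 100 years, starting from 1.
annual = geometric(1.0, 0.05, 100)        # compounded once a year
continuous = exponential(1.0, 0.05, 100)  # compounded continuously
print(round(annual, 2), round(continuous, 2))
```

    The geometric and exponential results differ even at the same nominal rate, which is why it matters whether time is treated as discrete or continuous.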

    If you drive your car at a constant speed, then in half the time it would take to reach your destination, you will be halfway there.

    But exponential growth does not work this way. Suppose you have a dollar in the bank now, and you invest at a continuous rate equivalent to 5 percent annually. In 100 years, you expect to have $148. If your account grew linearly, you would have $74 in 50 years. But at your exponential growth rate, you will have only $12. In fact, it will take 86 years for your account to reach halfway to its "destination" of $148.
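
    The arithmetic in that paragraph is easy to check. Here is a minimal sketch, assuming a continuous rate of exactly 0.05 per year and a starting balance of one dollar; the variable names are mine.

```python
import math

rate = 0.05   # continuous growth rate per year
start = 1.0   # starting balance in dollars
years = 100

final = start * math.exp(rate * years)             # balance at year 100
linear_mid = start + (final - start) / 2           # straight-line balance at year 50
exp_mid = start * math.exp(rate * years / 2)       # exponential balance at year 50
half_time = math.log((final / 2) / start) / rate   # year when balance is half of final

print(f"final balance:          ${final:.2f}")
print(f"linear, year 50:        ${linear_mid:.2f}")
print(f"exponential, year 50:   ${exp_mid:.2f}")
print(f"years to reach halfway: {half_time:.1f}")
```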

    Now, what if we approached the question from the opposite direction? Suppose that our account really does grow exponentially, that we really did put in one dollar at the beginning, and we really did end up with $148 after 100 years. But suppose that we also really did have $74 in the account after 50 years. The form of the solution here is obvious: we are dealing with at least two different rates of increase -- one for the early part of the 100-year interval, and a different rate for the later part.

    In fact, there are an infinite number of ways that the rate might change over time to attain this result. Maybe it changed 30 years into the span, or 55 years in. Maybe it changed continuously. Maybe the account shrank at some times and grew at others.
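
    To make that concrete, here is a sketch of just one family of solutions: a single change of rate at some year, with both rates chosen so that the account still passes through $1, $74 and $148. The function names are hypothetical, and the single-switch form is my simplifying assumption. Continuous changes or shrinking accounts would need a richer model.

```python
import math

START, MID, END = 1.0, 74.0, 148.0   # balances at years 0, 50 and 100
T_MID, T_END = 50.0, 100.0

def two_rates(switch):
    """Rates (before, after) for one rate change at year `switch`
    (0 < switch < 100) that still pass through all three balances."""
    log_mid = math.log(MID / START)
    log_end = math.log(END / START)
    if switch <= T_MID:
        # The second rate alone carries the account from year 50 to year 100.
        r2 = (log_end - log_mid) / (T_END - T_MID)
        r1 = (log_mid - (T_MID - switch) * r2) / switch
    else:
        # The first rate alone carries the account from year 0 to year 50.
        r1 = log_mid / T_MID
        r2 = (log_end - switch * r1) / (T_END - switch)
    return r1, r2

def balance(t, switch):
    """Account balance at year t when the rate changes once at `switch`."""
    r1, r2 = two_rates(switch)
    if t <= switch:
        return START * math.exp(r1 * t)
    return START * math.exp(r1 * switch + r2 * (t - switch))

for switch in (30, 50, 55):
    r1, r2 = two_rates(switch)
    print(switch, round(r1, 4), round(r2, 4),
          round(balance(50, switch), 2), round(balance(100, switch), 2))
```

    Every switch year gives a different pair of rates, yet all of them agree at the three observed balances.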

    We can only attempt to deal with these unknowns by taking additional samples. What was the account balance after 20 years? After 21? 22? 73? I'll call these observations "signposts" -- because they give us markers along the path taken by the size of the account.

    You get the idea: this bank problem is very much like our problem reconstructing ancient demography in human populations. When we consider genetic variation, what we observe in today's genes was affected not only by the population sizes at the signposts that we observed in the past, but by every point in between.

    Suppose that our bank account was not merely symbolic money, but that the bank put in actual pennies when the amount increased. It's a simple enough matter to examine all 14,800 pennies at the end of the 100 years. We can ask, how many of those pennies will have mint marks dating 20 years into the span? How many will have mint marks dating 73 years in? The answers to those questions depend on the account balances across the entire 100-year span. That is the kind of question that we address about human history when we observe today's genetic variation. How many people today share haplotypes that originated 5000 years ago? What about 35,000 years ago? 143,000?
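
    A rough sketch of the penny count shows how much the answer depends on the path. I assume that every penny added in a given year carries that year's mint mark and that no penny ever leaves the account; the two balance histories are illustrative, and both start at $1 and end at $148.

```python
import math

def single_rate(t):
    """Balance in dollars under one continuous rate (about 5 percent a year)."""
    return math.exp(math.log(148.0) / 100 * t)

def two_rate(t):
    """Alternative history: $74 at year 50, then slower growth to $148."""
    r1 = math.log(74.0) / 50
    r2 = math.log(148.0 / 74.0) / 50
    if t <= 50:
        return math.exp(r1 * t)
    return 74.0 * math.exp(r2 * (t - 50))

def pennies_minted_in(history, year):
    """Pennies added during `year` (counting from 1) under a balance history."""
    return 100 * (history(year) - history(year - 1))

def share_minted_by(history, year, end=100):
    """Fraction of the final pennies with mint marks from `year` or earlier."""
    return history(year) / history(end)

for name, history in (("single rate", single_rate), ("two rates", two_rate)):
    print(name,
          round(pennies_minted_in(history, 20)),
          round(pennies_minted_in(history, 73)),
          round(share_minted_by(history, 50), 3))
```

    Both histories agree at the start and the end, but they leave very different mixes of old and new pennies in the final account.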

    When we make a prediction from evolutionary theory -- for example, the prediction of the age distribution of haplotypes in a population given the assumption of no selection -- then we must assert a model of demographic history. It used to be that you could simply assert a constant population size. But that's no longer any good for human evolution, since our population has obviously grown massively over time.

    If we want our predictions to relate to the real population history, then we ought to use as many signposts as we can find, so that we can constrain our models. For human demographic history, those signposts come from several sources, including the archaeological record, ethnographic comparisons, and increasingly genetic sampling. As I'm going to show, it's really not good enough to just pick numbers out of thin air. The reason is that there are many ways that your model can work against you unless you put in numbers as accurate as you can find.

    How not to handle exponential growth

    A simple exponential model has the benefit of simplicity. But if we don't choose our signposts carefully, a simple model will lead us badly wrong. Here, I'm going to examine the demographic simulations performed by Voight et al. (2006). I'm not picking on this paper in particular -- it actually stands out as a relatively good example of demographic modeling in genetics. This paper has been cited a lot of times, and it is valuable in part because of its detailed analysis of the power of detecting recent selection.

    Some of the power analyses were based on demographic models applied to the data from the Yoruba HapMap sample. Voight et al. (2006) considered only exponential growth models for the Yoruba (as opposed to the Asian and CEU HapMap samples, for which they also considered bottlenecks of various kinds). At the low end, the authors considered a model with no growth at all -- a constant effective population of 11,156 individuals. At the high end, they considered a model in which the population grew exponentially from an ancestral size of 10,018 individuals up to a current size of 1,910,000 individuals, with growth commencing 750 generations in the past. Other models were in between these extremes, although many had earlier onsets of population growth (up to 4000 generations ago). These values are reported in the online correction to the original article.

    At the outset, we can observe that these values are far too low, both for the ancestral and the current populations. The current population size of sub-Saharan Africa is on the order of 650 million individuals. This, of course, disproportionately represents the last few generations of rapid growth. But even in the year 1500, sub-Saharan Africa had a population on the order of 80 million people (Biraben 2003). The effective size of this population would be between 20 and 40 million. Of course, the Yoruba HapMap sample does not represent this population uniformly. The present population of Nigeria is 148 million, and the number of Yoruba within this population is approximately 30 million. Applying the same growth constant, we might estimate that this population had numbered around 5 million in the year 1500. But as we go back in time, we must encompass a wider cone of ancestry, as genes have flowed into the Yoruba from other populations. Hence, an effective 2 million individuals is certainly too small for the present population by a factor of five to ten, and plausibly too small for the population of 500 years ago by a smaller factor.

    The ancestral size is more seriously in error. Certainly, going back to 500,000 years ago or earlier, the long-term effective population size for humans really was on the order of 10,000 individuals. Since autosomal genes coalesce across that span or longer, we need to employ demographic models that incorporate this small ancestral size. However, we now know that this small size did not characterize any of the Late Pleistocene of Africa (as I discussed last month). Instead, the African population had reached an effective 38,000 individuals by 144,000 years ago, and grew after that time. So the initial size used by Voight et al. (2006) is small by a factor of more than four.

    But what matters much more is the combination of date and size. That's because the entire period matters to genetic variation, not merely the signposts.

    The models applied by Voight et al. (2006) may be fourfold too small at the beginning of the Late Pleistocene. But what does archaeology tell us about the African population in the early LSA, around 20,000 years ago, when Voight et al. (2006) suggest it had just begun to increase in numbers? Biraben (2003) puts the world population over 5 million individuals by that time. Taking this estimate, the sub-Saharan fraction of the global population at that time may have been substantial, more than a million individuals. That would mean that the Voight et al. (2006) estimate is perhaps only a thirtieth of the true value. Still, Atkinson et al. (2008), surveying mtDNA variation, found that the sub-Saharan population was apparently small compared to southern Asia around 20,000 years ago, with a sub-Saharan effective size less than 100,000 individuals. In that view, the Voight estimate is at least a tenth of the most accurate value.

    But what about the span from 10,000 to 5000 years ago -- the time range corresponding to the highest fraction of ascertained selection in their data? At the end of this time range, 5000 years ago, the best demographic estimates place the sub-Saharan African population around 6 million individuals, or perhaps 1.5 to 3 million effective individuals. The largest exponential growth model applied by Voight et al. (2006) predicts a continuous growth rate of 0.00028 per year during the last 750 generations. That would predict an effective size 5000 years ago of only 470,000 individuals -- perhaps a third to a sixth of the real value.
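
    The growth rate and the interpolated size can be reproduced from the model's endpoints. This sketch assumes 25 years per generation, which is my assumption; it is the value that matches the 0.00028 per year figure. The sizes and the 750 generations come from the largest model described above.

```python
import math

ANCESTRAL = 10_018          # effective size before growth began
CURRENT = 1_910_000         # effective size today
GENERATIONS = 750           # generations since growth began
YEARS_PER_GENERATION = 25   # assumption, chosen to match the rate quoted above

rate_per_generation = math.log(CURRENT / ANCESTRAL) / GENERATIONS
rate_per_year = rate_per_generation / YEARS_PER_GENERATION

def effective_size(years_ago):
    """Effective size under simple exponential growth between the two endpoints."""
    onset = GENERATIONS * YEARS_PER_GENERATION   # years ago that growth began
    if years_ago >= onset:
        return ANCESTRAL
    return ANCESTRAL * math.exp(rate_per_year * (onset - years_ago))

model_5000 = effective_size(5000)
print(f"rate per year: {rate_per_year:.5f}")
print(f"model effective size 5000 years ago: {model_5000:,.0f}")
for estimate in (1_500_000, 3_000_000):   # effective-size range from demographic estimates
    print(f"model is 1/{estimate / model_5000:.1f} of {estimate:,}")
```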

    In other words, the simulations conducted by Voight et al. (2006) have overestimated the power of genetic drift during the last 144,000 years, and most critically in the period around 20,000 to 5000 years ago. The problem is that the signposts are wrong: replace the demographic assumptions with better ones, and you bring them more into line with reality. In this case, the estimate of current effective size was wrong, but not unreasonably so -- it's possibly within a factor of two. But the early values are wrong by a factor of ten or more, and the errors are compounded by the use of the simple exponential growth model. Replacing the more recent interpolated values with real estimates taken from archaeological and ethnographic models would be more complicated, but would actually remove uncertainty in the model.

    What are the effects of these models on the results of the paper? Figure 4 in the corrected paper shows the comparison of the real Yoruba data to the simulated datasets. In all cases, the simulated datasets have less variation in the critical statistic than the real data, which indicates the presence of widespread selection within the real data. If we incorporated a more accurate demographic model, the variation within the simulated data should reduce yet more, because genetic drift should have been much weaker than in the simulations performed by Voight et al. (2006). This would increase the proportion of inferred selection represented by the data. Likewise, the power to detect selection should increase for lower-frequency selected alleles -- because of the smaller chance that a long haplotype would increase by genetic drift alone.

    Next: Bottlenecks
