john hawks weblog

paleoanthropology, genetics and evolution

modeling

  • Neandertal ancestry "Iced"

    Wed, 2012-08-15 15:24 -- John Hawks

    I've been mobbed with e-mails from readers asking about my reaction to the new paper by Anders Eriksson and Andrea Manica in PNAS, titled "Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins" [1]. The paper asserts that Neandertal similarity in the genomes of living people outside Africa can be explained only in terms of incomplete lineage sorting from the shared human-Neandertal common ancestral population in Africa. If the paper's assertions were accurate, we could go back to thinking that all the genetic heritage of people today traces back to Africa, although we would still need to abandon the idea that the African population had undergone a small bottleneck.

    I have not been posting as frequently the last month or two because I have been out of the country doing science.

    The new paper's press release has given rise to quite a lot of media attention, much of which unfortunately misrepresents our current knowledge of human and Neandertal genomes. Razib Khan summarized the situation on Monday, in a post titled, "Why you shouldn't publish in PNAS". I agree with his criticism, although I have a perspective coming out soon in PNAS. In fact, I suppose this episode shows why everyone should publish in PNAS, because so many journalists will just parrot press releases instead of asking relevant experts. Ewen Callaway did a great job on this story by putting it into the broader context ("Neandertal sex debate highlights benefits of pre-publication"). You will notice how no other science writers with any Neandertal knowledge picked up this press release...

    Paleoanthropology is a field where data are rare and precious, and we do a lot of arguing about the validity of models. I love arguing about the validity of models (Cliff Notes version: All models are wrong).

    Genomics is not such a field. We have abundant data today to compare with Neandertal genomes. Yet puzzlingly, the idea of Neandertal ancestry has been challenged by several papers that haven't performed any new empirical comparisons at all. I'm struggling to figure this out. We have an unparalleled ability to explore the genomes of humans and Neandertals, and we should believe a computer model with no empirical data?

    I've been assessing the Neandertal similarity of 1000 Genomes Project samples here on my blog (e.g., "Which population in the 1000 Genomes Project samples has the most Neandertal similarity?"). This is ongoing research here in my group, but we've been making it open because it tells us immediately that some hypotheses about Neandertal similarity must be wrong. Modeling is a lot of work. We're trying to avoid putting a lot of investment into modeling that will be easily refuted by the next piece of genomic data. Data are flowing now so rapidly that we can afford to be naive empiricists.

    For example, our comparisons quickly refute the hypothesis that Neandertal similarity comes only from ancient population structure in Africa. That hypothesis predicts much more heterogeneity within Africans in Neandertal similarity than exists today. We've shown that the heterogeneity in Africans is basically the same as within Europeans or Asians, and that the variance among African populations so far is quite small. Those are very simple observations, which are consistent with what Yang and colleagues [2] concluded on the basis of the frequency spectrum of Neandertal alleles in large samples of living people. Even though many Neandertal-shared SNP alleles came from incomplete lineage sorting, the signature of excess Neandertal sharing outside Africa must come mostly from recent introgression. In Ewen Callaway's article about this research, David Reich dismissed the new paper by Eriksson and Manica as "obsolete". I agree. The paper describes a model without carrying out any new empirical comparisons, and so has fallen behind where the science has gone.

    Another example is the proportion of Neandertal ancestry. Initially, the proportion of ancestry from Neandertals in living people was argued to be between 1 and 4 percent [3]. That was a model-based estimate that was the best possible under the assumption that Africans have no Neandertal ancestry. We now have a lot more human comparisons, which would make possible a more precise estimate of the mean. I hesitate to provide a new estimate, because we have shown that some Africans have substantial evidence of Neandertal similarity, which throws the baseline for any estimate into question. How much Neandertal ancestry is present in living people must depend on a more complex model of mixture among later populations. The result will still be small (probably less than 6 percent) but understanding this proportion will help us to evaluate when and where Neandertal genes flowed into our populations.

    Here's a third example. I haven't written about it here yet, but I have been lecturing about it quite widely over the past few months. Earlier this year, the genome of Ötzi the Tyrolean Iceman was reported by Andreas Keller and colleagues [4]. Aaron Sams and I downloaded the data and have been carrying out several different kinds of comparisons. A picture:

    [Figure: Ötzi 1000 Genomes Neandertal comparison]

    I'd like to see the model of African population structure that could explain this result...

    If you'll remember my earlier posts on the 1000 Genomes Project samples, this chart is a histogram of the number of shared Neandertal derived SNP alleles in different samples. The European and Asian samples are substantially greater than either African sample (here, Luhya and Yoruba colored differently). If we took as a baseline that Europeans have an average of 3.5 percent Neandertal, Ötzi would have around 5.5 percent (again, the actual percentage would be highly model-dependent). He has substantially greater sharing with Neandertals than any other recent person we have ever examined.

    As you can imagine, we have carried out just about every comparison we can think of that could explain this result as anything other than greater Neandertal ancestry. Aaron and I will be putting our manuscript on the arXiv as soon as we've both signed off on all the text and figures, hopefully this week. This is simple stuff, and I see no reason not to be open about it -- anybody with the Ötzi data can immediately do the same thing.

    We think that showing and sharing these comparisons will save people a lot of useless effort. Personally, I can't believe that these people spending effort on population models for Neandertals aren't talking to those of us who have already carried out these comparisons and have already presented them in public. I guess we'll find out if secrecy or openness leads to better science.

    Meanwhile, I can share the abstract of the conference paper I'll be presenting in September at the meeting of the European Society of Human Evolution in Bordeaux:

    Evaluating recent evolution, migration and Neandertal ancestry in the Tyrolean Iceman

    Paleogenetic evidence from Neandertals, the Neolithic and other eras has the potential to transform our knowledge of human population dynamics. Previous work has established the level of contribution of Neandertals to living human populations. Here, I consider data from the Tyrolean Iceman. The genome of this Neolithic-era individual shows a substantially higher degree of Neandertal ancestry than living Europeans. This comparison suggests that early Upper Paleolithic Europeans may have mixed with Neandertals to a greater degree than other modern human populations. I also use this genome to evaluate the pattern of selection in post-Neolithic Europeans. In large part, the evidence of selection from living people’s genetic data is confirmed by this specimen, but in some cases selection may be disproved by the Iceman’s genotypes. Neolithic-living human comparisons provide information about migration and diffusion of genes into Europe. I compare these data to the situation within Neandertals, and the transition of Neandertals to Upper Paleolithic populations – three demographic transitions in Europe that generated strong genetic disequilibria in successive populations.



  • Toba "cut down to size"

    Wed, 2010-12-01 15:29 -- John Hawks

    Thanks to a reader:

    Science last week carried a news article by Naomi Lubick, describing a new model for the climatic effects of the Toba volcanic eruption, around 74,000 years ago.

    The simulation revealed that Toba's impact was not as extreme as some scientists believed. Temperatures dipped only 3˚ to 5˚C across the globe, for example. The model also showed that the high concentrations of sulfur particles were short-lived; they settled out of the stratosphere—where they can have the largest cooling effect—within 2 to 3 years, the team reports online this month in Geophysical Research Letters. Extreme temperature changes in Africa and India lasted only a year or two, with a temperature decrease of at most 10˚C in the first year after the eruption, followed by 5˚C the second year. Overall, Toba didn't wipe out flora and fauna, Timmreck says, but it would have made life harder for a few years.

    The issue comes down to the assumptions they have to make when they scale up the measured effects of recent volcanic eruptions such as Mt. Pinatubo, Philippines. The new model is argued to be consistent with ice core data about atmospheric sulfate concentrations after the eruption.

    I think these climate models continue to shift too much to really interpret the importance for ancient human populations. A global reduction in temperature and biosphere productivity is not going to be happy times for most Pleistocene hunter-gatherers. But the kind of extreme, prolonged population contraction seems like it must require a rather more severe event, seriously forcing global climates out of their

    I've been a very consistent Toba skeptic, because a global catastrophic event in the Late Pleistocene really is not required to explain the present pattern of human genetic diversity. But with a little clever science, it might become possible to look for more temporary effects, or those limited to a few regions of the world. What's necessary is to bring the expectations into the same range of realistic alternatives.

    In that view, a more precise climate model that may show a shorter and smaller range of climate effects may be very useful.

  • More on chimpanzee population structure

    Wed, 2010-05-19 11:36 -- John Hawks

    A reader reminded me of a second paper on chimpanzee population structure, using a different Bayesian framework, which came out shortly after the study by Jody Hey I cited in my previous post ("Return of the Neanderchimps"). This paper, by Daniel Wegmann and Laurent Excoffier (2010), confirms many of Hey's findings but diverges from that paper's conclusions in some important respects. Here's most of the abstract:

    Here, we present a novel attempt at globally inferring the detailed evolution of the Pan genus based on approximate Bayesian computation, an approach preferentially applied to complex models where the likelihood cannot be computed analytically. Based on two microsatellite and DNA sequence data sets and adjusting simulated data for local levels of inbreeding and patterns of missing data, we find support for several new features of chimpanzee evolution as compared with previous studies based on smaller data sets and simpler evolutionary models. We find that the central chimpanzees are certainly the oldest population of all P. troglodytes subspecies and that the other two P. t. subspecies diverged from the central chimpanzees by founder events. We also find an older divergence time (1.6 million years [My]) between common chimpanzee and Bonobos than previous studies (0.9–1.3 My), but this divergence appears to have been very progressive with the maintenance of relatively high levels of gene flow between the ancestral chimpanzee population and the Bonobos. Finally, we could also confirm the existence of strong unidirectional gene flow from the western into the central chimpanzee. These results show that interesting and innovative features of chimpanzee history emerge when considering their whole evolutionary history in a single analysis, rather than relying on simpler models involving several comparisons of pairs of populations.

    The difference in timing of the chimpanzee-bonobo speciation is not very great, considering the difference in mode of speciation inferred. This study prefers an older speciation time with subsequent gene flow; Hey had arrived at a later speciation time (around a million years ago, compared to 1.6 million here). This difference is a semantic one coming from the model of speciation -- if there's meaningful gene flow after a "speciation", detectable in samples of 20 or fewer bonobos, it's hard to say that was really a "speciation". Restricted gene flow in an earlier, geographically structured population seems like an alternative way to describe the same model.

    Possibly, the higher value for speciation time found by Wegmann and Excoffier reflects the nonzero migration they infer between bonobos and eastern chimpanzees. It would be interesting to see the speciation time under the constraint of no migration; alternatively it would be interesting to strongly test the hypothesis of no migration itself. A demonstration of interbreeding in the wild between eastern chimpanzees and bonobos would be newsworthy.

    One relatively large difference between the two studies is in the time inferred for the establishment of the East African chimpanzee subspecies, P. t. schweinfurthii. Hey had inferred that this population was founded 93,000 years ago; Wegmann and Excoffier infer a much older origin, 440,000 years ago. In Wegmann and Excoffier's analysis, the East African subspecies is almost as old as the West African one.

    It's not obvious why the studies differ so greatly in this conclusion, when other conclusions are broadly equivalent (including the date of the west-central African population divergence, the asymmetrical pattern of gene flow from west into central African populations, and the much greater effective size of the central African population compared to the other two). It's possible that Hey's inclusion of genetic markers excluded by Wegmann and Excoffier makes the difference -- one or two recent shared markers might greatly increase the apparent likelihood of a recent population divergence. Or the additional parameters explored by Wegmann and Excoffier -- they allow the populations to have grown over time -- may have influenced this branch point. That assumption certainly seems to have influenced the effective size of central African chimpanzees, which Wegmann and Excoffier infer to have been more than four times higher than Hey's estimate (135,000 versus less than 30,000).

    All these estimates are scaled to generation time and mutation rate, and these differ between the two studies -- Wegmann and Excoffier assume that human and chimpanzee genes diverged 7 million years ago; Hey had assumed 6 million. That difference is not great, but reminds us that the present demographic results depend on a particular model of human and chimpanzee differences which is certainly in error to some extent.
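
    To get a rough feel for how much the calibration matters, here is a minimal Python sketch. It assumes, as a simplification, that a date estimate scales linearly with the assumed human-chimpanzee divergence time; the function name `rescale_time` is hypothetical, and neither paper necessarily scales this simply.

```python
def rescale_time(estimate_years, assumed_split_years, new_split_years):
    """Rescale a date estimate when the calibration split changes.

    Assumption (not from either paper): the estimate is directly
    proportional to the assumed human-chimpanzee divergence time.
    """
    return estimate_years * new_split_years / assumed_split_years


# The 1.6 My chimpanzee-bonobo estimate (7 My calibration),
# re-expressed under a 6 My calibration.
rescaled = rescale_time(1.6e6, 7e6, 6e6)
print(f"{rescaled / 1e6:.2f} My")
```

    Under that assumption the calibration choice moves the estimate by about a seventh, which by itself would not close the gap between 1.6 million years and a figure near one million.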

    Both sets of conclusions actually share most of their underlying data, although they depend on different assumptions and slightly different datasets. You can see how fast we will be able to make progress on chimpanzee population history given only a little bit more sampling. Using the chimpanzee genome to find a set of markers and genotyping them in 100 wild chimpanzees of each subspecies would provide the chimpanzee equivalent of the HapMap, and would be comparatively inexpensive. Still, I think some progress will depend on a better understanding of the pattern of human and chimpanzee (and gorilla) speciation.

    References:

    Gagneux P, Gonder MK, Goldberg TL, Morin PA. 2001. Gene flow in wild chimpanzee populations: what genetic data tell us about chimpanzee movement over time and space. Phil Trans R Soc Lond B 356:889-897.

    Goldberg TL, Ruvolo M. 1997. Molecular phylogenetics and historical biogeography of east African chimpanzees. Biol J Linn Soc 61:301-324.

    Hey J. 2010. The divergence of chimpanzee species and subspecies as revealed in multipopulation isolation-with-migration analyses. Mol Biol Evol 27:921-933. doi:10.1093/molbev/msp298

    McBrearty S, Jablonski NG. 2005. First fossil chimpanzee. Nature 437:105-108. doi:10.1038/nature04008

    Wegmann D, Excoffier L. 2010. Bayesian inference of the demographic history of chimpanzees. Mol Biol Evol 27:1425-1435. doi:10.1093/molbev/msq028

  • The problems of computer-aided biologists, 1

    Wed, 2010-03-17 18:53 -- John Hawks

    On the subject of modeling in genetics, John Timmer of Ars Technica has been running an excellent series on the challenges of computer models in biology. I'll devote a few words to some of these articles in the next several days.

    An article from earlier this winter, "Keeping computers from ending science's reproducibility," discusses the problems with replicability. Data from genomes and genotyping platforms go through frequent revisions, so that the same methods may lead to different results depending on the version of the dataset. Not replicable, in other words, and it may be very hard to track down exactly why slight differences in results persist. It's also hard to verify that the methods are working the same way when the same results aren't found -- it's not like the problem of significant digits in measurement, in other words.

    That problem is compounded when it comes to analytical methods:

    An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.

    "Getting it to work" is too often the major goal in human genetics, where in-house development of population history models is the norm. Rigorous validation of these models is beyond any single lab's purview; to be published, it is enough to cite prior art.

    The end of the article includes some reporting on possible solutions, including this:

    Even if we solve the legal and computational portions of the problem, however, we're going to run into issues with the fact that many of the people who use computational tools understand what they do, but don't feel compelled to learn the math behind them. That's where a paper in the latest edition of Science comes in. Its author, Jill Mesirov of the Broad Institute, describes how many biologists aren't well versed in computational analysis, but are increasingly reliant on tools created by those who are; she then goes on to describe one type of solution, called GenePattern, that she and her colleagues put together with the help of Microsoft Research.

    The idea is to "embed" the actual bioinformatic research methods into the paper, as one would embed a spreadsheet into a Word document. That way, anyone who reads the paper could just run an active version of the methods, to verify the results were accurate, and (potentially) play with the parameters.

    Not a bad idea for the toy example, but for simulations that take days or more to run, it isn't going to be practical. What we need is people to learn the math, not people to dumbly click buttons in a paper.

    The specific idea of an interactive workflow is implemented fairly well in the Galaxy bioinformatics platform. There are definite strengths to that approach -- most importantly, for simple operations it can be incredibly useful to have a running record of what you've done, so that you can get it again yourself. But an equivalent record can fairly easily be accomplished using Python, Perl or any other scripting language. A risk of an online system is that it runs into the versioning problem very quickly -- interactive downloads may bring inconsistent datasets that use different genome draft assemblies, for example.

    In any event, much pain can be circumvented with a little math, in many cases. We should make it a priority to get students a common-sense understanding of how genetic parameters relate to each other.

    UPDATE (2010-03-18): Another section of the article is worth discussion. Along the lines of my post from earlier this year regarding the importance of code sharing and transparency ("The bugs will out"), Timmer wrote:

    "You need the code to see what was done," [Victoria Stodden] told Ars. "The myriad computational steps taken to achieve the results are essentially unguessable—parameter settings, function invocation sequences—so the standard for revealing it needs to be raised to that of when the science was, say, lab-based experiment." This sort of openness is also in keeping with the scientific standards for sharing of more traditional materials and results. "It adheres to the scientific norm of transparency but also to the core practice of building on each other's work in scientific research," she said. But the same worries that apply to more traditional data sharing—researchers may have a competitor use that data to publish first—also apply here. In the slides from her talk, she notes that a survey she conducted of computational scientists indicates that many are concerned about attribution and the potential loss of publications in addition to legal issues. (The biggest worry is the effort involved to clean up and document existing code.)

    A lot of the code we use is really rather simple. The coalescent can be implemented in a few lines, and most common alterations of it can be handled with 10-line subroutines. A forward-time simulation can be done in a single line of Python, and again the common alterations don't take too much to implement.
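
    As an illustration of how little code this takes, here is a minimal sketch of both kinds of simulation using only the Python standard library. The function names and parameter values are hypothetical, and the forward-time version is spelled out over several lines for readability rather than squeezed into one. It assumes the standard neutral coalescent (time measured in units of 2N generations) and a neutral Wright-Fisher population.

```python
import random


def coalescent_tmrca(n, rng):
    """Time to the most recent common ancestor of n sampled lineages,
    in units of 2N generations, under the standard neutral coalescent."""
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2      # number of pairs that could coalesce
        t += rng.expovariate(rate)  # waiting time to the next coalescence
        k -= 1
    return t


def wright_fisher(p0, n_copies, generations, rng):
    """Forward-time neutral drift: resample n_copies gene copies each
    generation and return the final allele frequency."""
    p = p0
    for _ in range(generations):
        count = sum(rng.random() < p for _ in range(n_copies))
        p = count / n_copies
    return p


rng = random.Random(1)
reps = 20000
mean_tmrca = sum(coalescent_tmrca(10, rng) for _ in range(reps)) / reps
expected = 2 * (1 - 1 / 10)  # theoretical mean TMRCA for a sample of 10
print(mean_tmrca, expected)
print(wright_fisher(0.5, 200, 50, rng))
```

    Comparing the simulated mean against the textbook expectation 2(1 - 1/n) is the kind of quick sanity check that catches most bugs in code like this.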

    There are radically more complicated models in use, and we should direct more attention to making these human-readable, separating modular elements so that they can be run with different simulation engines, and making clear distinctions between functional code, parameters, and data. I've been doing this long enough to know how simple it can be to hard-wire your parameters into the code, undocumented, so that nobody can figure out what is going on but the author. That's not where you want to be.
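
    One way to keep code, parameters, and data apart is sketched below. This is only an illustration under my own assumptions: the names `DriftParams` and `simulate_drift` are hypothetical, and the model is a toy neutral drift simulation, not anything from a published analysis.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DriftParams:
    """Parameters: every tunable value lives here, documented in one place."""
    start_freq: float = 0.5
    n_copies: int = 100
    generations: int = 20
    seed: int = 42


def simulate_drift(params):
    """Functional code: reads parameters and returns data, with no
    constants hidden in the body."""
    rng = random.Random(params.seed)
    p = params.start_freq
    trajectory = [p]
    for _ in range(params.generations):
        count = sum(rng.random() < p for _ in range(params.n_copies))
        p = count / params.n_copies
        trajectory.append(p)
    return trajectory


# Data: the output is stored alongside the exact parameters that produced it.
params = DriftParams()
record = {"params": asdict(params), "trajectory": simulate_drift(params)}
print(json.dumps(record["params"]))
```

    Because the seed is itself a parameter, anyone holding the saved record can rerun the function and get the same trajectory back.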

  • Quote: Peter Turchin on the "bugbear" of randomness

    Sun, 2009-08-30 16:18 -- John Hawks

    I'll probably have some more material on quantitative analysis of dispersal in the next few days. Here's a quote from Peter Turchin (1998:17-18):

    Of course, we do not know that animals truly move at random, like flipping coins to decide whether to turn right or left. Each individual could be a perfect automaton, rigidly reacting to environmental cues and its internal states in accordance with some set of behavioral rules. However, even if this were true, we might still choose to model behavior of such animals stochastically, because we would not have the perfect knowledge of all the deterministic rules driving these animals. Even if we did, we might not want to include them all in our dispersal model, since such a model would have an enormous number of parameters and would require a very accurate representation of all environmental "micro-cues." The point is that randomness is a modeling convention. Because it is impractical, and not even helpful, to attempt to model individual movement deterministically, we use a more parsimonious probabilistic model.

    I'm pausing the quote to point out my boldface. It has become computationally feasible in the last few years to model enormously complicated scenarios with individuals acting pseudo-deterministically. The most popular use of such modeling is to try to constrain dispersal models by some geographic conditions, such as local habitat richness, rainfall, or altitude (see also, "One model, hold the extra parameters"). Of course, animals really do disperse in ways that depend on such geographic parameters. The question is whether any datasets are sufficient to test models involving so many parameters.

    This approach is aptly termed behavioral minimalism (Lima and Zollner 1996). In essence, we adopt a thermodynamic approach: the behavior of individuals is erratic, or irregular, but the redistribution process at the population level has many regular features. There is a direct analogy with thermodynamic theory. The motion of each gas molecule is chaotic and essentially unpredictable, and can only be described probabilistically. When dealing with large numbers of molecules, however, the laws at the aggregate level are for all intents and purposes deterministic. Similarly, the problem of biological dispersal can be treated by starting with a probabilistic description of individual movements (in other words, formulating the problem as a random walk), and then approximating the redistribution process of the ensemble of individuals with a deterministic equation, diffusion.
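
    The step from random walk to diffusion in the passage above can be seen in a toy simulation. This is a minimal sketch under my own assumptions (one dimension, unbiased unit steps, a hypothetical function name), not Turchin's own formulation. Each walker is unpredictable, but the mean squared displacement of the ensemble should grow almost exactly linearly with time, which is the signature of diffusion.

```python
import random


def mean_squared_displacement(n_walkers, n_steps, rng):
    """Mean squared displacement of unbiased +/-1 random walkers,
    recorded after each step and averaged over the ensemble."""
    positions = [0] * n_walkers
    msd = []
    for _ in range(n_steps):
        positions = [x + rng.choice((-1, 1)) for x in positions]
        msd.append(sum(x * x for x in positions) / n_walkers)
    return msd


rng = random.Random(0)
msd = mean_squared_displacement(5000, 100, rng)
for t in (10, 50, 100):
    # For unit steps, diffusion theory predicts MSD close to t.
    print(t, round(msd[t - 1], 1))
```

    With only a handful of walkers the same curve is noisy, which is a small-scale version of the point about population size and stochasticity.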

    The effective scale of stochastic versus deterministic processes is important. I'm chiefly interested in the dispersal of adaptive genes in human populations, for which the deterministic approximation may be considered to have become more and more relevant over time, as the population sizes of regional populations grew. Still, the present pattern in many cases may reflect the stochasticity of populations from earlier time periods, when they were smaller. And formerly important deterministic processes, such as the adoption of agriculture, may no longer be directly observable. So how do we model variance?

    The thermodynamic approach to dispersal does not have to assume that the movement of each "particle" is completely random. The important feature of this approach is that we can control the degree of realism in the model. Environmental factors that have strong effects on movement can be included explicitly in the model, while other factors that have weak effects (or about which we have no information) are included in the stochastic component.

    This would incorporate the geographic modeling approaches mentioned above -- deterministic processes related to spatial variance of habitat or dispersal potential. But then the important step must be to find a minimal deterministic model to account for the data, and then test it with other observations -- such as more extensive genetic sampling, archaeological information, or historical documentation.

    References:

    Turchin P. 1998. Quantitative Analysis of Movement. Sinauer, Sunderland MA.

  • People particles

    Wed, 2009-07-29 13:24 -- John Hawks

    Last week's Science included an article by Adrian Cho examining the way that social modelers use math to describe human behavior on a large scale ("Ourselves and our interactions: the ultimate physics problem?"). I'm sort of irritated at the way physics shows up in this. I mean, sure if -- for the purposes of a model -- we can treat people as interacting particles, then that shares a mathematical basis with (some kinds of) physics modeling.

    Behind it all lies the assumption that, at least within distinct types, people are like subatomic particles: basically the same. "We like to think that we are unique," says Alessandro Vespignani, a physicist at Indiana University, Bloomington, who works on networks. "But probably for 90% of our social interactions, we are not so unique."

    This isn't a very relevant criticism -- some models may assume that every individual is identical, but they need not do so. If there are well-characterized variations in behavior, a model can incorporate them directly. At some level this is what shopping centers do to predict the behavior of teenagers -- do you put the pink cell phones across from Hollister, or the blue ones?

    In any event, does that mean that every kind of mathematical model should be called "physics"? In practice, it seems to be people trained in physics who carry out this kind of work:

    Forays into "sociophysics" began in the early 1970s. Physicists proposed, for example, that individuals interact to form public opinion much as neighboring atoms make a crystal magnetic by aligning their magnetic fields; researchers analyzed the social phenomenon by adapting the Ising model used to describe such magnetic interactions. In the 1990s, many physicists turned to economics in the controversial subfield of econophysics (see sidebar, p. 408). Now, the movement seems to be gathering momentum, as complex-systems researchers have made solid contributions in the study of traffic, epidemiology, and economics. Some are now tackling more-daunting problems, such as the emergence of social norms.

    "The problems are more complicated than most natural scientists assume, but less hopeless than most social scientists think," says Dirk Helbing, a physicist-turned-sociologist at the Swiss Federal Institute of Technology Zürich (ETHZ).

    Sadly many traditional disciplines are safe harbors for the math-impaired. Disciplinary fence-building happens for understandable reasons -- not least, that "interdisciplinary initiatives" often cover administrative efforts to cut faculty or increase courseloads. The route to useful new mathematical models may be easier through cross-disciplinary institutes of various kinds, but even these are often subject to a kind of tunnel vision -- the founders of institutes have pretty specific ideas of what they value.

    Is there a future in particle models of humans, from an anthropology perspective? There's no doubt in my mind -- several of the high-ranking anthropologists and primatologists I know are deeply interested in network effects, hub/spoke models, and phase transitions. My only hesitation is that the models are being driven mainly by consistency. Models can produce outcomes that look like real social systems, and people who don't dig into the mathematical details can find this consistency very convincing. But consistency is not enough; untested models may be simpler, more realistic, or consistent with broader observations. So we need more people familiar with social systems to dig into the details of these models.

    References:

    Cho A. 2009. Ourselves and our interactions: the ultimate physics problem? Science 325:406-408. doi:10.1126/science.325_406
