john hawks weblog

paleoanthropology, genetics and evolution

recent selection

  • My review of "Paleofantasy"

    Thu, 2013-03-14 16:22 -- John Hawks

    I have a review of Marlene Zuk's new book, Paleofantasy, in this week's Nature: "Evolutionary biology: Twisting the tale of human evolution" [1].

    I can't replicate my review here, but for people who have access to Nature I thought I'd bring attention to it. And if you don't have access, I wanted to share a couple of my reactions.

    It was a fun book for me to read. Zuk brings a light-hearted skepticism to a broad array of topics in human evolution. She took as her focus a collection of "paleo-advice" ideas: barefoot running, paleo diet, back-to-nature parenting advice. She then added some uncritically-accepted scientific notions about our evolution, such as the idea that agriculture was "the worst invention ever devised". To each of these topics, she brings an array of recent science questioning or disproving the assumptions. The result is not to debunk ideas, but to give a fuller (and more nuanced) perspective on how much we know (and don't know) about our evolution.

    The serious issue underlying all these topics, which Zuk recognizes, is the difficulty of reconstructing Pleistocene environments. Some hypotheses assume a fairly detailed model of ancient environments -- the so-called "environment of evolutionary adaptedness". But ancient humans lived in an array of environments, more different from each other in many ways than different parts of today's globalized world are. We are unquestionably living in environments no ancient humans knew, in population size, density, disease, lifespan, and many other ways. But in other ways, our difference from some ancient people is trivial compared to their diversity. Are we well-adapted to live in cities? Perhaps not in some ways, but maybe in others.

    Probably the best part of my review to share is the end:

    As an anthropologist, I observe that Zuk's use of the term 'fantasy' is just an emphatic way of describing the hypothesis-forming that is essential to evolutionary science. We play with hypotheses, explore their predictions and try very hard to falsify them. So it is, in a way, unremarkable that so many hypotheses proposed by anthropologists about ancient environments now seem to be wrong — and, in a few cases, even ridiculous.

    It means that science is working. Genomics, high-resolution climate records, and microscopic and isotopic evidence have changed our understanding of what the past has to offer. With that in mind, let the next round of palaeofantasies begin.

    Zuk's "very brief" overview of human evolution is a lot shorter than in other recent books on the topic. I found this to be a merciful change -- how many times do I really need to read about the Australopithecus-to-humans timeline? Readers who don't already know the basic timeline are unlikely to pick up the book, I would guess. Still, if you're looking for the "latest news" about early humans, this book is not directed that way. Where it excels is its coverage of recent evolutionary changes and the shifts in Holocene environments and genetics.

    The book is not without its weak points. Without quite enough of the "paleo-advice" topics to carry the whole story, there were some real differences in tone across the chapters, with some a bit drier than others.

    People coming to this book for "the right answer" about ancient environments are not going to find it. There is no right answer, at least not a scientific one, for many of the topics covered here. Zuk has done well to talk to a range of scientists, covering these different aspects of our evolutionary history, and discuss the reasons for their disagreement.

    I wish scientists would do that for themselves more often!



    Synopsis: 
    A new book by Marlene Zuk challenges some paleo advice mongers.
  • Tracing teeth troubles with fossil bacteria

    Sun, 2013-02-17 19:36 -- John Hawks

    Ed Yong has a great account today of some research from Alan Cooper's lab on the oral microbiome in pre-agricultural and post-agricultural Europeans: "Prehistoric Plaque and the Gentrification of Europe’s Mouth".

    The hunter-gatherers had a diverse array of bacteria including several groups that are associated with good health. That fits with the relative absence of tooth decay or gum disease among modern or prehistoric hunter-gatherers. “They were at the end of a long period of happy co-evolution between us and oral bacteria,” says Cooper.

    The advent of farming disrupted that tango. After the Agricultural Revolution, as humans began to chow down upon barley, wheat and other domesticated crops, the diversity of the mouth microbes fell, and species associated with oral diseases became more common. “Eating all this soft squishy carbohydrate and leaving it lying around the base of your teeth is effectively inviting in a whole new range of bugs to take up permanent residence in your mouth,” says Cooper.

    I'll have some more comments on this new research when I can sit down to write them up. I've been waiting for this to come out for quite a long time -- I first heard about the research almost three years ago. The potential to characterize oral ecology across time is immense, and we have some excellent data on dental pathologies across the entire timespan. Caries and other dental pathologies are very new in human populations, and although starchy diets have been blamed, very little has been known about how oral bacteria themselves may have become more pathogenic over time. This study is really great because it opens a new door to looking at this evolution across time. We will need to compare this record with the evidence for morphological change in teeth across the same time span. Smaller teeth may have been a consequence of selection associated with dental pathology in agricultural peoples.

    Next we will need to compare across space -- including greater sampling of oral microbiome variation among living humans. This is another new area in which we know more about prehistoric people than we do about living human variation!

  • Selection is for the dogs

    Wed, 2013-01-23 16:17 -- John Hawks

    I was really pleased to see the new paper by Erik Axelsson and colleagues [1] on the pattern of recent selection on domesticated dogs. As we began working on recent selection in humans, we expected that domesticated animals might exhibit similar patterns genome-wide. They are among the organisms most similar to humans in demography and ecological change: Domesticated animals have all undergone rapid shifts in diet, predator ecology and social dynamics after domestication, at the same time that they have experienced rapid increases in population size. That is a recipe for rapid adaptive evolution.

    The paper shows that dogs, like humans, were strongly selected for a new agricultural diet. Just as in humans who descend from early agriculturalists, dogs have extensive duplication of the amylase gene. Humans express amylase in saliva but, as the paper explains, dogs produce amylase only in the pancreas, and it digests starches in the intestine. The paper gets really exciting where the authors investigate the entire metabolic pathway underlying starch digestion. The amylase gene AMY2B underwent duplications similar to those in humans, which are not found in wolves. Two other genes that interact in starch digestion and glucose uptake did not undergo duplication but do show near-fixed haplotypes in dogs that are absent or very rare in wolves. Using both biochemistry and phylogenetic comparison with herbivores and omnivores, the paper shows that the dog versions of these genes increase enzymatic activity on starches and glucose uptake.

    In conclusion, we have presented evidence that dog domestication was accompanied by selection at three genes with key roles in starch digestion: AMY2B, MGAM and SGLT1. Our results show that adaptations that allowed the early ancestors of modern dogs to thrive on a diet rich in starch, relative to the carnivorous diet of wolves, constituted a crucial step in early dog domestication. This may suggest that a change of ecological niche could have been the driving force behind the domestication process, and that scavenging in waste dumps near the increasingly common human settlements during the dawn of the agricultural revolution may have constituted this new niche (ref. 6). In light of previous results describing the timing and location of dog domestication, our findings may suggest that the development of agriculture catalysed the domestication of dogs.

    So for those of you wondering why we feed dogs kibble instead of raw beef, here's the reason.

    After finding candidate regions for selection across the genome, the authors ran a gene ontology analysis to see whether functional gene loci in these regions fall into any consistent categories. Along with the metabolic and digestive genes, they found:

    The most conspicuous cluster (11 terms) relates to the term ‘nervous system development’. The eight genes belonging to this category (Supplementary Tables 7 and 8) include MBP, VWC2, SMO, TLX3, CYFIP1 and SH3GL2, of which several affect developmental signalling and synaptic strength and plasticity. We surveyed published literature and identified 11 additional CDR genes with central nervous system function (Supplementary Table 9), adding to a total of 19 CDRs that contain brain genes. These findings support the hypothesis that selection for altered behaviour was important during dog domestication and that mutations affecting developmental genes may underlie these changes (ref. 7).

    That is a similar story to humans. We don't know what such genes might do, and unraveling what difference these genes may have made to behavior will take a lot of additional understanding of developmental biology. Much easier to work out what is going on when you can examine the biochemistry in vitro as with starch enzymes.

    The paper also makes clear why finding evidence of selection can be a difficult empirical problem at the moment:

    Uniquely placed sequence reads from pooled DNA representing 12 wolves of worldwide distribution and 60 dogs from 14 diverse breeds (Supplementary Table 1) covered 91.6% and 94.6%, respectively, of the 2,385 megabases (Mb) of autosomal sequence in the CanFam 2.0 genome assembly (ref. 11). The aligned coverage depth was 29.8× for all dog pools combined and 6.2× for the single wolf pool (Supplementary Table 1 and Supplementary Fig. 1). We identified 3,786,655 putative single nucleotide polymorphisms (SNPs) in the combined dog and wolf data, 1,770,909 (46.8%) of which were only segregating in the dog pools, whereas 140,818 (3.7%) were private to wolves (Supplementary Table 2). Similarly we detected 506,148 short indels and 26,619 copy-number variations (CNVs) (Supplementary Files 1 and 2). We were able to experimentally validate 113 out of 114 tested SNPs (Supplementary Table 3 and Supplementary Discussion, section 1).

    If that sounds confusing, that's because it is confusing. Right now whole-genome sequencing is not yet routine, and whole-exome sequencing is not routine for creatures other than people. So maximizing the available data means working with partial genomes at varying levels of coverage, often accumulated for other purposes by other research groups using different sequencing platforms. Verifying sequence differences is not trivial. Generating a sample of gene sequences from many individuals is challenging, particularly as different individuals may be covered or not for different parts of their genomes.

    Studying selection requires a fairly large sample of genomes. This paper establishes evidence of selection on a few things in which domesticated dogs are mostly the same, and all are different from wolves. In other words, these are "complete sweeps" or "near-complete sweeps", in which a new genetic variant has become mostly fixed within the domesticated dog sample. A larger sample of dogs would be able to test selection with a broader range of strength and initial date, including "partial sweeps" and selection on standing variation that may have already existed in ancestral wolves before being subject to selection in domesticated dogs. So this paper opens a new area of inquiry on the causes of domestication without ruling out that we will discover much, much more about the history of selection in dogs.

    One really cool possibility is that we will uncover convergent or parallel patterns of selection in dogs with different geographic origins. Already we know that body size and pigmentation have been subject to selection in different dog breeds, and that single genes transferred across breeds have been important parts of that process. There are a few cases in humans where the extensive geographic dispersal of a single adaptive variant can explain the present distribution of a trait. But in many more cases, different human groups have attained traits by parallel selection on different genetic variants. Because humans control the breeding of dogs and traded dogs across long distances in historic times, we may find that dogs are much less affected by parallelism and much more by long-distance gene flow than humans. But we won't know until we put that hypothesis to the test.



    Synopsis: 
    A paper finds evidence of recent selection on starch digestion in dog domestication.
  • Quote: Lederberg on Haldane

    Sun, 2012-09-30 00:16 -- John Hawks

    J. B. S. Haldane has typically been assigned credit for the first suggestion that human hemoglobinopathies are adaptations to malaria. In 1999, Joshua Lederberg examined the history of this question [1].

    Haldane's most often remembered attribution, to malaria, oddly enough does not appear at all in the formal article but in the discussion footnotes. Therein, Montalenti acknowledges a verbal communication from Haldane suggesting that thalassemia heterozygotes may be more resistant to malaria. In his rejoinder, Haldane goes on to suggest that “microcythemic heterozygotes may be at an advantage on diets deficient in iron or other substances, thus leading to anemia” (HALDANE 1949, p. 76). This has been widely viewed as an anticipation of much later research on heterozygote advantage of blood dyscrasias in relation to malaria (footnote 1).

    In this regard, the work of A. C. ALLISON (1954) is well known. However, he remarks (private e-mail communication, April 26, 1999):

    At the time of publication of my finding that sickle-cell heterozygotes have some protection against malaria (1954), I was unaware that J. B. S. Haldane had made a similar suggestion for thalassemia. After my publication I was invited to make a presentation at University College, London, and we had a friendly discussion. Haldane said that he had recognized that heterozygotes for the thalassemia gene are likely to have some advantage to counter-balance selection against homozygotes and suggested several possible candidates, among them malaria and better absorption of iron. He added that to speculate about the problem was one thing and to provide experimental evidence for a solution was altogether another. This was the first evidence that natural selection operates in humans.

    Meanwhile, Allison himself [2] cited the earlier work of Beet, who showed in 1946 and 1947 that the blood of East African peoples with the sickle-cell trait carried a lower incidence of malaria parasites than the blood of normal individuals [3][4]. Beet's articles are of interest because they precede Haldane's oblique suggestion about the adaptive value of thalassemia. Still, the relatively weak observation that sickle-cell individuals have a slightly lower incidence of parasites was not a sufficient proof that the sickle-cell trait actually protected its carriers.

    Allison demonstrated the connection between sickle-cell and malaria resistance in two ways. He undertook an epidemiological survey among children, showing a very strong statistical association between non-sicklers and parasites in the blood. Then, he performed an experiment in which 15 sickle-cell trait and 15 normal individuals were injected with the malaria parasites in a controlled way. These two groups were starkly different in their parasite response: only two of the sickle-cell trait individuals showed any parasites at all, and then at low blood counts, while 14 out of 15 of the normal individuals had parasite infections. His article is notable not only for this clear demonstration, but also for its direct discussion of the other major arguments in favor of the malaria resistance hypothesis, including a close examination of the geographic distribution of the sickle-cell trait in relation to endemic malaria, and the rejection of the alternative hypothesis of a high mutation rate. This paragraph is exceptionally clear:

    The main problem can be stated briefly: how can the sickle-cell gene be maintained at such a high frequency among so many peoples in spite of the constant elimination of these genes through deaths from the anaemia? Since most sickle-cell anaemia subjects are homozygotes, the failure of each one to reproduce usually means the loss of two sickle-cell genes in every generation. It can be estimated that for the lost genes to be replaced by recurrent mutation so as to leave a balanced state, assuming that the sickle-cell trait -- that is, the heterozygous condition -- is neutral from the point of view of natural selection, it would be necessary to have a mutation rate on the order of 10⁻¹. This is about 3,000 times greater than naturally occurring mutation rates calculated for man and, with rare exceptions, in many other animals.
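
    The order of magnitude in that passage can be checked with a rough sketch. It uses the textbook mutation-selection balance approximation for a fully recessive allele (mutation rate roughly equal to s times q squared), which may not be the exact calculation Allison made. The 40% carrier frequency and the function names are assumptions chosen for illustration, not figures taken from the quoted text.

```python
import math


def allele_freq_from_carrier_freq(carrier_freq: float) -> float:
    """Solve 2*q*(1-q) = carrier_freq for the rarer allele frequency q.

    Assumes Hardy-Weinberg proportions, so heterozygotes occur at 2*q*(1-q).
    """
    disc = 1.0 - 2.0 * carrier_freq
    if disc < 0:
        raise ValueError("carrier frequency cannot exceed 0.5 under Hardy-Weinberg")
    return (1.0 - math.sqrt(disc)) / 2.0


def required_mutation_rate(q: float, s: float = 1.0) -> float:
    """Mutation rate needed to replace alleles lost in homozygotes.

    Textbook approximation for a recessive allele with neutral heterozygotes:
    homozygotes occur at frequency q**2 and lose fitness s, so mu ~ s * q**2.
    """
    return s * q * q


# Hypothetical input: 40% of people carry the sickle-cell trait.
q = allele_freq_from_carrier_freq(0.40)
mu = required_mutation_rate(q)           # homozygotes assumed not to reproduce (s = 1)
implied_background_rate = 1e-1 / 3000    # the quoted rate divided by the quoted factor

print(f"allele frequency q       = {q:.3f}")
print(f"required mutation rate   = {mu:.3f}")
print(f"implied background rate  = {implied_background_rate:.1e}")
```

    With these assumed inputs the required rate comes out near 0.08, the same order of magnitude as the quoted figure, and dividing the quoted rate by 3,000 implies a background mutation rate of about 3 in 100,000 per gene per generation.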

    Interesting: the mechanism by which the sickle-cell trait deters the parasites is even today not fully understood.



  • When genes break: validating loss-of-function variants

    Fri, 2012-02-17 12:20 -- John Hawks

    Daniel MacArthur and colleagues have an important paper in Science, titled "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes" [1]. They took 1000 Genomes Project pilot data and systematically looked at every allelic variant in the sample that appeared to cause the loss of function of a protein-coding gene. Mutations that de-activate genes in this way are not rare, but they are often eliminated from the population rapidly by purifying natural selection, because the normal function of a protein is necessary to survival or reproduction. However, not every protein is so important, and MacArthur and colleagues confirmed that more than 1200 alleles in this sample genuinely occur in one or more of the 1000 Genomes Project individuals.

    Some of these are common but most occur in fewer than 2% of individuals in the sample, as expected if purifying selection were affecting many of them.

    MacArthur is one of the authors of the Genomes Unzipped group blog, and has written a great summary and introduction to his research paper: "All genomes are dysfunctional: broken genes in healthy individuals". It's free and well-written, so it will probably work better for many readers than the original paper.

    Science is running a commentary to accompany the research article, by Lluis Quintana-Murci [2]. This paragraph encompasses a lot of the numerical facts about these loss-of-function variations, and discusses the idea that some of them were positively selected -- that is, advantageous in recent human populations.

    MacArthur et al. estimated that, depending on ethnic background, each individual's genome carries 26 to 37 variants that introduce a stop codon (which signals the termination of translation of nucleic acids into protein), with up to 6 present in the homozygous state. When considering other types of LoF variants, including those that disrupt splice-sites, large deletions, or insertions or deletions of nucleotides that change the DNA reading frame, the total number per individual is extended to 103 to 121, with ∼20 present in homozygosity. A large proportion of LoF variants were enriched in low-frequency alleles, suggesting that the removal of deleterious alleles has prevented them from increasing to high frequencies. Furthermore, some have already been associated with severe human diseases, supporting the less-is-less hypothesis. Other LoF variants, which can reach higher population frequencies, fall into poorly evolutionarily conserved genes or belong to multigene families displaying high paralogous sequence identity. This suggests that the functions of the corresponding genes are highly redundant, explaining their greater tolerance for LoF variants and supporting a less-is-nothing scenario. Also, although no substantial enrichment in positive selection signals was observed among LoF variants at the genomewide level, 20 of them fell into regions displaying signatures of positive selection, as predicted by the less-is-more hypothesis, suggesting that they may have conferred a selective advantage in human evolution.

    Common loss-of-function variants that are evolutionarily recent are very interesting to us as we work to understand the changes that accompanied modern human origins and the later invention and spread of agriculture. I am really excited that these analyses were carried out using the 1000 Genomes samples because that means we can use the sequence data to estimate the ages of these functional losses. We can do quite a lot better than to say that they "fall into regions displaying signatures of positive selection": In fact, we can determine whether these variants themselves were selected, or hitchhiked to high frequency along with some other variant that was selected.

    Many of the loss-of-function variants are in genes that may not matter much to selection. Olfactory receptor genes, for example, comprise a very large family with recurrent duplications and pseudogenizations during primate evolution. We have scores of olfactory receptor pseudogenes, many of which are polymorphic in living human populations. Some may continue to make a noticeable difference to the phenotype, such as the asparagus-urine-smelling polymorphism. But many are probably invisible to us. Still, a few of these do look like they've been positively selected in recent human populations.

    Sometimes less really is more.



    Synopsis: 
    A "punishing" resequencing project validates mutations in the 1000 Genomes Project individuals that deactivate protein-coding genes.
  • Copy number variation in 1000 Genomes

    Sat, 2010-10-30 13:01 -- John Hawks

    When I wrote earlier in the week about the 1000 Genomes Project results, I mentioned that a second paper was being published in Science. That paper, by Peter Sudmant and colleagues [1], works to quantify the amount of copy number variation of genes in the genomes of the study participants.

    It can be challenging to study copy number variation using shotgun sequencing methods, because each duplicated part of the genome creates multiple alignment targets for short reads. One way to deal with this problem is to use the drawbacks of shotgun sequencing as an advantage: Look for template regions of the genome that have much higher read depth than others. These places include many where a gene has been duplicated in the target genome, giving one-and-a-half or twice the number of reads for each duplication. Looking at read depth genome-wide is a quick way to assess copy number variation at sites where it was previously unknown. Once these are ascertained in a sample of genomes, they can be targeted for further study, including characterizing the boundaries of the duplicate region.
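
    The read-depth idea can be sketched in a few lines. This is a minimal illustration, not the pipeline used by Sudmant and colleagues: it assumes reads have already been counted in fixed windows, that most windows are at the normal diploid state so the median depth stands for two copies, and it ignores GC bias and mapping ambiguity. The function name and the depth values are made up for the example.

```python
from statistics import median


def estimate_copy_number(window_depths, ploidy=2):
    """Estimate an integer copy number for each window from its read depth.

    The median depth across windows is taken as the depth of a normal
    region carrying `ploidy` copies; every other window is scaled to it.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Hypothetical read depths for eight windows along a chromosome.
depths = [30, 31, 29, 45, 60, 30, 28, 15]
print(estimate_copy_number(depths))  # [2, 2, 2, 3, 4, 2, 2, 1]
```

    In this toy input the window at one-and-a-half times the median depth is called as three copies, the window at twice the median as four, and the window at half the median as a single-copy deletion.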

    The paper describes this methodology in some detail, with various embellishments to get more precise answers to certain kinds of structural questions. They developed a large set of SNPs that differentiate paralogous gene copies, among other things allowing them to examine which members of various gene families had been duplicated, and whether events were shared between populations.

    Through our analysis, we identified that duplicated regions are more likely to be stratified between human populations when compared with copy number variation within unique regions of the genome. For example, 59 (92%) of the top 64 stratified gene families overlap segmental duplications (P –16). Remarkably, many of these highly polymorphic genes map to duplications that promote recurrent rearrangements associated with intellectual disability, autism, schizophrenia and epilepsy. We hypothesize that the extreme polymorphism may contribute to genomic instability associated with disease and may predispose certain populations to different chromosomal rearrangements (30).

    Segmental duplications can be relatively effective ways to change the amount of gene product without changing the gene product. In other words, a duplication can increase the dosage of a particular gene product. That can sometimes be very useful. For example, salivary amylase production varies among people due to the number of duplicate copies of the gene [2]. The copy number variation is related to population history of agricultural subsistence -- old agricultural populations have more amylase copies. It's a simple case where the dietary ecology favors a dosage increase for an enzyme.

    Gene duplications and other structural changes to the genome are rare events -- any particular kind of change is substantially less likely than a single nucleotide mutation at a given point in the genome. So it is of some interest to consider which regions are actually invariant in copy number -- duplications that occurred on the human lineage but have been conserved in more recent populations -- because these may reflect old adaptations essential to the evolution of hominins. Here's what the paper concludes:

    We have also defined the ~49% of gene duplicates that are largely invariant in copy among humans. Although this is based only on an assessment of 159 genomes from select populations, the fact that this fraction of genes remains copy number invariant in a milieu of recurrent unequal crossover suggests functional importance. Among these, we find a number of genes involved in neurological development and disease. We note that many of these duplicated genes are themselves incomplete and may represent nonprocessed pseudogenes, which may modulate the expression of the ancestral gene. The characterization of the most recently duplicated genes should facilitate identification of those that acquired new functions (neofunctionalization) versus those that have become pseudogenes or have partitioned their function among duplicate copies (31).

    I was going to write that there's not much analysis in the paper and let it go at that. But the paper has a 108-page supplement.

    I know I write this like once a week, but what the heck is the point of a 4-page paper with a 108-page supplement? Granted, 7 of the supplement pages are the author list (!!), but I view the whole thing mainly as a rip-off for the people who did the analyses in the supplement. Why don't they get their own first-authored publications? Are other journals satisfied to accept first-authored versions of analyses that have already been in a supplement in Science?

    The supplement lists 64 gene families including segmental duplications that differ substantially in average copy number among the CEU, YRI and CHB/JPT samples to which the low-coverage whole-genome sequencing has been applied thus far. The table (S8) lists the mean copy number in the three populations and the total variance in copy number; the key statistic is a value called Vst, which is analogous to FST for length variations.
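
    For readers who want to see how such a statistic behaves, here is a minimal sketch. It follows the definition commonly used for copy number data, Vst = (Vt - Vs) / Vt, where Vt is the variance across all individuals and Vs is the within-population variance averaged with weights proportional to sample size. Whether the supplement computes it in exactly this way (for example on raw copy numbers rather than log ratios) is an assumption, and the copy numbers below are invented.

```python
from statistics import pvariance


def vst(pop_copy_numbers):
    """Compute Vst from a dict mapping population name -> list of copy numbers.

    Vt is the variance over all individuals pooled together; Vs is the
    size-weighted average of the variances within each population.
    """
    all_values = [x for values in pop_copy_numbers.values() for x in values]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0  # no variation at all, so no differentiation
    n_total = len(all_values)
    v_within = sum(len(v) * pvariance(v) for v in pop_copy_numbers.values()) / n_total
    return (v_total - v_within) / v_total


# Invented copy numbers for one gene family in three samples.
example = {
    "CEU": [2, 2, 3, 2, 3],
    "YRI": [5, 6, 5, 6, 6],
    "CHB/JPT": [2, 3, 2, 2, 2],
}
print(round(vst(example), 3))
```

    Like FST, the value runs from 0 (copy number varies as much within populations as across the whole sample) to 1 (each population is fixed for a different copy number).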

    These are not generally duplications of whole genes, and their boundaries don't generally correspond to the boundaries of coding regions or exons. Without further analysis, it is not clear which of these duplicated regions may have functional import. Many of the additional copies may be inactive, either because of pseudogenization or because the duplication may not include the promoter/enhancer elements needed for gene expression. Some of the duplications occur in regions with known pseudogenes. The "involvement" of some genes in these regions with neurological development and disease is interesting, but the paper attempts no statistical assessment of this. It's a list of candidates, with some interesting ones that are obviously worth further examination, but without a clear story for any of them.

    It is maybe interesting that salivary amylase didn't make the list. It's not clear from the supplement whether that is an omission or whether its population differentiation, great as it is, is not as high as the lower cutoff. The greatest differentiation for amylase copy number is between populations that are not yet represented in the 1000 Genomes whole-genome sequencing.

    That raises an interesting question: What if we applied the same methods to the read data from some of the other public genomes? The Bushman genomes from earlier this year are an especially interesting sample because they are notably not drawn from a long-time agricultural population. In which areas would they score atypical copy number variation compared to the 1000 Genomes samples?



  • Now for anthropological genomics

    Wed, 2010-10-27 15:30 -- John Hawks

    The first of the papers describing results from the 1000 Genomes project has been released today in Nature [1].

    This is "big project" genomics news. Like many announcements of this kind, it represents more of a public relations milestone than an actual scientific advance. Some of the project data have been publicly available for a while -- the 1000 Genomes and HapMap projects have to their great benefit been based on the strategy of immediate data release. The new paper and its supplements include many summary statistics and report on new genetic variants that have been found -- there's a lot of information here. But most of the interesting science is just getting started. A paper like this really represents the opening of a race to use the new data for innovative research.

    Here in my lab, we are exploring the ways that whole genome sequencing can change our study of human population history. A large part of this is our work on recent selection, of course ("Why human evolution accelerated"). Whole-genome sequencing is not essential to finding many recently selected regions of the genome, but it will help enormously in narrowing down the actual functional changes that affected fitness in past populations.

    Whole-genome sequencing will rapidly improve our ability to resolve the population history of Pleistocene humans. For older events -- going back to the origins of Homo -- whole-genome sequencing will give us samples of genealogies from across the genome. We will be able to resolve some very ancient episodes of population mixture, and we have a chance of testing what kinds of events accompanied the rise of our genus. Even for events of the Late Pleistocene and Holocene, for which haplotypes of SNP markers can be useful without resequencing, whole-genome sequencing can be tremendously valuable. Reconstructing haplotypes from diploid genotypes requires us to make some assumptions about the demography of the population, which may be exactly what we are trying to discover. A sample of genomes sequenced at high read coverage will free us from some of those assumptions. It's really exciting stuff for an anthropologist.

    All those are reasons why the data will be useful for us in the long term. But at the moment, the data are not nearly so rich. The current paper reports:

    1. Whole-genome sequencing at 42x coverage of six individuals: one three-person family trio from Utah and one family trio from Nigeria.

    2. Low-coverage (2x-6x) sequencing of 59 Yoruba, 60 Utah residents, 30 Chinese and 30 Japanese individuals. These are a subsample of the original HapMap samples.

    3. Sequencing at 50x coverage of 8140 exons in 697 individuals. These are a subset of the HapMap v.3 population samples, including Yoruba, Luhya, Utah, Tuscan, Japanese and Chinese samples. These exons come from 906 genes targeted "randomly".

    It's pretty far from a thousand genomes, and even farther from the stated goal of 2400 genomes. The low-coverage genomes are not sufficient to call genotypes across most of the genome. This is a persistent problem with "whole-genome" sequencing projects so far. A person's whole genome is mostly diploid -- two copies of most everything. Recently, we've seen several "whole-genome" sequences where each base is given a consensus value. SNP variants may be called against other people's genomes, but rarely is there sufficient coverage to call SNPs within the individual. There are exceptions -- a handful of public whole genomes are at high coverage. The exon sequencing here should be enough to call SNPs in these functional regions with great confidence. The family trios also should have enough to call SNPs. So some of these will be our first chance to do actual population genetics on diploid genome-wide sequence data.
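
    To see why 2x-6x coverage falls short for calling genotypes within an individual, here is a minimal sketch. It assumes exactly `depth` error-free reads at a heterozygous site, each drawn from either chromosome copy with probability 1/2; real data have sequencing errors and variable depth, so this is optimistic. The function name and the `min_reads` threshold are illustrative, not anything from the paper.

```python
from math import comb

def prob_both_alleles_seen(depth: int, min_reads: int = 1) -> float:
    """Chance that each allele at a heterozygous site is seen >= min_reads times.

    Assumes exactly `depth` error-free reads, each sampled from either
    chromosome copy with probability 1/2 (a deliberate simplification).
    """
    # Count the ways the first allele can appear k times while leaving
    # at least min_reads reads for the second allele.
    ok = sum(comb(depth, k) for k in range(min_reads, depth - min_reads + 1))
    return ok / 2 ** depth

for depth in (2, 4, 6, 42):
    print(depth, round(prob_both_alleles_seen(depth, min_reads=2), 3))
```

    Under these assumptions, four reads show each allele at least twice only 37.5% of the time, while 42 reads almost never miss a heterozygote. That is the gap between the low-coverage samples and the trios.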

    One important piece of analysis in the paper is the confirmation of a low rate of de novo mutations in the children of the family trios. I discussed a result last spring that came to a very low rate of per-site mutation ("A low human mutation rate may throw everything out of whack"). The rate in that paper was 1.1 x 10^-8 per site per generation. The current paper comes to a rate between 1.0 x 10^-8 and 1.2 x 10^-8. I have some more written on this issue and I'll integrate the new finding and post it later in the week. This aspect of the study is pretty important to our understanding of human evolution.

    The paper makes an interesting distinction between "accessible" and "inaccessible" portions of the genome -- accessibility meaning ease of mapping and aligning sequence reads:

    Accurate identification of genetic variation depends on alignment of the sequence data to the correct genomic location. We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads (Supplementary Information). This approach balances the need to reduce incorrect alignments and false-positive detection of variants against maximizing the proportion of the genome that can be interrogated.

    For the low-coverage analysis, the accessible genome contains approximately 85% of the reference sequence and 93% of the coding sequences. Over 99% of sites genotyped in the second generation haplotype map (HapMap II; ref. 4) are included. Of inaccessible sites, over 97% are annotated as high-copy repeats or segmental duplications. However, only one-quarter of previously discovered repeats and segmental duplications were inaccessible.

    It's an interesting decision -- just focus and report on the majority of the genome where alignment is easier.

    The paper discusses selection briefly. There's not much new here other than the identification of candidate causal variants for some selected haplotypes.

    First, it provides a more comprehensive catalogue of fixed differences between populations, of which there are very few: two between CEU and CHB+JPT (including the A111T missense variant in SLC24A5 (ref. 38) contributing to light skin colour), four between CEU and YRI (including the −46 GATA box null mutation upstream of DARC (ref. 39), the Duffy O allele leading to Plasmodium vivax malaria resistance) and 72 between CHB+JPT and YRI (including 24 around the exocyst complex component gene EXOC6B); see Supplementary Table 7 for a complete list. Second, it provides new candidates for selected variants, genes and pathways. For example, we identified 139 non-synonymous variants showing large allele frequency differences (at least 0.8) between populations (Supplementary Table 8), including at least two genes involved in meiotic recombination -- FANCA (ninth most extreme non-synonymous SNP in CEU versus CHB+JPT) and TEX15 (thirteenth most extreme non-synonymous SNP in CEU versus YRI, and twenty-sixth most extreme non-synonymous SNP in CHB+JPT versus YRI). Because we are finding almost all common variants in each population, these lists should contain the vast majority of the near fixed differences among these populations. Finally, it improves the fine mapping of selective sweeps (Supplementary Fig. 14) and analysis of the dynamics of location adaptation. For example, we find that the signal of population differentiation around high Fst genic SNPs drops by half within, on average, less than 0.05 cM (typically 30–50 kb; Fig. 5d). Furthermore, 51% of such variants are polymorphic in both populations. These observations indicate that much local adaptation has occurred by selection acting on existing variation rather than new mutation.

    This last point is not especially demonstrated by the new sequencing data. What we are looking at is few complete sweeps, but that's expected even if all the selected variants were novel mutations -- there just hasn't been time to fix many variants. It remains to be shown the extent to which standing variants are involved in this selection, partial sweeps of new mutations, or parallel adaptations ("Spatial dispersal, parallel adaptation, and the 'Stooge effect'"). We'll probably see a lot more interesting work on recent selection coming out of the new data.

    Science has a companion paper to the Nature data summary, focusing on copy number variation and gene duplications. I will review that one separately.

    UPDATE (2010-10-27): Dienekes pulls out an interesting passage about the Y chromosome sequences, which in at least one case recover many markers separating haplogroups once thought to be much closer to each other. Not sure what to make of that yet.


    References

  • Neolithic milk fog

    Sun, 2010-10-17 14:11 -- John Hawks

    Razib points today to an article in Der Spiegel about the revival of folk migration as an explanation for the Neolithic in Europe. His post ("Völkerwanderung back with a vengeance") is worth reading. The general issues here are very interesting right now because the increase in data has made it possible to propose and test more and more complex scenarios. The simple scenario, gradual demic diffusion, appears wrong in many details. Archaeological cultures appeared and spread in spurts, which we now know were often composed of genetically very different people.

    The article in Der Spiegel is titled, "How Middle Eastern Milk Farmers Conquered Europe".

    The main idea of the article is that our understanding of the spread of Neolithic cultures into Europe has been revolutionized by ancient DNA and more sophisticated chemical analysis of artifacts. That's more or less correct. We really are thinking much more these days about folk migrations bringing new people into Europe. We know that lactase persistence was a recent evolutionary phenomenon in European groups, which was absent before the early Neolithic.

    Problem is: from the standpoint of ancient DNA samples, the lactase persistence mutation was also absent within the early Neolithic! The article is full of details that are wrong or misleading. Most important, it links the appearance and proliferation of the lactase persistence trait with the LBK. This might appear to make sense. The chemical analyses have supported the importance of dairying and presumably milk consumption in the LBK. But the genes of the LBK skeletons don't have the lactase persistence marker.

    The absence of lactase persistence in these early Neolithic people is entirely to be expected. Such an allele couldn't become common until the selection pressure was in place. People had to be drinking milk habitually at key times of vulnerability to establish this selection pressure. Even when the selection pressure is very strong, as it was for lactase persistence, the initial growth of a selected allele is very slow. It did not become common in Europe until thousands of years after it first appeared.
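
    The slow start can be illustrated with a minimal deterministic sketch. It assumes simple genic selection with no drift and no dominance, and both the selection coefficient `S` and the starting frequency `P0` are illustrative values, not estimates for the lactase persistence allele.

```python
def generations_to_reach(p0: float, target: float, s: float) -> int:
    """Generations for a selected allele to rise from p0 to target.

    Deterministic genic selection: each generation the frequency changes
    as p' = p(1 + s) / (1 + s*p). Genetic drift is ignored.
    """
    p, gens = p0, 0
    while p < target:
        p = p * (1 + s) / (1 + s * p)
        gens += 1
    return gens

S = 0.05    # assumed selection coefficient (illustrative)
P0 = 1e-4   # assumed starting frequency (illustrative)
for target in (0.01, 0.1, 0.5, 0.9):
    print(target, generations_to_reach(P0, target, S))
```

    With these numbers, about half of the generations needed to reach 50 percent are spent below 1 percent, where the allele would be nearly invisible in a small sample of ancient skeletons.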

    So lactase persistence did not distinguish early Neolithic people in Europe from agriculturalists in the Near East, because neither of those populations had it at any detectable frequency. All the stuff in the article about how lactase persistence originated in Central Europe? It's irrelevant to whether these ancient populations were connected or not.

    What does distinguish the early Neolithic in central Europe is the mitochondrial DNA. I've discussed this several times in the last few years ("Early European mtDNA: only mysterious if you want it to be", and most recently "French Neolithic discontinuities"). The early Neolithic in Central Europe and France is characterized by several common haplogroups that are absent or rare in both earlier and later Europeans.

    It remains to be seen whether we can document a clear analogue of this mtDNA observation with nuclear genetic data. We know a lot about the variation of present-day Europeans, but most attention to geographic relationships has been run through coarse filters -- maps of the first two principal components are very striking in their correspondence to geography, but they really don't address the timing of movements that may have contributed to the pattern.

    The differences between early Neolithic and later Europeans suggest that post-Neolithic migrations -- real Völkerwanderung -- actually had a major impact on the European gene pool. What we see today is not a pattern established 6000 years ago, but a palimpsest richly painted with strokes from successive migrations.

    One aspect of this scenario: There's no reason to link the early Neolithic with Indo-European languages. There were many later widespread population movements that might have carried this language family, and we know that these later movements were genetically decisive -- at least, as concerns the maternal genealogy. The relation of Y chromosome haplogroups with mtDNA haplogroups is a critical question, but even more necessary is the development of an effective means of testing these hypotheses with nuclear genotype data.

  • Spatial dispersal, parallel adaptation, and the "Stooge effect"

    Thu, 2010-10-14 00:06 -- John Hawks

    Peter Ralph and Graham Coop have an interesting paper in the current Genetics, titled, "Parallel Adaptation: One or Many Waves of Advance of an Advantageous Allele?" [1]

    Fisher [2] famously considered the case in which an advantageous allele is dispersing through a spatially dispersed population, showing that the dispersal forms a "wave of advance". This work was the foundation for a lot of progress in understanding spatial dynamics of organisms.

    As I discussed in 2008 ("Overstating the obvious"), one of the consequences of the Fisher wave model for human evolution is that advantageous alleles will spread very slowly through the population. During the course of the Holocene, a strongly selected mutation might move only across a radius of a thousand or so kilometers. That provides one explanation for why new advantageous alleles haven't spread very far beyond their points of origin -- they just haven't had time yet.
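
    A back-of-the-envelope sketch of that claim, using the standard asymptotic Fisher wave speed of sigma * sqrt(2s) per generation. The parameter values are assumptions: sigma = 10 km and s = 0.05 are borrowed from the Ralph and Coop passage quoted later in this post, and the 25-year generation time and 10,000-year span are round illustrative numbers.

```python
from math import sqrt

def fisher_wave_speed(sigma_km: float, s: float) -> float:
    """Asymptotic speed (km per generation) of a Fisher wave: sigma * sqrt(2s)."""
    return sigma_km * sqrt(2 * s)

def radius_after(years: float, sigma_km: float, s: float,
                 generation_years: float = 25.0) -> float:
    """Distance (km) the wavefront covers in `years` at its asymptotic speed."""
    generations = years / generation_years
    return fisher_wave_speed(sigma_km, s) * generations

# Assumed values: 10,000 years, sigma = 10 km, s = 0.05, 25-year generations.
print(round(radius_after(10_000, sigma_km=10, s=0.05)), "km")
```

    Under these assumptions the radius comes out a bit under 1,300 km. The result scales linearly with sigma, so it is only as good as the guess about dispersal distance.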

    Another reason why an allele might not have spread widely is interference from other alleles with similar effects. I mentioned this process last year ("Spatial variation and near-fixed selected alleles"):

    Greg Cochran and I have been discussing this idea for some time. We call it the "Stooge effect". Think of the Three Stooges all trying to run through a door at the same time and getting stuck in the middle. That's what these genes are doing -- all of them are competing to respond to selection, but each is slowed by the presence of the others.

    Ralph and Coop have cleverly combined the "Stooge effect" phenomenon with spatial dispersal. They suppose a case in which two separate advantageous mutations arise in different geographic locations, each affecting the same trait. Each begins to spread independently as a Fisher wave of advance. What happens when they meet?

    As they show, the dynamics in this case give rise to a static equilibrium -- once the "waves of advance" meet, they stop moving, forming a stable boundary. A new favorable mutation makes headway only so long as it has no equally favorable mutation to compete against.

    I like the way they used both analytical approaches and simulations to come to this outcome. The appearance of stable boundaries in a reaction-diffusion system has long been known (demonstrated first by Alan Turing, actually!). But to my knowledge, no one has considered this specific case from an analytical perspective.

    The Fisher equation is not all that simple for most students to work with. If you become familiar with the equation, you will notice the key aspect is that it has two separate components -- a logistic (or reaction) component representing the increase in frequency at a single point in space, and a diffusion component representing the dispersal across space.
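
    For reference, one common way to write the equation in one spatial dimension is given here. The notation is an assumption for illustration and may differ from the symbols in any particular paper: p(x, t) is the local frequency of the advantageous allele, s its selective advantage, and sigma the standard deviation of parent-offspring dispersal distance.

```latex
\frac{\partial p}{\partial t}
  = \underbrace{s\,p\,(1-p)}_{\text{logistic (reaction)}}
  + \underbrace{\frac{\sigma^{2}}{2}\,\frac{\partial^{2} p}{\partial x^{2}}}_{\text{diffusion}},
\qquad
v = \sigma\sqrt{2s}
```

    Here v is the asymptotic speed of the wave of advance. Setting s = 0 leaves only the diffusion term, which spreads copies of the allele without any net growth.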

    The muscle of the dispersal process comes from the logistic component. Without the intrinsic growth of the selected allele, the dispersal of individuals along the boundary would not carry many copies of the selected allele into new geographic areas. If the local selective advantage dies, the wave of advance rapidly stalls. A static equilibrium arises, with the frequency of the selected allele forming a cline that correlates with the local selection pressure.

    Ralph and Coop's model approximates this case, in a dynamical sense. Each new selected mutation forms an increasing zone in which the selective advantage of other mutations is zero. When those other mutations encounter this zone, they form a stable cline. The cline is stable in the short term, but the diffusion component still disperses copies of an allele; they just lack the muscle to continue their deterministic expansion.
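
    The stalling behavior can be reproduced with a toy one-dimensional simulation. This is a rough finite-difference sketch of my own, not the model or code from the paper: two equally advantageous alleles, A and B, each grow logistically only at the expense of the unselected type and diffuse to neighboring sites. The grid size, the rates `s` and `d`, and the seeding positions are all arbitrary illustrative choices.

```python
def simulate_waves(n_sites=200, s=0.05, d=0.25, generations=1500, seed_b=True):
    """Explicit finite-difference sketch of one or two selected alleles in 1-D.

    a, b: local frequencies of two equally advantageous alleles.
    Each grows only where the unselected type (1 - a - b) remains,
    and diffuses to neighboring sites at rate d per generation.
    """
    a = [0.0] * n_sites
    b = [0.0] * n_sites
    a[0] = 0.5                # allele A arises at the left edge
    if seed_b:
        b[-1] = 0.5           # allele B arises at the right edge
    for _ in range(generations):
        new_a, new_b = a[:], b[:]
        for i in range(n_sites):
            lo, hi = max(i - 1, 0), min(i + 1, n_sites - 1)  # reflecting edges
            wild = 1.0 - a[i] - b[i]
            new_a[i] = a[i] + s * a[i] * wild + d * (a[lo] - 2 * a[i] + a[hi])
            new_b[i] = b[i] + s * b[i] * wild + d * (b[lo] - 2 * b[i] + b[hi])
        a, b = new_a, new_b
    return a, b

alone, _ = simulate_waves(seed_b=False)
a, b = simulate_waves(seed_b=True)
print("A alone, far edge:          ", round(alone[-1], 3))
print("A with competitor, far edge:", round(a[-1], 3))
```

    Run alone, allele A sweeps the whole range in this time. With B present, A is still near fixation on its own side but has made essentially no headway on B's side, because once the unselected type is gone only the diffusion term is left to move copies across the boundary.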

    The most interesting simulations by Ralph and Coop show the two-dimensional case, in which the stable boundaries emerge in a "tessellation" pattern.

    Figure 6 from Ralph and Coop (2010), showing "tessellations" in 2-d simulations of waves of advance.

    The lower three panes in the figure show the stability of the boundaries between the selected alleles. They proceed to fixation locally, but their dispersal stops where they come into contact with other adaptive alleles. Over the very long term, the population will mix -- the diffusion process will slowly carry all these alleles throughout the species' range. Look at the process after a million generations and the entire zone will be gray. But this dispersal occurs at the neutral rate, where the diffusion term is the only factor driving the dispersal.

    What about humans?

    My graduate student Zach Throckmorton and I have been working in this area for a while now. One of the things that impresses us is the way that much more interesting dynamics can emerge when you alter the assumptions. I learned some of this stuff by talking to Frank Livingstone, who gave a lot of thought to these issues of spatial dispersal and selection as applied to malaria resistance alleles.

    In particular, Frank thought about the case where one allele has a slightly larger advantage than another. In some contexts, this allows the "better" allele to overtake and swamp the expansion of the "weaker" (but nonetheless adaptive) one. In others, the two come to a near standstill, one displacing the other only very gradually. Much depends on the timing of the two mutations and the local conditions controlling their initial dispersal.

    Ralph and Coop briefly consider this case in their paper, noting that the difference in fitness advantage of two alleles will allow one to advance into the range of the other, albeit at a slower rate. In humans, we may be seeing a smaller subset of cases, where one or more of the alleles have not yet established a wavefront. In these cases, the arrival of another wave can disrupt the spatial pattern of the rarer allele. The diploid case gives rise to the possibility of more complex epistases. Well-defined boundaries between selected alleles are rare, and where they occur (as may be the case with HbC and HbS in Africa), many have focused on negative epistasis as an explanation.

    Also, alleles are unlikely to substitute perfectly for each other. In many cases, they may work synergistically -- individuals carrying two selected alleles that affect the same function may outperform those carrying only one such allele. At some point, new selected mutations may start to have diminishing returns, even on a trait like skin pigmentation where dozens of alleles may have been selected in widespread human populations. So the current distribution may to some extent be "frozen", but by a more complicated dynamic than the simple intersection of waves of advance.

    As Coop and colleagues showed last year [3], and we discussed in 2007 [4], there are really only a few genes that have approached local fixation in recent human evolution. The current spatial pattern of recently selected alleles doesn't look like a tessellation with many alleles near local fixation. Over most of the Old World, it looks like populations have a very large number of very new alleles, far from fixation, and few up over 70 percent in frequency.

    So the specific scenario in this paper by itself probably does not explain the overall empirical pattern in humans. But if we consider the current pattern as a transient, approximating the early stages of dispersal for many selected alleles, we may not be terribly far off the mark.

    Mutation-limited evolution

    This is a long dense paper and there's a lot in it. One further aspect of the paper that I think is essential is the way that Ralph and Coop reiterate the basic point that more people means more mutations. In their case, they focus on population density over space (population number, when you multiply them) as a constraint on the number of possible adaptive mutations. They apply this idea as a hypothesis to account for parallel adaptations that may have emerged in recent human evolution.

    Multiple mutational origins are likely if the characteristic length is shorter than the physical dimensions of the region. Eurasia measures >8000 km across, and so Table 1 suggests that multiple origins at a single base pair are very unlikely at the lower population density. On the other hand, if the mutational target is large, then multiple origins are likely at low densities, while at high densities independent origins are ubiquitous. The complementary cases of (rho = 2, µ = 10^-8) and (rho = 0.002, µ = 10^-5) give identical characteristic lengths of 3000 km, although the timescale on which the mutations spread differs. Thus for these two parameter combinations we can expect a few mutations to dominate within continents and for multiple mutations to be common in a population spread across an area the size of Eurasia. Obviously these calculations are very crude, as population densities vary through space and time, and dispersal across continents is not simply a function of geographic distance and individual dispersal. Nevertheless, these calculations suggest that it is plausible that for adaptive traits with reasonable mutational targets (e.g., a change anywhere within a gene or pathway) even low population densities can lead to parallel adaptation across an area the size of Eurasia, and higher densities almost certainly will.

    We note that as human population densities have increased dramatically over time, so too has the probability of parallel adaptation. It is interesting therefore to note that a number of recent human adaptations (e.g., sickle cell alleles) involve repeated changes at very small mutational targets in relatively small geographic areas, while older adaptations from single changes (e.g., skin pigmentation) are more broadly spread.

    They are describing a scenario in which small human populations would have been mutation-limited -- that is, the number of new mutations is small, making it unlikely that adaptive mutations will happen in any given generation. In such populations, the rate of adaptation is limited by the availability of new mutations. In an extreme -- in the very small effective sizes of Pleistocene human populations -- the rate of adaptation may be extremely slow and regional populations may come to differ at many weakly selected loci, which spread very slowly.

    As the population grows, strongly adaptive mutations become more and more likely to happen somewhere in the species' range. Yet they are still relatively rare -- meaning that they have an opportunity to spread fairly far before encountering another equally strongly selected mutation affecting the same trait.

    This process can give rise to very large differences on a continental scale, even when the selection pressures in different regions do not differ. In humans, the dispersal of selected alleles across space may have been significantly accelerated by actual dispersals of populations. It is not a mere coincidence that very widespread alleles in Eurasia also tend to be much older than 20,000 years old -- long-distance dispersals prior to that time had a higher chance of leaving a lasting influence on subsequent populations.

    But as the population gets bigger and bigger, parallel mutations are more and more likely to happen. As Ralph and Coop point out, at the extreme of large population size and likely mutations, you shouldn't see any new mutations emerging and spreading over very large areas. Any of these mutations would be very likely to encounter other new mutations that do the same thing.

    Is this likely in humans? Clearly some mutations have happened recurrently. Making a broken gene is easy -- there's a large mutational target, since a large fraction of nonsynonymous substitutions might do the job. So if there's a net selective advantage to breaking a gene, we ought to see that happen recurrently in human populations.

    In contrast, if the mutational target is very small, then mutations will still be rare even in a very large population. If only one base change can have an adaptive effect, that precise change will happen less than once in 10^9 births (remember that not just any mutation at a site, but some particular mutation is what we may need). If a rare duplication or gene conversion is the necessary change, then it may be much rarer.
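
    To put rough numbers on this, here is a minimal sketch of the expected wait for one such mutation to arise and survive its early generations. It takes the bound in the paragraph above (no more than about one occurrence per billion births) as the per-birth rate, treats births per generation as equal to population size, and uses the classic approximation that a new beneficial mutation escapes loss with probability about 2s. The population sizes and s = 0.05 are hypothetical.

```python
def generations_until_established(pop_size: float, per_birth_rate: float,
                                  s: float) -> float:
    """Expected wait (in generations) for one copy that escapes early loss.

    New copies per generation = pop_size * per_birth_rate; each survives
    drift with probability of roughly 2s (classic approximation).
    """
    new_copies_per_generation = pop_size * per_birth_rate
    return 1.0 / (new_copies_per_generation * 2 * s)

PER_BIRTH = 1e-9   # upper bound from the text for one precise base change
S = 0.05           # assumed selective advantage (illustrative)
for n in (1e5, 1e6, 1e8):   # hypothetical population sizes
    wait = generations_until_established(n, PER_BIRTH, S)
    print(f"{n:.0e} people: about {wait:,.0f} generations")
```

    Under these assumptions a population of a hundred thousand waits on the order of a hundred thousand generations, while a population of a hundred million waits about a hundred. The exact figures matter less than the thousandfold difference.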

    Looking across the last few million years, when human population numbers were much smaller than the Holocene, we can be pretty sure that some aspects of our evolution were mutation-limited. The changes that took hold in our ancestors were the ones that happened, and that survived the winnowing of genetic drift. Many changes that would have been adaptive didn't happen in our ancestors. They just weren't lucky enough.

    But some of those changes would still be adaptive now, if we could get them. And we have had much larger numbers in the last 10,000 years. Homo erectus needed these mutations, but we only now are seeing them selected in the human population.

    Malaria adaptation

    Hemoglobinopathies are among the cases of easy mutations -- where breaking a gene is adaptive. It's not just any broken version of alpha- or beta-globin that does the job, though. The hemoglobin needs to be impaired in certain ways to impede the parasites while maintaining blood function. This provides many of the classic cases of human adaptation, and Ralph and Coop turn to this system for examples of parallel adaptation:

    The sickle cell allele HbS at the β-globin gene in humans provides a particularly interesting case of putative parallel adaptation. The HbS allele (β6 Glu-Val) has been driven to intermediate frequencies by selection within the past 10,000 years due to increased resistance to malaria of heterozygotes for the allele (HALDANE 1949; ALLISON 1954; CURRAT et al. 2002; KWIATKOWSKI 2005). The HbS allele is present on at least four major distinct haplotypes in Africa, each at intermediate frequency within a different geographic region; the haplotypes are named after the population sample where they were first discovered (Central African Republic, Senegal, Benin, and Cameroon). This is consistent with multiple origins of this single-base-pair change. Note that a distinct, malaria resistance allele, HbC (β6 Glu-Lys), has also arisen in Africa at the same codon as the HbS allele (TRABUCHET et al. 1991; AGARWAL et al. 2000; WOOD et al. 2005a), increasing our confidence that the mutational input was high enough to allow multiple types to arise. However, FLINT et al. (1998) thought the hypothesis of multiple new mutations arising at a single base pair was extremely unlikely and proposed that it was more likely that gene conversion had spread a single mutation across multiple haplotypes.

    The theory we have developed can be used to assess the plausibility of the multiple mutational origins of the sickle cell allele, by exhibiting parameter combinations that yield characteristic lengths consistent with the separation of the sample locations. [Recall that the wave of advance, and thus also our model, works in the case of heterozygote advantage (ARONSON and WEINBERGER 1975).] The different HbS haplotypes co-occur within a few thousand kilometers of each other (see Table 5 of FLINT et al. 1998) (noting that these locations are unlikely to reflect the geographic mutational origins, and mutations will have been spread by large population movements). As the HbS changes occur at a single base pair, the mutation rate would have been 10^-8, and we take an s = 0.05 (as in CURRAT et al. 2002). If human dispersal at that time was well approximated by a Gaussian kernel with sigma = 100 km, then a characteristic length of 1000 km would require an effective density of individuals of rho = 25 km^-2, while if sigma = 10 km, then we would require only rho = 2.5 km^-2. This latter set of parameters does not seem unrealistic, considering our knowledge of population density and dispersal parameters, so our model suggests that the hypothesis of multiple origins is not unreasonable.
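
    The trade-off between dispersal distance and density in that passage can be checked with a rough dimensional sketch. To be clear about the assumption: the scaling below (wave speed divided by the rate of established mutations per unit area, raised to the 1/3 power for two dimensions) is my own simplified reading with constant factors dropped, not necessarily the exact expression in the paper. The function name is hypothetical.

```python
from math import sqrt

def length_scale_km(sigma_km: float, s: float,
                    rho_per_km2: float, mu: float) -> float:
    """Rough 2-D length scale in km; order-one constants omitted (assumption).

    wave speed            ~ sigma * sqrt(2s)   km per generation
    established mutations ~ 2s * rho * mu      per km^2 per generation
    """
    speed = sigma_km * sqrt(2 * s)
    established = 2 * s * rho_per_km2 * mu
    return (speed / established) ** (1 / 3)

# The two parameter sets from the quoted passage (s = 0.05, mu = 1e-8).
print(round(length_scale_km(100, 0.05, 25, 1e-8)), "km")
print(round(length_scale_km(10, 0.05, 2.5, 1e-8)), "km")
```

    Both parameter sets give the same length scale, on the order of 1,000 km, because only the ratio of sigma to rho enters. Cutting dispersal distance tenfold lets the required density drop tenfold as well.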

    I think they've got the basic idea correct here, but there are some additional details to consider. The distribution of HbE is not quite so easy to understand if parallel mutations are really so likely, and of course there is the negative epistasis of different alleles (and the thalassemias) which impacts their dispersal ability when they become moderately common. The dynamic may be of similar form to the one described here, but boundaries between alleles may be reinforced by the fitness costs of carrying multiple ones.

    This situation raises the issue of path dependence. Some mutations have "first mover" advantages. Once they are common, other adaptive mutations may still occur -- even mutations that are better from the standpoint of fitness -- but be lost or grow very slowly because their net fitness advantage over the common mutant is slight. Where HbE is common, new HbS alleles are unlikely to invade quickly. Where HbS is common, new HbE mutants are similarly unlikely to invade -- even though HbE has a higher fitness.

    Network effects among genes may also dominate the spatial dynamics. HbS spread most widely in the context of populations that were already Duffy null, and in which G6PD deficiency was rapidly increasing. The first conditioned the parasite environment -- P. vivax had a strong disadvantage in Duffy null populations, P. falciparum made up most of the parasite load. G6PD deficiency should have impacted the relative advantage of HbS, more and more as it became more common. Those are two loci among many that alter malaria dynamics in Africa compared to South and Southeast Asia.

    Conclusions

    There is much more to say about this paper -- it's 22 journal pages. But I think I've given an impression of what's there and how the ideas may impact our interpretation of recent human evolution. Many of the central concepts were presaged by earlier work in 2007 and 2008, as reviewed here on the blog. The new analytical and simulation work, I really like.

    Hopefully we can get out some shorter papers that will focus on aspects of these problems as applied to humans. A message that comes across very clearly in our work and this new paper is that different time periods in our evolutionary history must have had very different selection dynamics. Pleistocene humans were not only in a different ecology than us, they experienced a radically lower potential for adaptation.


    References
