john hawks weblog

paleoanthropology, genetics and evolution

Neandertal DNA

  • Did Denisovans have genetic adaptations to high altitude?

    Tue, 2011-06-21 12:26 -- John Hawks

    We don't really know the extent of territory that might have been occupied by the population represented by the Denisova genome. The signs of mixture into the Melanesian/New Guinea population suggest that the Denisova individual shared many genes with people who lived somewhere along the South or Southeast Asian coast. Denisova itself, however, is in the Altai Mountains.

    Last week I wrote some thoughts about the possible introgression of HLA alleles from Denisovans into more recent populations. HLA genes pose many problems for testing this hypothesis -- including the difficulty of identifying the alleles in a low-coverage genome and the high chance of incomplete lineage sorting of ancient alleles in recent populations. In principle, evidence of introgression may be much easier to find in other parts of the genome.

    If an allele that originated in Denisovans had some advantage in later populations, it might today be found very widely spread across Asian populations, even if the amount of Denisovan ancestry in most of these populations is very small. This was the theme of my paper with Gregory Cochran several years ago [1] ("The inevitability of introgression"). The probability that a single copy of an advantageous allele will survive and increase in the population is roughly 2s, where s is the fitness advantage in a heterozygote carrying the allele. A relatively small number of copies of an allele might have entered a recent human population by introgression from some ancient population, but these few copies would have a high likelihood of surviving and increasing in frequency, possibly toward fixation. HLA alleles could easily be in this category, but the challenges of identifying them and the high chance of ILS make the hypothesis hard to test.
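    As a rough illustration of the 2s rule, the sketch below computes the chance that at least one of several introgressed copies escapes loss, treating each copy as an independent trial. The function names and any values of s or copy number you feed it are illustrative assumptions, not figures from the paper.

```python
def survival_probability(s):
    """Approximate chance that one new copy of an advantageous allele
    escapes loss by drift: roughly 2s, where s is the heterozygote
    fitness advantage (capped at 1 for safety)."""
    return min(1.0, 2.0 * s)


def at_least_one_survives(s, copies):
    """Chance that at least one of `copies` introgressed copies survives,
    ASSUMING each copy is an independent trial with probability 2s."""
    return 1.0 - (1.0 - survival_probability(s)) ** copies
```

    For example, `at_least_one_survives(0.01, 100)` asks how likely it is that a 1 percent advantage carries at least one of 100 introduced copies past the early risk of loss.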

    Another strategy is to identify genes that have been selected in recent populations and see if the linked haplotype shows up in the Denisova genome. Recently, several studies have attempted to identify genes related to high altitude adaptation in Tibetans. At least some Denisovans lived in the mountainous areas of central Asia, and so I'm curious whether they might have some alleles adapted to this environment. The Altai are not nearly as high as the Tibetan plateau (in fact Denisova itself is not much higher than western Kansas), and we don't know how long Denisovan people might have been resident in Central Asia, but if we're looking for selected alleles there are some strong candidates in this category of genes.

    So let's look at some of them. All positions here are mapped to the hg18 human genome assembly.

    Yi and colleagues [2] find a strong frequency difference between China and Tibet for a SNP in EPAS1, at chr2:46441523. The derived allele, G, has a frequency of 87% in their Tibetan sample but only 9% in their Chinese sample (and zero in Denmark). The Denisova genome is represented by two reads at this site, both C, the ancestral allele. We don't necessarily have to accept that this is a functional site, but as the marker most strongly differentiating the high altitude population it would likely be closely linked to any functional variant. So the Denisova allele suggests that this ancient individual lacked whatever functional variant might currently be common in Tibetans for this gene.

    Simonson and colleagues [3] took a different approach, focusing on candidate genes that they argued a priori were likely to be involved in adaptation to hypoxia because of their physiological role. They evaluated these genes for evidence of positive selection in Tibetans, finding several candidate haplotypes for recent adaptive evolution to high altitude.

    For each of five genes, they identified a three-locus "core selection haplotype" that shows signs of selection within Tibet. The purpose of these three-SNP haplotypes was to examine the correlation of haplotypes and phenotypes in a sample of people where physiological data were taken. So they are intended as tags, not as comprehensive and unique identifiers of the candidates at the genetic level. But the three-locus haplotypes are the only ones reported in the supplement to the paper, so that's what I have to compare.

    EGLN1: The three-allele candidate selected haplotype consists of A at chr1:229793717, T at chr1:229667980 and T at chr1:229665156. Denisova apparently has the selected haplotype with A at chr1:229793717 (2/2 reads), T at chr1:229667980 (3/3 reads) and T at chr1:229665156 (1/1 reads). However, it is not obvious whether this is significant. All three alleles on the candidate selected haplotype are the ancestral (present in chimpanzees and gorillas) alleles, which are much more likely to show up in the archaic genomes than derived alleles. These ancestral alleles are also present in several of the whole genomes provided along with the Denisova sequence reads. So it's not clear to me how good a candidate for selection the haplotype really is.

    CYP17A1: Here the three-allele candidate selected haplotype includes G at chr10:104568521, G at chr10:104594906, and C at chr10:104517420. Denisova has C (5/5 reads, ancestral), T (4/4 reads, ancestral), and C (3/3 reads, ancestral). Again, Denisova has the all-ancestral haplotype here, but in this case it is not the selection candidate.

    PTEN: The selected candidate haplotype is G at chr10:89770364, C at chr10:89790851 and C at chr10:89778618. Denisova has G (5/5 reads, ancestral), T (2/2 reads, derived), and C (4/4 reads, ancestral). Not selected.

    I always find it interesting when the Denisova genome has a derived allele at an interesting site -- it is the shared derived alleles between these archaic genomes and living people that constitute evidence of genetic persistence of the archaic people. No single site carries that information (any one allele may be shared by incomplete lineage sorting), but I still like to note them. The Papuan and half the Native American, Sardinian and Mongolian reads share the derived T at chr10:89790851 with Denisova.

    HMOX2: The candidate selected haplotype has C at chr16:4456093, T at chr16:4465266, T at chr16:4442515. Denisova has this candidate selected haplotype: C (3/3 reads, ancestral), T (4/4 reads, ancestral), T (5/5 reads, ancestral). That haplotype may also be in the Cambodian whole genome accompanying the Denisova data, and can't be ruled out for the Mongolian. Again, the all-ancestral haplotype and wider distribution argue against the hypothesis that this haplotype was specifically selected in Tibet.

    PPARA: The core candidate selected haplotype has A at chr22:44827140, C at chr22:44832376 and T at chr22:44842095. Denisova has A (8/8 reads, ancestral), A (5/5 reads, ancestral), and C (2/2 reads, ancestral). Notice again that Denisova has the all-ancestral haplotype. For an ancient sequence, we are finding this is the usual case: human-derived alleles are just rarer in this genome.

    OK, where are we? Out of six genes that are candidates for selection on altitude adaptation in Tibetans, the Denisova genome has two -- at EGLN1 and HMOX2. In both cases, the core selected haplotype consists entirely of ancestral alleles, and so I think they are actually poor evidence of introgression on the surface. I would test them by looking at more SNPs linked to the presumed selected haplotype, hoping to find some derived SNPs shared by the Denisovan genome and the presumed selected haplotypes. Unfortunately, publications do not yet routinely report long haplotypes, so it will take some more digging to test these cases.
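    The tally can be reproduced mechanically. This is a minimal sketch that transcribes the three-SNP haplotypes from the gene-by-gene notes above and checks which ones the Denisova calls match; EPAS1 is left out because it was compared at a single SNP. The dictionary layout and function names are illustrative assumptions.

```python
# Three-SNP candidate haplotypes and Denisova calls, transcribed from the
# notes above (alleles in the order the positions are listed for each gene).
CANDIDATE = {
    "EGLN1": ("A", "T", "T"),
    "CYP17A1": ("G", "G", "C"),
    "PTEN": ("G", "C", "C"),
    "HMOX2": ("C", "T", "T"),
    "PPARA": ("A", "C", "T"),
}
DENISOVA = {
    "EGLN1": ("A", "T", "T"),
    "CYP17A1": ("C", "T", "C"),
    "PTEN": ("G", "T", "C"),
    "HMOX2": ("C", "T", "T"),
    "PPARA": ("A", "A", "C"),
}


def matching_genes(candidate, observed):
    """Genes where the observed alleles equal the candidate haplotype."""
    return sorted(g for g, hap in candidate.items() if observed.get(g) == hap)


def mismatch_count(candidate_hap, observed_hap):
    """Number of tag SNPs at which two haplotypes differ."""
    return sum(1 for a, b in zip(candidate_hap, observed_hap) if a != b)
```

    Calling `matching_genes(CANDIDATE, DENISOVA)` lists the genes where Denisova carries the full tag haplotype. It says nothing about whether the match reflects introgression, since an all-ancestral match is expected to be common in an archaic genome.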



    Synopsis: 
    Noodling through the Denisova genome data for signs of candidate altitude adaptations.
  • The immune systems of archaic humans

    Fri, 2011-06-17 09:37 -- John Hawks

    I've just submitted an abstract for a conference in the fall, with the title, "Immunogenetics of archaic humans."

    Ten years ago, it would have been beyond imagining that this kind of science would be possible. Now, my graduate student Aaron Sams has been working directly with HLA and other immune system genes in ancient DNA sequences. It's pretty tough to work with the HLA region because of the low coverage of the ancient genomes and the high variation and repetitiveness of the HLA. But it is possible to find some of the basic human alleles in the ancient sequences, and those open the possibility of examining the coevolution of pathogens and human immunity in our recent evolution.

    Turns out we're not alone: According to New Scientist, Peter Parham has also been looking at HLA in archaic humans: "Breeding with Neanderthals helped humans go global".

    One allele, HLA-C*0702, is common in modern Europeans and Asians but never seen in Africans; Parham found it in the Neanderthal genome, suggesting it made its way into H. sapiens of non-African descent through interbreeding. HLA-A*11 had a similar story: it is mostly found in Asians and never in Africans, and Parham found it in the Denisovan genome, again suggesting its source was interbreeding outside of Africa.

    HLA-A*11 is actually the most common allele of HLA-A in Papua New Guinea, the population that otherwise shows significant evidence of ancestry from a Denisova-like genome. However, I don't agree with the main idea of the article. The major human HLA alleles are evolutionarily ancient -- most of them predate the origins of modern human groups and are older than the founding of the Denisova-Neandertal populations. This is actually perhaps the worst region to look for evidence of interbreeding among these populations because the probability of incomplete lineage sorting (maintained by balancing selection) is very high.

    As a case in point, HLA-A*11 is very common in Papua New Guinea, but it is also very common in north India and in China. These two areas otherwise show no significant evidence of Denisova ancestry. We might conclude that the HLA-A gene just has an unusually high level of introgression into Asian populations, not typical of the genome as a whole. That's certainly possible. But without finding any substantial number of derived mutations in the HLA-A*11 variant in the Denisova genome and in living Asians, it is hard to rule out that the sharing of HLA-A*11 in all these populations is just coincidence.

    Of course, if the allele were absent in Africa, that would weigh in favor of the idea it is shared by Late Pleistocene interbreeding outside Africa. But HLA-A*11 is in Africa, just very rare. And it's in Europe. This is the kind of locus that is difficult to interpret: if it has any tiny disadvantage against malaria, for instance, its rarity in Africa is easily explained as a function of recent evolution, while its presence almost everywhere outside Africa would be no surprise even if there were never any interbreeding. This is not a case where the geographic distribution is an unusual coincidence -- it's present in Africa and relatively more common everywhere outside sub-Saharan Africa. So the distribution outside Africa cannot simply be explained by interbreeding with Denisovans -- not without selection -- leaving us stuck. Parham's hypothesis may be correct, but the data are really not sufficient to decide.

    HLA-C*07:02 -- the one apparently mentioned in the story -- is all over sub-Saharan Africa at low frequencies. Allelefrequencies.net has a dozen entries for the frequency of HLA-C*07:02 in sub-Saharan Africa; they all have it at frequencies up to around 7 percent (except for the small (n

    What about the question of hybrid vigor that the article raises? Is it possible that modern humans got HLA mojo from Neandertals and Denisovans?

    While only 6 per cent of the non-African modern human genome comes from other hominins, the share of HLAs acquired during interbreeding is much higher. Half of European HLA-A alleles come from other hominins, says Parham, and that figure rises to 72 per cent for people in China, and over 90 per cent for those in Papua New Guinea.

    I just don't think it's clear that these HLA alleles in humans have actually come from the archaic genomes.

    We've tried to match these at more precise levels (in the HLA system, that would be four- or six-digit haplotypes) and have not found the quality of the data high enough to manage a close match. That leaves us with the most superficial classification, which isn't enough to argue that the present human types are derived from the archaic genomes. Incomplete lineage sorting remains a good explanation for the similarities. In fact, we're thinking it makes a nice case study of just how hard it is to work with these genomes, which have lower than 2x coverage. Just typing the Denisova genome requires an assumption about whether the individual was a homozygote or heterozygote across the locus -- an assumption that we can test easily with higher coverage, but not so much with 1x and many gaps. It also requires greater trust in the mapping quality of the reads than we probably should have. With those caveats, the match to HLA-A*11 is likely but not totally solid. Saying that HLA-A*11 in modern humans came from Denisovans is simply premature. And while I've focused here on HLA-A, this is also true of all the other loci. There's a tipping point at higher coverage where typing becomes more secure, and the archaic data are not there.

    Anyway, I imagine that anyone typing HLA in whole genome data knows all this. The press account isn't going to go into the complexity, and I think it's worth noting the real difficulty of making inferences in this region of the genome on the archaic data. It's a tough problem and I've spoken to many human geneticists who thought we were foolhardy to start. But with the first information about the immune systems of archaic humans as the goal, you can see it's a worthwhile problem to tackle.

    Synopsis: 
    We're gathering the first information on the immune systems of ancient humans. Some challenges await.
  • Finding more Neandertal genes, chromosome 19 edition

    Thu, 2011-03-31 18:46 -- John Hawks

    When I last wrote about the Neandertal genome, I showed that across the X chromosome, Europe and China have different Neandertal genes. There is overlap between the two, but as a generalization few Neandertal haplotypes that are common in Europe are also common in China, and vice-versa. I described the basic method for finding Neandertal haplotypes in recent people last month ("Neandertal segments of X chromosomes").

    Almost all of the Neandertal haplotypes found in the X chromosomes of recent people are relatively rare, occurring in fewer than 10 percent of individuals. The largest fraction of Neandertal haplotypes occur in only a single person in the HapMap samples.

    But is this a pattern that occurs on the autosomes, or does it reflect X chromosome dynamics in some way?

    That's not a hard question to answer, and I went looking first at chromosome 19. The number of haplotypes is fewer, because chromosome 19 is shorter than the X. The overall pattern is the same. Most Neandertal haplotypes are rare in the HapMap samples, and relatively few are common in both the CEU and CHD samples.

    [Figure: histogram of Neandertal haplotypes on chromosome 19 in the CEU and CHD HapMap samples]

    I put the origin at the rear; CEU (European ancestry in Utah) number of copies goes toward the left, CHD (Chinese immigrants in Denver) toward the right. You can see that most of the cases are clumped on the extreme edge of both axes. There are not higher counts in CHD; the two axes are at different scales because of one extremely common region in Europeans, as noted below.

    I've received a few comments on the 3-d histograms. I don't like them much, either, and I'm looking for an alternative. This one in particular is miserable, because it's out of scale. I'd like to plot these in 2-d using shading to denote bin counts. Unfortunately I haven't found a quick and dirty program that will do this in 2-d, and I've got too wide a range of bin counts for a bubble plot to do it without a lot of tweaking. So I'm stuck with these for now. I can either write about them and share them or spend my time finding a better graphing solution.
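    One way a 2-d shaded version might look, as a minimal sketch rather than a finished graphing solution: bin the (copies in sample A, copies in sample B) pairs and shade each cell on a log scale, so one very full bin near the origin does not wash out the sparse ones. The function names, the character ramp and the input format are all assumptions for illustration, and the input is assumed to be non-empty.

```python
import math
from collections import Counter

SHADES = " .:*#"  # lightest (empty bin) to darkest (fullest bin)


def bin_counts(pairs, bin_width=1):
    """Count haplotypes falling in each (x_bin, y_bin) cell."""
    return Counter((x // bin_width, y // bin_width) for x, y in pairs)


def shade(count, max_count):
    """Map a bin count to a shading character on a log scale."""
    if count == 0:
        return SHADES[0]
    if max_count <= 1:
        return SHADES[-1]
    level = math.log(count + 1) / math.log(max_count + 1)
    index = 1 + int(level * (len(SHADES) - 2))
    return SHADES[min(index, len(SHADES) - 1)]


def render(pairs, bin_width=1):
    """Return a text grid with the origin at the lower left corner."""
    counts = bin_counts(pairs, bin_width)
    max_x = max(x for x, _ in counts)
    max_y = max(y for _, y in counts)
    top = max(counts.values())
    rows = []
    for y in range(max_y, -1, -1):
        rows.append("".join(shade(counts.get((x, y), 0), top)
                            for x in range(max_x + 1)))
    return "\n".join(rows)
```

    Each input pair is one haplotype, given as its copy count in the first sample and its copy count in the second. The same binned counts could be handed to any plotting library that draws a shaded grid.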

    I've done a few more comparisons. When we look for Neandertal 10-SNP haplotypes in CEU versus TSI (the sample from Tuscany), we find mostly the same haplotypes in both samples. A haplotype in 10 copies in CEU is certain to be in TSI, and vice-versa.

    [Figure: histogram of Neandertal haplotypes on chromosome 19 in the CEU and TSI HapMap samples]

    Number of copies in CEU goes across the bottom, TSI back into the picture. This is such a striking difference from the CEU-CHD comparison. It's very comforting to me, because this is totally the expected pattern -- CEU and TSI should have the same things, because they share most of their population history! I will mention that for the X chromosome, CHB and JPT have a similar pattern; they mostly share the same stuff. This helps lend some significance to the finding below that GIH is also pretty different from all these other samples.

    You can see that there is one locus where CEU has more than 100 copies. The little cluster there indicates that this haplotype extends over more than 10 SNPs; in fact it's 13 SNPs with possibly 2-3 flanking SNPs forming a decay pattern on either side, and the total length is around 150 kb. There are more than 80 copies in Tuscans, and more than 40 in Gujaratis, but only a single copy in the Chinese sample. Three genes lie in this interval but none point to any obvious hypothesis (to me, at least) about why the Neandertal haplotype would be especially common in western Eurasia. I note it because this is the first Neandertal haplotype I've found with a frequency up over 20 percent or so; this one is about 60 percent in CEU and 50 percent in TSI.

    The Gujarati (GIH) sample adds its own distinct twist. There is some overlap between GIH and CEU, and some overlap between GIH and CHD. But by and large the same pattern obtains as between Europe and China: India has its own Neandertal common variants, not widely shared with either CEU or CHD. For example, here's the CHD comparison; CHD going toward left, GIH toward right. The basic pattern is that most cases are clustered on the edge of the graph, few are scattered across most of the area, and there's no consistent pattern among them. Still, the highest-frequency GIH case is the same as the high-frequency haplotype noted in CEU and TSI above.

    [Figure: histogram of Neandertal haplotypes on chromosome 19 in the CHD and GIH samples]

    These examples should demonstrate pretty clearly that this is not solely an X chromosome phenomenon; basically we're looking at the effects of drift in small ancient populations after they mixed with Neandertals.

    I did have an excellent question today after my talk where I discussed this pattern -- how do we know that this isn't separate mixture events giving rise to different Neandertal-derived variants in different recent humans?

    That's not a trivial question to answer, and I don't think we could easily rule out the hypothesis in the abstract. But the fact that these populations have very similar fractions of Neandertal contribution overall does suggest a single history of mixing. I'll give this some more consideration as I look across the rest of the genome.

  • Older and younger Acheulean in India

    Sun, 2011-03-27 00:37 -- John Hawks

    Shanti Pappu and colleagues [1] report on date estimates resulting from new excavations at the old site of Attirampakkam, India. The news element is that they date an Acheulean occurrence to as old as 1.5-1.6 million years ago. At the oldest, these dates would make the Acheulean in India equal in age to the earliest occurrences in Africa.

    The dates depend on the decay of cosmogenic nuclides in the artifacts themselves. This is a kind of exposure dating -- as the artifacts are exposed to cosmic rays at the Earth's surface, they build up radioactive isotopes of beryllium and aluminum (10Be and 26Al), which have half-lives of 1.39 million and 717,000 years, respectively. When they are buried deep underground, their exposure to cosmic rays stops, and the radioactive isotopes can only decay. Then the ratio of the two isotopes in the sample reflects the time since deep burial. But like other exposure methods, in practice this depends on a model of exposure time, burial speed, and radioactivity within the soil, which lends substantial uncertainty to the dates. The lower bound of the 95% confidence interval for each of the date estimates reported in the paper is still over a million years, leading to the minimal conclusion that the site is that age or older.
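    The core arithmetic of the ratio clock can be sketched in a few lines. This is a deliberately simplified model, not the one used in the paper: it assumes instant, complete burial with no isotope production afterward, and it assumes a starting 26Al/10Be ratio of 6.75, a commonly cited surface production ratio in quartz. The half-lives are the ones quoted above; the function names are illustrative.

```python
import math

HALF_LIFE_10BE = 1.39e6   # years, as quoted above
HALF_LIFE_26AL = 717_000  # years, as quoted above
SURFACE_RATIO = 6.75      # ASSUMED 26Al/10Be ratio at the moment of burial


def decay_constant(half_life):
    """Decay constant (per year) for a given half-life in years."""
    return math.log(2) / half_life


def ratio_decay_rate():
    """Net rate at which the 26Al/10Be ratio falls after burial.
    26Al decays faster than 10Be, so the ratio declines over time."""
    return decay_constant(HALF_LIFE_26AL) - decay_constant(HALF_LIFE_10BE)


def ratio_after_burial(years, initial_ratio=SURFACE_RATIO):
    """Expected 26Al/10Be ratio after `years` of complete burial."""
    return initial_ratio * math.exp(-ratio_decay_rate() * years)


def burial_age(measured_ratio, initial_ratio=SURFACE_RATIO):
    """Invert the decay law: years of burial implied by a measured ratio."""
    return math.log(initial_ratio / measured_ratio) / ratio_decay_rate()
```

    Under these assumptions the ratio halves roughly every 1.5 million years, which is why the method suits sites of this age. The real uncertainty comes from the parts this sketch ignores: pre-burial exposure history, burial speed and production at depth.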

    Robin Dennell has written an accompanying short essay that gives a broader view of the Acheulean in South Asia [2]. The essay includes a great paragraph summarizing the now-obsolete idea that the Acheulean reached India only a half million years ago:

    How does this new evidence affect our understanding of the South Asian Acheulian? Previously, the general consensus was that the Indian Acheulian was less than 0.6 to 0.5 Ma (5) and was thus much younger than that in the Levant (eastern Mediterranean). There, the earliest dates of 1.4 Ma, from ‘Ubeidiya in Israel, probably indicate a dispersal of hominins from Africa (6). A second influx of African immigrants is indicated by the discovery of African types of cleavers and hand axes at Gesher Benot Ya'aqov (GBY), in Israel, dated to 0.78 Ma (7). This evidence implied that the Acheulian dispersed eastward toward South Asia only several hundred millennia after it first appeared in the Levant. It also implied that the spread of Acheulian bifacial technologies into South Asia was broadly contemporaneous with its first appearance in Europe, where the earliest sites date from ∼0.5 to 0.6 Ma (8). Some have attributed this expansion of the Acheulian into South Asia and Europe to Homo heidelbergensis. This Middle Pleistocene type of hominin is known mostly from Europe, where it was first defined, but is also recognized by some (but not all) researchers at African sites such as Bodo, Ethiopia, and Kabwe, Zambia, and even at some sites in China (9).

    The "Homo heidelbergensis" model is in such utter disarray right now, I'm not sure many paleoanthropologists have realized the full extent of the problems. You should know that I don't believe in Homo heidelbergensis, never have. A couple of months ago, I was discussing some of the issues about mutation rate estimation with a very prominent geneticist, and the conversation turned to Homo heidelbergensis. What a shock the Denisova sequence should have been to those itching to see a H. heidelbergensis incursion into Asia!

    Notice, however, the intrinsic nuttiness of archaeological interpretation. Oh, we have the first evidence for Acheulean in India around 600,000 years ago? Well, that's around the same age as the Bodo fossil from Ethiopia! What a coincidence! Maybe this new kind of hominin expanded from Africa and carried the Acheulean to India! And Sima de los Huesos is around 600,000 years old, too -- and there's a handaxe in the pit! My gosh, we need a name for those hominins!

    Well, the nice thing about a hypothesis built on mere coincidence is that it only takes one observation to falsify it. Million-year-old handaxes in India ought to do it, and how. That's the message of Dennell's essay, and the subtext of the paper by Pappu and colleagues. What I find interesting is the extent to which the fact was hinted by earlier discoveries in South Asia but hampered by weaknesses in stratigraphic control and dating. From Pappu and colleagues:

    Sparse radiometric ages from sites in India have situated the Acheulian within the Middle Pleistocene, with a few dates suggesting an early Middle to Early Pleistocene age. However, these ages often exceed the limits of confidence of the methods used (2). They include an electron spin resonance (ESR) mean age of 1.27 ± 0.17 Ma, assuming linear U uptake, on two herbivore teeth from Isampur (23); an ESR age of ~0.8 Ma (lacking uncertainty envelopes) on calcrete from the Amarpura formation, Rajasthan (24), which has been correlated with the Acheulian site of Singi Talav (4); dates ranging from ~1.4 to 0.67 Ma for the tephra at Bori (Kukdi river) (25); and paleomagnetic measurements with evidence of reversals at the sites of Bori, Morgaon, Gandhigram, Andora, and Nevasa (26). However, the reliability of these ages has, in each case, been questioned on various grounds (5, 27, 28). Likewise, the age and stratigraphic position of artifacts and faunal remains from the Early Pleistocene Dhansi formation along the river Narmada are yet to be firmly established (29). Based on data from controlled excavations and two independent dating methods, our ages from Attirampakkam show that the Acheulian in India is older than previously thought. Evidence from other sites in South Asia should be reconsidered and redated.

    Much evidence already exists in the South Asian Acheulean that could be more accessible. The Acheulean in the region has been a long block of undifferentiated time, despite some very well-resolved sites. In addition to this much older dating for early Acheulean, India also has some of the youngest Acheulean assemblages anywhere -- for example, Haslam and colleagues [3] earlier this month reported on an Acheulean assemblage from around 130,000 years ago in north-central India. That's long after the large biface tradition begins to give way to Middle Paleolithic and MSA toolkits in Europe and Africa.

    On the topic of Denisova, Haslam and colleagues were writing before that genome was reported. But they did know about the Neandertal genetic results, including the evidence of Neandertal ancestry within India. Nevertheless, they assert a scenario in which the makers of earlier and later Acheulean in South Asia are the same biological population, without substantial gene flow from regions to the west, including the Neandertals.

    Recent reports of the draft Neanderthal genome suggest that Neanderthals and H. sapiens likely did interbreed successfully soon after the latter had left Africa (Green et al., 2010), with the probable location of such contact to the west of India, in the Middle East. The southern limit of the Neanderthal range is unknown (Dennell and Roebroeks, 2005), but we emphasise that the continuity seen in the Middle Pleistocene South Asian technological record suggests that taxa derived from earlier hominin dispersals, and not Neanderthals, were the creators of the Indian Late Acheulean. Greater biological separation between dispersing humans and resident Indian hominins may have precluded viable genetic mixing (although see Liu et al., 2010 for an alternate view from East Asia), while similarities in certain technological strategies may have rendered cultural exchange a somewhat more likely occurrence.

    Well, the Denisovans didn't have to live in India when the ancestors of Melanesians ran across them and intermarried. But Denisova and the Neandertal genomes now make it very likely that the inhabitants of South Asia were one or the other. And even if South Asians were yet a third group, as yet unattested from genomes, it is no longer credible to suppose that they were isolated from Europe or Africa for a million years previous. The tools just don't have that much to do with the populations.



    Synopsis: 
    Long known from India, new papers are adding detail to the temporal extent of the Acheulean.
  • Neandertal taste

    Thu, 2011-03-24 00:13 -- John Hawks

    I didn't comment on this study when it came out in 2009, but as I'm reviewing some materials I thought it worth taking down a note. Carles Lalueza-Fox and colleagues [1] intensively sequenced the TAS2R38 gene in a Neandertal bone flake from El Sidrón cave. This is a gene in humans that enhances bitter taste perception, and is responsible for the classic taster/nontaster polymorphism for the substance PTC. Some 1.5 million years ago, a mutant version of this gene arose with less sensitivity to bitter substances; both the high-sensitivity and low-sensitivity versions exist in human populations today. People who carry two copies of the less active allele are often called "non-tasters", in contrast to "tasters" who have at least one copy of the more active allele.

    Part of the difficulty of studying Neandertal genetic information is the low coverage of the genome reads available to date. Some parts of the sequence are not covered at all, and only a small fraction of sites in the genome are covered multiple times. If we want to study polymorphisms in a single Neandertal individual, we are limited to those areas with high read coverage, and even then we shouldn't put much confidence in them.

    For TAS2R38, Lalueza-Fox and colleagues did much deeper sampling of a single relevant site by PCR amplification. They ended with thousands of reads of the site they targeted:

    A total of 4307 sequences were generated for the TAS2R38 gene F142-R166 fragment (figure 1). Of the total, 2391 (55.51%) showed a C in nucleotide position 145, corresponding to a proline amino acid (taster haplotype), and 1916 showed a G (44.49%), corresponding to an alanine amino acid (non-taster haplotype). Three clones show singleton C to T or G to A substitutions that are the most common form of postmortem DNA damage (Briggs et al. 2007). The main researcher involved in the laboratory analysis (C.L.-F.) is proline homozygous. All the Y-chromosome sequences identified (n = 141) showed the ancestral allele and, thus, no male contamination of European origin could be detected in this amplification.

    That's a pretty good argument in favor of this individual having the polymorphism in question. There is of course a high probability that all these reads actually come from a very small number of template molecules, so it's not as convincing as it might look. But it's a picture of the kind of work involved in confirming polymorphisms from ancient sequence data. We will probably be reasonably confident when we have read coverage up to 15-20x for most loci.
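    The read counts and the template caveat can both be put in numbers. The sketch below recomputes the allele fractions from the quoted counts, then asks a separate, simplified question: if the individual really were a heterozygote, how likely is it that a given number of starting template molecules would include both alleles? The 50/50 draw per template and the function names are assumptions for illustration; the true number of templates is not reported here.

```python
READS_C = 2391  # proline, taster haplotype (count quoted above)
READS_G = 1916  # alanine, non-taster haplotype (count quoted above)


def allele_fractions(c_reads, g_reads):
    """Fraction of reads supporting each allele."""
    total = c_reads + g_reads
    return c_reads / total, g_reads / total


def both_alleles_sampled(templates):
    """For a true heterozygote, chance that `templates` independent
    starting molecules include both alleles, ASSUMING each template
    is an even draw between the two."""
    if templates < 2:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** templates
```

    The point of the second function is that the effective sample is the number of templates, not the thousands of reads amplified from them.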

    Then we'll just have to worry about phasing. Maybe El Sidrón, with its related individuals, will turn out to be the perfect site for genomics.



  • Europe and China have different Neandertal genes

    Tue, 2011-03-22 01:00 -- John Hawks

    When last we saw the Vi 33.16 X chromosome, I was wresting out its secrets by looking for SNP haplotypes shared by this Neandertal with the European and African samples from the HapMap ("Neandertal segments of X chromosomes"). Neandertal haplotypes in the CEU (Utah, European ancestry) sample, that are not also found in African samples, are candidate loci for Neandertal ancestry outside Africa.

    In my earlier post, I pointed out some drawbacks and weaknesses of this simple approach. The SNPs have poorer power than sequence data, and we will miss relevant short haplotypes. Some Neandertal-derived alleles are probably present at low frequencies in Africa. Excluding rare African alleles will cause us to miss these cases. What we will find is a filtered set of Neandertal candidate loci, where we don't control the filter.
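    The filter can be sketched as a sliding-window comparison. This is a toy version under stated assumptions, not the procedure from the earlier post: chromosomes are represented as equal-length strings of alleles at the same ordered SNP positions, "N" marks a missing Neandertal call, and the window width is a placeholder parameter. All names are hypothetical.

```python
def candidate_windows(neandertal, europeans, africans, width=10):
    """Return (start, copies_in_europeans) for each SNP window where the
    Neandertal haplotype appears in the European sample but is absent
    from the African sample."""
    results = []
    for start in range(len(neandertal) - width + 1):
        target = neandertal[start:start + width]
        if "N" in target:
            continue  # skip windows with missing Neandertal calls
        eur = sum(1 for chrom in europeans
                  if chrom[start:start + width] == target)
        afr = sum(1 for chrom in africans
                  if chrom[start:start + width] == target)
        if eur > 0 and afr == 0:
            results.append((start, eur))
    return results
```

    Requiring zero African copies is exactly the uncontrolled filter described above: any Neandertal-derived haplotype that also exists at low frequency in Africa is dropped from the candidate list.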

    Finding these haplotypes lets us look at their frequencies within the European sample. As I pointed out, most of the Neandertal haplotypes in the CEU sample are rare, one or two copies. A handful are quite common, up to 30-40 copies in the sample. A good-sized set occurs in 5-10 copies.

    We know from Green and colleagues' comparisons that at least three people outside of Africa have the same fraction of Neandertal ancestry -- one from France, one from China, and one from Papua New Guinea. But there's no reason to think they have inherited the same segments from Neandertals. The overall proportion of Neandertal ancestry is very slight, less than five percent. If five percent of loci were 100 percent Neandertal, then everyone would have the same Neandertal loci. But that's not the way they are distributed. Different individuals certainly have different Neandertal genes.

    A rare allele in one sample is quite likely not to appear in geographically distant samples. So for many of the Neandertal haplotypes in the CEU sample, we shouldn't expect to see them in China. And, as you can tell from the figure below, that is in fact the case.

    Europe-China Neandertal X chromosome comparison

    What you're looking at is a 3-D histogram of Neandertal candidate haplotypes in China and Europe. The number of copies in the CEU HapMap sample is on the X axis, the number of copies in the CHB HapMap sample on the Z axis, going back into the picture. From the leftmost corner, at the origin, going along the X axis is the set of haplotypes present in CEU but absent in China. As you can see, the most frequent outcome is one copy in either one sample or the other. This being a histogram, those are both lumped into the highest bar at the origin.

    Here's a detail of the area near the origin, turned upward so we're looking at almost an X-Z plot.

    Europe-China Neandertal X chromosome comparison

    As we go down the X axis, you see there are many haplotypes with 3 or 4 copies in CEU and none in CHB. In fact, there are very few that have 3 copies in CEU and any in CHB -- many fewer altogether than occur in 3 copies in CEU and none at all in CHB. The ones that have 10 or so copies in both samples are, well, scarce.

    This is very striking. China and Europe by and large have different Neandertal-derived haplotypes. Haplotypes from Neandertals that are common in Europe -- say, with more than two or three copies -- are mostly rare in China. And vice-versa; haplotypes that are common in CHB are rare in CEU.

    Why should this be? Green and colleagues [1] hypothesized an early population mixture of Africans and Neandertals in West Asia, before that population dispersed throughout the rest of Eurasia. This hypothesis was meant to explain why China and Europe have the same proportion of Neandertal genes.

    I think that is also consistent with the fact that China and Europe have different Neandertal genes. If the population mixture was followed by substantial genetic drift as the West Asian population dispersed in different geographic directions, drift would randomly increase the frequency of some haplotypes in one direction, others in the other direction. Europe and China would end up with the same proportion of Neandertal ancestry, but it would be distributed very differently among loci.
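
    A toy illustration of that drift argument, not the analysis behind the figures: the sketch below starts a set of introgressed haplotypes at the same frequency in two daughter populations and lets each drift independently under a simple Wright-Fisher model. The population size, number of generations, starting frequency and number of loci are all made-up values chosen to keep the run fast.

```python
import random

def drift(freq, n_chromosomes, generations, rng):
    """Wright-Fisher drift: resample every chromosome copy each generation."""
    for _ in range(generations):
        copies = sum(rng.random() < freq for _ in range(n_chromosomes))
        freq = copies / n_chromosomes
    return freq

def simulate(n_loci=30, start_freq=0.03, n_chromosomes=200,
             generations=50, seed=1):
    """Return final haplotype frequencies in two independently drifting populations.

    All defaults are illustrative assumptions, not estimates from real data.
    """
    rng = random.Random(seed)
    west = [drift(start_freq, n_chromosomes, generations, rng) for _ in range(n_loci)]
    east = [drift(start_freq, n_chromosomes, generations, rng) for _ in range(n_loci)]
    return west, east

west, east = simulate()
# Count loci that survive in only one of the two populations.
private = sum((w > 0) != (e > 0) for w, e in zip(west, east))
print("mean frequency west:", sum(west) / len(west))
print("mean frequency east:", sum(east) / len(east))
print("loci present in only one population:", private)
```

    Both populations start from the same mixture proportion, but each locus follows its own trajectory in each population, so which haplotypes end up common depends on the population.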

    Next, we'll examine whether this pattern is the same for the rest of the chromosomes. Or maybe something even more interesting...


    References

  • Hawks to lecture at UNC Greensboro March 23, 7pm

    Sun, 2011-03-20 16:30 -- John Hawks

    I'll be appearing this Wednesday night at the University of North Carolina, Greensboro, to talk about Neandertal genetics. The lecture is in the Mead Auditorium, 101 Sullivan, at 7:00 pm.

    UNCG has a news announcement about the event:

    At 7 p.m. March 23, paleoanthropologist John Hawks from the University of Wisconsin-Madison will deliver a talk in the Sullivan Science Building entitled “Neandertime: Deciphering the Secrets of Ancient Genomes” about evidence for genetic evolution among humans during the past 30,000 years. He is well known for his public engagement, which includes his own blog; work with mainstream media such as NPR, Slate and the New York Times; and appearances on “Science Saturday.”

    This year’s lecture series will also feature a panel discussion 3-5 p.m. the same day in Elliott University Center Auditorium that will consider the scientific study of human origins, evolution and variability as well as their everyday applications – that is, how they affect people’s lives in terms of biomedical practice, conceptions of difference and risk, and social identity. Moderated by Cheryl Logan, a professor of history and psychology at UNCG, the panel will include Lee Baker of Duke University, Fatimah Jackson of UNC Chapel Hill and Alondra Nelson of Columbia University.

    I've got to tell you, the talk I'm giving about Neandertal genetics is the very best I've ever prepared. I don't say this kind of thing lightly, but if you're in the area and care about Neandertals, this is as good as it gets. We are discovering new stuff every day; the pace of discovery right now is running way ahead of the pace of publication.

    And this looks like it will be an engaging event. The lecture series has its own dedicated blog where students and faculty have been exploring topics related to genetics and human evolution. The home page for the Harriet Elliott lectures has more details.

  • Population structure within Africa: has "modern human origins" become a non sequitur?

    Tue, 2011-03-15 16:33 -- John Hawks

    When I wrote about the Denisova genome late last year, I claimed that "A large-scale reorganization of the science of human origins is upon us."

    I'm glad I had the sense to write that. A lot of people have pointed back to that quote over the last few months. Still, I know that the full implications of the Denisova and Neandertal genomes haven't really sunk in. "Large-scale reorganization" takes time.

    A new paper by Brenna Henn and colleagues in PNAS [1] shows how the shifting landscape has caught many geneticists off their footing. Submitted before the Denisova genome, but long after the Neandertal, the paper is titled, "Hunter-gatherer genomic diversity suggests a southern African origin for modern humans". In today's landscape, with only one instance of the word "Neanderthal" in the paper, the conclusions are obviously incomplete.

    The "southern African origins" conclusion of the paper comes out of a simple analysis that assumes that the best-fit maximum for genetic diversity (as assessed by linkage) is the most likely point of origin of the population. That would be true if the African population emerged by a series of founder effects from a single small ancestral population -- the "serial founder effect" model that I have criticized here before. But of course in 2011, we know that model is false, because it is predicated on a lack of ancient mixture with Neandertals or other populations. If the serial founder model can't work outside Africa, it certainly can't work inside Africa, where populations were larger and regionally diversified by the beginning of the Late Pleistocene. Without that false assumption, the "southern African origin" evaporates. The primary observation, a cline of linkage disequilibrium within sub-Saharan Africa, can be explained with reference to mixture of populations without assuming an origin and expansion from one geographic location.

    I don't want to criticize overmuch. Many ongoing research projects are casualties of our new knowledge of ancient genomics, and we'll see more papers like this before the fallout has settled. Simplistic founder models, acceptable only a year ago when these projects were conceived, are now unquestionably false. Ancient population mixture is the order of the day, and we don't have any simple, plug-in-the-data models to apply to data like these.

    Instead, I want to consider the power of the data in this article to answer some fundamental questions about African population history. Henn and colleagues report on SNP genotyping of several Bushman groups from southern Africa and Sandawe and Hadza people from eastern Africa. These data are on the 550k SNP platform that was used by 23andMe before the recent increase to 1M SNPs. That means the data are comparable to many other studies. They are not entirely comparable with other samples of African genetic variation, and the authors cut the total number of SNPs down to the 55,000 that overlap among all the genotyping platforms used in their analysis. For this reason, the paper presents a genome-wide set of 55,000 SNPs across many African populations.

    It's far from the perfect sample. I expect we'll be able to do much more with the full 550k dataset from the hunter-gatherer populations. The data have been made publicly available for download, and here we're already starting to investigate them.

    Within the current paper there is a very useful analysis of the broader dataset using the ADMIXTURE software. ADMIXTURE assumes that the current samples represent a mixture of ancient populations that were more distinct than today's. I went through this algorithm with my students in class Wednesday and Friday, which I'm sure was an intimidating process to most of them. The math is not too conceptually daunting; it's just hard to conceptualize how all the possible interactions relate to gene frequencies when you are assuming more than a few putative ancestral populations. Razib Khan gives an impressive step-by-step guide to performing an ADMIXTURE analysis, including some of these samples.

    I'm not in love with this analytical method -- there's no reality check on its assumptions. But its output can be informative about many aspects of population structure. Here are some first approximations:

    1. The genetic diversification of African populations was once much greater than today. Razib Khan points out the homogenizing effect that agricultural populations have had on the African continent, particularly during and after the Bantu expansion. I think the current data suggest that earlier processes involving LSA hunter-gatherers also tended to homogenize populations.

    For example, when eight initial clusters are assumed, the ADMIXTURE analysis constructs them in a way that most of the ancestors of today's Bushmen were in a population with a high degree of genetic divergence from the other seven ancestral populations. The FST between the Bushman ancestral population and others ranges from 0.1 (for forest pygmies) to a high of 0.25 (from Europeans). That estimate is nearly double the equivalent statistic in today's populations.

    Again, we don't have to believe the assumptions underlying the ADMIXTURE algorithm, but it does highlight the basic partitioning of diversity in the African population. Today there is high diversity within African population samples, and some of that diversity can be traced back to populations of 100,000 years ago or more. Some of the diversity that once existed among these populations has now been spread within them instead. The populations got genetically closer over time.

    A model of successive population expansions, bringing ancient populations genetically closer and closer together, is also what we may see in other places. As we have learned more about the mtDNA of ancient Europeans, it has become clear that successive expansions and migrations of people into Europe have radically reshaped the gene pool.

    2. Click languages have no genealogical unity. Over the years, many linguists and anthropologists have proposed that Hadza, Sandawe, and Bushmen are closely related to each other, despite their geographic distance, because they all speak languages that use click sounds. No historical linguist has ever successfully demonstrated a system of sound changes or detailed correspondences among these languages, but people promoting the hypothesis seem immune to these kinds of facts.

    The genetics show a very clear and ancient differentiation of these hunter-gatherer peoples. In the ADMIXTURE analysis, some of the largest genetic distances are among these peoples. By itself, that may not be surprising; these are the populations that have most evaded the homogenization that followed the spread of farming. The Hadza themselves are strikingly distinctive, and their genetics may reflect a history of small population size during the last several hundred years. The potential for genetic drift in this population was very high. Still, the genetic relations are just the opposite of what would be expected if speakers of these click languages had shared a common origin.

    Seems to me that this could have been the lede of the paper, if it had been written differently. A bit more exploration of the hunter-gatherer data (probably incorporating some haplotype-level analysis to give a better estimate of the ages of events) would demonstrate this point very well.

    3. By the time we find "modern" humans in West Asia, the African population had long since diversified into regional populations. This is not news; the mtDNA evidence has suggested for several years that southern Africa and the remainder of sub-Saharan Africa were already regionally differentiated before 120,000 years ago. There have also been hints of this diversification from whole-genome evidence (including the supplement of the Neandertal genome paper last year). Here we have a clear indication that the regionality extends to every African hunter-gatherer population.

    4. Hunter-gatherers have relatively little evidence for recent positive selection. The supplementary data of the current paper includes a short discussion of selection and a list of candidate loci in the hunter-gatherer samples. There is relatively little overlap in candidate regions for selection among these samples. Different genes have been selected in different populations, and not all that many of them. This is not surprising if the selection is relatively new -- the last 20,000 years or maybe more, given the distances and amount of historical population structure estimated for the data. It's also consistent with the demography of these populations. It will be interesting to check, but I would speculate that the signature of selection will on average appear older in these samples than in populations that have historically been agriculturalists.

    5. Where's the Aterian? North Africa is relatively depauperate in variation in the large combined dataset. That may stem mostly from Holocene events, including the spread of West Asian populations across North Africa. But the low variation there doesn't readily fit the idea that an out-of-Africa dispersal of genes came from a North African source. I don't think the observations in the paper (centered around linkage disequilibrium with a very low SNP count) are enough to settle anything about this question, but I'd be nervous if I were busy trying to make the Aterian seem important to the modern human origins issue.

    Bottom line

    As interesting as these assertions look, I don't think that a lot of African prehistory is about to be rewritten. Obviously, geneticists need to get serious about reading some African archaeology. We already know that African regional populations were large and diverse during the Middle Stone Age, and that's a very good fit to the kind of genetic diversity we are seeing in these samples.

    The barrier is Holocene population history. Agricultural populations grew, spread, mixed with and absorbed hunter-gatherers, and what we have left are the shattered remnants of ancient African population structure. Linkage may be the most powerful way we have to consider historical hypotheses using these SNP data, but if we're going to rely on it we have to control for recent demography and selection.

    And of course, it will be interesting to see a model that can integrate both Neandertal-African and within-African population histories. I don't really have a bang-up finish for this post, because there is immediately more work to be done with these data.


    References

  • The real "junk" DNA

    Wed, 2011-03-09 22:47 -- John Hawks

    Let me be honest: when I started doing paleoanthropology, I really did not expect I'd be talking about Neandertal penises.

    And yet, here I am. Cory McLean and colleagues [1] combine a straightforward genomic analysis of human-specific deletions with a couple of transgenic mice, and take us straight to penis spines.

    You see, most primates, and indeed many mammals, have at least some spines on their penises. "Spine" means more or less what you would expect: little projections that are covered in hard material, generally keratin, curving toward the base of the penis. These spines are sometimes called "horny papillae."

    No, I cannot make this stuff up.

    The morphology of these spines varies among primates. They overlie sensory receptors, and they intensify or enhance sensations accompanying intromission of the penis. Like a KY commercial, except they don't enhance sensations for the female. The net effect in some species is to reduce how long it takes the male to ejaculate. For example, a 1991 paper [2] by A. F. Dixson...

    No, I cannot make this stuff up.

    ...removed the penile spines of several male marmosets, finding that they took twice as long to achieve penile intromission after starting pelvic thrusts. Of course, "twice as long" in marmosets only means 15 seconds. The spineless males took 2 seconds to ejaculate, compared to only 1.73 seconds for those who had a "sham surgery" -- that is, they got the same depilatory spine-removal procedure without the active ingredient. That's some evidence in favor of the idea that losing penile spines might be related to longer coital duration.

    But penile spines don't always mean fast sex. Galagos have penises covered in long hook-like spines, which they use in virtual sex marathon sessions lasting two hours or more. Prosimians tend to have much more elaborated spines; in contrast, chimpanzees' spicules are comparatively minor -- in a broad comparison across primates, Harcourt and Gardiner [3] rated chimpanzees along with humans as having insignificant penile spinosity.

    Let me just say that the comparative data don't convince me of an adaptive model for loss of penile spines in humans. Evidence from mutilated monkeys is not all that persuasive. I mean, really, how fast do you think you would manage after the "operation"? More important, the differences among hominoids run against the hypothesis -- gibbons have the spiniest penises among the apes, despite their monogamous, pair-bonded social habits.

    And I'll pause to savor the surreality: I'm here making value judgments about genital cacti.

    One thing that is definitely well-known about these penile spines is that their development depends on testosterone. Castrated monkeys do not develop the characteristic spines, and they lose them if already present. The androgen receptor (AR) locus is surrounded by promoter/enhancer sequences that are tissue-specific, capable of being flipped on or off as development proceeds within different parts of the body.

    Within this system, the genetics in humans and chimpanzees are simple: A long (60 kilobase) deletion of DNA in the human lineage has knocked out a 5 kb conserved region that enhances AR. That enhancer is specific to the follicles around the developing facial whiskers (vibrissae) and in the skin layers of the penis. This specificity was discovered in transgenic mice, in which a reporter gene is inserted with the enhancer, and embryos display expression of the reporter wherever the enhancer is active. Very straightforward, very cool science.

    One more thing: The chimpanzee version can drive expression when implanted into transgenic human foreskin fibroblasts. That indicates that the overall genetic system to make penile spines is still there lurking in our genomes. If we could turn on the gene at the right time, replacing the function of the enhancer, we could still grow penile spines.

    Just saying -- there may be a market there. Maybe the "male enhancement" companies will hit that next. I can only imagine what the wrapper on the NASCAR circuit will look like. OK, I know, don't encourage them. It's bad enough that we have labs full of foreskin tissue with chimpanzee genes floating around.

    I couldn't make this stuff up if I tried.

    Finding the deletion was straightforward genomics: They scraped the human genome for parts missing from chimpanzees and macaques, and then extracted from that set all deletions that included sequence conserved in other mammals. Others have done similar comparisons for conservation and human-specific changes; this is a clever twist on the same problem. It does fit an ongoing theme -- many essential aspects of humans may involve the loss of genes or functionality from our ape ancestors.

    Ok, so where do Neandertals fit in? They have the sequence deletion just like the rest of us do. If that deletion rules out chimpanzee-like spiky penises, then Neandertals could glide like the rest of us.

    All in all, it's a nice short paper, and very straightforward. The only questionable part to me is the social model. The genetics and expression data are solid.

    Speaking of Neandertals and the androgen receptor (AR) locus, my genome appears to have a Neandertal-derived haplotype across that gene. I'll expose this fact at greater length later, but I thought it worth sharing that the current paper is not the end of the story. Neandertals may not have had penis spines, but some functional polymorphisms in testosterone response might still have come into our population from them or other ancient people.

    UPDATE (2011-03-11): Eric Michael Johnson gives us the real dirt on this story ("Penis spines, pearly papules and Pope Benedict's balls"). He points out the relatively small extent of these features of the chimpanzee penis compared to other primates, and adds detail about the lack of association between their presence and sexual system in hominoids.

    He also reveals a shocking fact: a fairly large fraction of men still have the chimpanzee-like pearly papules.

    Scicurious also takes on the topic ("Friday Weird Science: Penis Spines, what are they REALLY?"), reviewing the original Osman Hill study of chimpanzee penis morphology. I think the Nature paper is very misleading in its use of galago illustrations for these spines; the chimpanzee version is comparatively minor.


    References

  • Neandertal segments of X chromosomes

    Wed, 2011-02-23 16:06 -- John Hawks

    Last year, this Neandertal genome came out. No doubt you've heard about it. So maybe by now you're wondering where the new science is that's being done on this genetic information.

    We've been ramping up here in my lab for a few months, working with these data. My students have a couple of projects that we'll be keeping close to the vest. But for the most part I think we'll share stuff as we go along. This is all open access data, and there are some questions of fundamental interest that are actually pretty easy to resolve.

    The initial Neandertal genome draft publication [1] came with some analysis of the genome-wide similarity of the Neandertal draft genome and a few human genomes. A new review of the basic method of comparison has appeared in Molecular Biology and Evolution, by Eric Durand and colleagues [2]. The basic idea is that a branching model between populations without gene flow predicts that two members of one population have equal amounts of sequence similarity to a third individual in another population. If that third individual turns out to be closer to one or the other of the first two, we can reject the hypothesis that those first two are part of a population that has branched without gene flow away from the third individual's population. When we bring an African and a European as the first two individuals, and a Neandertal as the third, we find that the European is in fact closer to the Neandertal. So we can infer gene flow from Neandertals into the ancestors of Europeans. This comparison is nearly equally significant when we compare an African and a Chinese individual, or an African and an individual from Papua New Guinea. Thus we can infer that Neandertals contributed genes to the ancestors generally of present-day non-Africans, not specifically present-day Europeans. The amount of gene flow that can explain the pattern of genetic similarities adds up to around 2.5% of the total ancestry of non-Africans today. Again, it's not a direct observation; it's a model that explains the greater similarity of the Neandertal genome to people outside Africa than within sub-Saharan Africa.
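
    The three-way comparison described above is usually summarized as an "ABBA-BABA" count. Here is a minimal sketch of that counting, assuming each site has already been polarized against an outgroup so that 0 means the ancestral allele and 1 the derived allele. The function name, the sign convention (positive when the second genome shares more derived alleles with the third) and the toy site list are illustrative assumptions, not the pipeline or data used by Green and colleagues or by Durand and colleagues.

```python
def d_statistic(sites):
    """ABBA-BABA style statistic for three genomes.

    sites: iterable of (h1, h2, h3) alleles, with 0 = ancestral, 1 = derived.
    In the example from the text, h1 would be an African genome,
    h2 a European genome and h3 the Neandertal.
    """
    abba = baba = 0
    for h1, h2, h3 in sites:
        if h3 != 1:
            continue  # only sites where the third genome is derived are informative
        if h1 == 0 and h2 == 1:
            abba += 1  # h2 shares the derived allele with h3
        elif h1 == 1 and h2 == 0:
            baba += 1  # h1 shares the derived allele with h3
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)

# Made-up toy data: 6 ABBA sites, 4 BABA sites, and some uninformative sites.
toy_sites = ([(0, 1, 1)] * 6 + [(1, 0, 1)] * 4 +
             [(1, 1, 1)] * 3 + [(0, 0, 1)] * 2 + [(0, 1, 0)] * 5)
print(d_statistic(toy_sites))  # (6 - 4) / (6 + 4) = 0.2
```

    Under a branching model without gene flow the two counts should be equal on average, so the statistic should hover around zero; a consistent excess in one direction is what motivates the inference of gene flow. A real analysis also needs a significance test across the genome, which this sketch leaves out.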

    As you can see, this leaves open a key question. We don't know whether genetic similarities between Neandertals and present non-Africans are the same in different areas outside Africa.

    The whole-genome comparisons have great statistical power to test the hypothesis of gene flow in general. With a hundred thousand or so actual sequence differences between Neandertals and any given human genome, the method can potentially detect very small amounts of gene flow. What we're seeing in the Neandertal data is anything but small -- it amounts to greater non-African similarity to Neandertals at thousands and thousands of sites.

    But comparison of three whole genomes gives us very little power to identify the specific loci affected by gene flow. If a French genome has three percent ancestry from Neandertals, we can predict that other genomes in France probably do also. That's a consequence of independent assortment -- we're not looking at people who actually have Neandertal grandparents, we're looking at a population that had Neandertal ancestors thousands of generations ago. So all French genomes are probably more-or-less alike in the Neandertal quotient. But will they have the same three percent of Neandertal-derived alleles? Almost certainly not: each Neandertal-derived locus would have to be fixed in France for them to be identical in all genomes. Much more likely, a much larger number of Neandertal-derived alleles exist at an average frequency of three percent. Such a distribution would predict that the average Neandertal-derived variant found in our first French genome has only a 3 percent probability of showing up in a second genome. Looking at one genome in one population will find only a small fraction of loci that have been affected by Neandertal gene flow.

    Hence, if we want to answer the question about different populations, we need to look at a reasonably large sample of individuals. We need to know whether a Neandertal-derived variant in France occurs at the same frequency in China, and vice-versa. Are there loci where a Neandertal allele occurs at 10 percent in France, but never in China? Does a full list of loci with one or more Neandertal-derived variants include any interesting functional genes? Answering these questions would tell us a lot about the demographic and adaptive conditions that led to our Neandertal heritage.

    Enter the HapMap

    You'd think that a genome-wide set of SNP genotypes would be useful for testing hypotheses of population history. The HapMap has more than 3 million SNP genotypes from hundreds of individuals from China, Japan, Utah, and Nigeria, and more than a million genotypes from nearly a thousand other individuals from other populations. In other words, it's the kind of sample that could tell us a lot about the frequencies of Neandertal-derived alleles if we could find them.

    But the HapMap project didn't identify its set of genotypes to help us reconstruct population history. The aim was to find most common variants, and secondarily to add more variants in low-variation regions to allow linkage mapping of medically interesting phenotypes. SNP sites were disproportionately found in some populations (first, Europeans) more than others. These processes of SNP discovery led to ascertainment biases, in which the difference between samples depends not only on their histories, but also on where we chose to look.

    Ascertainment bias is a real pain if we want to test the hypothesis of Neandertal genetic contribution to today's humans. Look at it this way: Suppose we find a rare SNP allele in Europeans, absent in Africans, but present in the Neandertal genome. Looks like a piece of support for Neandertal ancestry of Europeans. If those sites outnumber the sites where we find a rare allele in Africans shared with Neandertals, not in Europeans, then that would seem like the same scenario outlined above -- a case where one of the living populations carries more Neandertal similarities than the other. Evidence of gene flow, right?

    Ascertainment bias leaves another possibility: Maybe we looked harder for rare variants in one of the living populations. If so, the lack of rare Neandertal-shared variants in the other population may be an accident of our SNP discovery procedure.

    There are ways around this problem. For instance, if the Neandertal genome carries many derived alleles for SNPs shared with Europeans, it weighs strongly in favor of recent genetic exchanges instead of ancient incomplete lineage sorting. But this basic question of "which population has more Neandertal ancestry" may still be hard to resolve.

    Haplotypes from Neandertals

    Green and colleagues [1] also presented a second approach for testing Neandertal ancestry. They used SNP data to identify regions of the genome where non-African populations appear to have a "deep root" to their genealogy, but Africans do not. These regions are rare across the genome; they focused on 100-kb intervals, finding only a dozen genome-wide that fit their criteria. But each of these is a case where non-Africans appear to have an ancient genealogical split between two haplotypes, all the SNPs lining up to distinguish one branch of the genealogy from another. If both are not represented in Africa, then presumably one of them came from some non-African ancient population. And indeed, they found ten of the deep branches within the Neandertal genome.

    This approach makes use of the information that SNP data provide about linkage. A segment of a chromosome from a living human that is similar to a Neandertal segment may be explained either by recent ancestry from Neandertals or from incomplete lineage sorting from the ancient human-Neandertal common ancestors. But if that segment is long, it probably isn't from the ancient common ancestors of humans and Neandertals, because recombination should have broken up the linkage across that long interval. Hence, long haplotypes shared by living humans and Neandertals are best explained by recent mixture. If those long haplotypes are predominantly found in non-Africans but not Africans, it tends to confirm that they have come from recent population mixture with Neandertals.

    But how long should these intervals be? This is an area where we can improve on the approach taken by Green and colleagues [1]. A hundred kilobases is way too long to represent the average Neandertal-derived haplotype. The average rate of recombination across the genome is around one centimorgan per megabase -- meaning that an interval of one million base pairs has a one percent chance of recombination per generation. That's a 1/1000 chance of recombination per 100 kb per generation, meaning that half the linkage across 100 kb should be broken up in roughly 700 generations. For humans, half the linkage at that distance decays after only 18,000 years or so, except in regions of low recombination. If we go as far back as 100,000 years ago, half of the linkage decays across regions as short as 18 kilobases. That means if we look at windows 20 kb long for evidence of Neandertal-derived haplotypes, we are likely to miss a large fraction of them. Hundred-kilobase intervals will miss nearly all of them.
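
    The arithmetic in that paragraph can be checked with a few lines. This is a sketch under two assumptions that are not stated in the text: a generation time of 25 years, and exponential decay of linkage, so that the fraction of intact segments after t generations is roughly exp(-r * t), where r is the per-generation recombination probability across the interval.

```python
import math

CM_PER_MB = 1.0            # genome-wide average quoted in the text
YEARS_PER_GENERATION = 25  # assumption, not stated in the text

def recomb_prob_per_generation(length_bp):
    """Per-generation recombination probability across an interval."""
    return (length_bp / 1e6) * CM_PER_MB / 100.0

def half_life_generations(length_bp):
    """Generations until half of the linkage across the interval has decayed."""
    return math.log(2) / recomb_prob_per_generation(length_bp)

def half_decay_length_bp(generations):
    """Interval length at which half of the linkage has decayed after a given time."""
    r = math.log(2) / generations
    return r * 100.0 / CM_PER_MB * 1e6

gens = half_life_generations(100_000)
print(f"100 kb half-life: {gens:.0f} generations, "
      f"{gens * YEARS_PER_GENERATION:.0f} years")
print(f"half-decay length after 100,000 years: "
      f"{half_decay_length_bp(100_000 / YEARS_PER_GENERATION):.0f} bp")
```

    With these assumptions the 100 kb half-life comes out just under 700 generations, and the half-decay length after 100,000 years is a bit over 17 kb, in line with the rounded figures above.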

    Bottom line, we want to look at intervals as short as we can. But if we look too short, we won't have much evidence to work with. The 3 million SNPs in the HapMap version 2 give us one site every kilobase on average. Ten kilobases will give us around 10 SNPs. A 10-SNP haplotype may sound impressive, but if most of those SNPs have a derived allele at low frequency (say, less than 10 percent), then it starts to become more likely that a given haplotype resembles the Neandertal genome just because they share ancestral SNP alleles. Ideally we'd like more SNPs, but in reality the Neandertal sequence draft is likely to lack several, so if we want 10 SNPs worth of comparison, we'll need to look at longer intervals.

    And really, HapMap 2 is a small sample in which to try to find low-frequency haplotypes from Neandertals. By analogy with the method used by Green and colleagues, we can find haplotypes that are present in the CEU (European ancestry) sample, present in the Neandertal genome draft, but absent in the YRI (West African ancestry) sample. But HapMap 2 includes only 120 genomes from each of the YRI and CEU samples. If we have a variant in Europe at 1 percent, we're pretty likely to miss it. Worse, if we find a haplotype in Europe at 1 percent, we're really not able to reject the hypothesis that it's in Africa at the same frequency, even if no copies of it are in YRI. We can help fix this problem by looking at HapMap phase 3 samples, which include two more African populations, bringing the total sample up to more than 300 within Africa. But there are fewer SNPs in HapMap 3, limiting our comparisons to longer windows. One could even contemplate the HGDP sample as a way to add even more individuals to our comparative samples. But that sample has many fewer SNPs, so we would need really long intervals to test the hypothesis of Neandertal ancestry for particular haplotypes.
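
    The sampling problem can be sketched with simple binomial reasoning. The code below treats the sample sizes in the text as counts of independently sampled chromosomes, which is an assumption, and the function names are hypothetical.

```python
def prob_unsampled(freq, n_chromosomes):
    """Chance that a variant at frequency `freq` appears in none of the sampled chromosomes."""
    return (1.0 - freq) ** n_chromosomes


def max_freq_consistent_with_zero(n_chromosomes, alpha=0.05):
    """Largest frequency not rejected at level `alpha` when zero copies are observed."""
    return 1.0 - alpha ** (1.0 / n_chromosomes)


for n in (120, 300):
    print(n, prob_unsampled(0.01, n), max_freq_consistent_with_zero(n))
```

    With 120 chromosomes, a frequency of 1 percent remains compatible with seeing no copies at all; at around 300 the upper bound drops to just under 1 percent, which is the sense in which the larger African sample helps.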

    By the end of this I'll surely be pining for sequence data. Of course for that we haven't long to wait. But I have an aim for which genotype data are at the moment the only feasible approach. So I'm a bit stuck: Using a bigger sample means using longer intervals, which means I'll miss more and more Neandertal-derived haplotypes. But we should thereby get reasonable power to find any common haplotypes derived from Neandertals.

    Phasing and the haploid Neandertal

    The HapMap 2 samples and some of the HapMap 3 samples were taken from pairs of parents, where a child was also genotyped. Those trios make it possible to determine which SNP alleles were linked on the parents' chromosomes, providing a natural "phase" for the haplotypes. For some other samples, the phase was inferred algorithmically, using assumptions about population history and knowledge about which haplotypes are present in the populations with trios. Phasing algorithms are not ideal, because the assumptions about population history (inferred in many cases from the data) may be false. But over the relatively short intervals we're considering here, phasing will probably not lead to false positives.

    Neandertal draft genomes are themselves more of a problem. Each sampled individual is known from a large number of short reads, which (with some luck) can be aligned with the human genome map. The present data include many gaps. More important, there are only a very small number of places where the number of reads is high enough to determine whether a Neandertal individual was a homozygote or not. The Neandertal consensus sequences are built by taking the most frequent base from these reads aligned to any given site in the human genome. That means that the Neandertal "haplotype" across any set of SNP loci may well be a jumbled chimera of two different haplotypes carried by the Neandertal individual. For the current analyses, I have kept the Neandertal individuals separate -- so the haplotypes here were derived only from the Vindija 33.16 individual. If we use a consensus sequence taken from multiple individuals, we will have fewer gaps but potentially more jumble of different haplotypes.
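
    A toy example may make the chimera problem concrete. This is only a sketch of majority-rule consensus calling on made-up reads, not the pipeline used to build the actual Neandertal consensus; the haplotypes and reads are hypothetical.

```python
from collections import Counter


def majority_consensus(reads_per_site):
    """Most frequent base at each site; 'N' where no read covers the site.

    Ties go to the base seen first among the reads at that site.
    """
    consensus = []
    for reads in reads_per_site:
        if not reads:
            consensus.append("N")
        else:
            consensus.append(Counter(reads).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical heterozygous individual carrying haplotypes "ACG" and "GTA".
hap_a, hap_b = "ACG", "GTA"
reads_per_site = [
    ["A", "A", "G"],  # site 1: two reads from hap_a, one from hap_b
    ["T", "T", "C"],  # site 2: two reads from hap_b, one from hap_a
    [],               # site 3: no coverage
]
consensus = majority_consensus(reads_per_site)
print(consensus)  # prints "ATN"
```

    Over the two covered sites the consensus "AT" matches neither "AC" nor "GT", so an exact-match search against phased human haplotypes would miss both of the individual's true haplotypes.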

    There's not much to be done about this problem. It should mostly cause us to miss true instances of Neandertal genetic ancestry, and we may be able to quantify the extent of this error in some high-coverage areas.

    (UPDATE 2011-02-24): I should mention, my lab has found that the Neandertal consensus sequences themselves have issues; the consensus-building algorithm appears in many cases to have included the human reference genome SNP allele in the place of the allele found in the majority of Neandertal reads. We are not yet sure how extensive this phenomenon is across the genome, but we have found it recurrently. We hypothesize that this is because of the priors on accepting calls with low read quality; the reference sequence seems to heavily bias the algorithm even in the presence of multiple contrary reads. We will have to check SNP calls manually in candidate regions.

    OK, so let's find the Neandertal regions!

    The strategy is fairly clear. I'll take a 10-SNP window from the HapMap, determine the haplotype of the Vindija 33.16 genome, see if that Neandertal haplotype occurs in the CEU HapMap sample, and then see if it also occurs in the YRI, MKK and LWK samples. When I find a haplotype shared with the Neandertal in Europe but not in Africa, I'll take that as a candidate haplotype for Neandertal ancestry.

    I probably want to be a little more permissive than that, actually. A Neandertal haplotype that is present in Europe, and present but rare in Africa may still be a good candidate. A Neandertal haplotype that does not match at all SNPs may also nonetheless be a good candidate, considering that the consensus is often merging two true haplotypes together. There's not much I can do about the consensus problem, because I don't have any way of figuring out the missing information except in rare cases with multiple sequence reads. But to address the first problem I can relax my criteria a bit with respect to variation inside Africa.
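
    The scan described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the code actually used for the analysis: haplotypes are encoded as strings with one character per SNP ("0" ancestral, "1" derived, "N" for a missing Neandertal call), and the function names and the `max_african` parameter are hypothetical.

```python
def count_copies(haplotypes, start, window, target):
    """Number of phased haplotypes identical to `target` over the window."""
    return sum(h[start:start + window] == target for h in haplotypes)


def candidate_windows(neandertal, ceu, african, window=10, max_african=0):
    """Yield (start, ceu_copies, african_copies) for candidate windows.

    A window qualifies when the Neandertal haplotype occurs at least once
    in CEU and at most `max_african` times in the African samples.
    Windows containing a missing Neandertal call ('N') are skipped.
    """
    for start in range(len(neandertal) - window + 1):
        target = neandertal[start:start + window]
        if "N" in target:
            continue
        ceu_copies = count_copies(ceu, start, window, target)
        afr_copies = count_copies(african, start, window, target)
        if ceu_copies > 0 and afr_copies <= max_african:
            yield start, ceu_copies, afr_copies


# Toy data with a 3-SNP window so the example stays readable.
neandertal = "0110N1"
ceu = ["011001", "000001", "011011"]
african = ["000001", "011101"]

strict = list(candidate_windows(neandertal, ceu, african, window=3))
relaxed = list(candidate_windows(neandertal, ceu, african, window=3, max_african=1))
print(strict)   # prints [(1, 2, 0)]
print(relaxed)  # prints [(0, 2, 1), (1, 2, 0)]
```

    Raising `max_african` is one simple way to express the more permissive criterion; allowing a mismatch or two against the Neandertal consensus would need a different comparison than exact string equality.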

    Sliding the window down the chromosome will allow me to find the length of Neandertal-identical haplotypes in each individual, which could lead to an estimate of linkage decay. Across the genome, this will yield an estimate of the time that population mixture with Neandertals took place.
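
    One way to turn matching windows into tract lengths is sketched below. The merging step follows directly from the sliding-window idea; the dating step is a much cruder assumption, namely that tract lengths are exponentially distributed with a mean of 1/t Morgans after t generations, and it ignores the fact that short tracts go undetected. All names and the toy SNP positions are hypothetical.

```python
def merge_windows(starts, window):
    """Merge overlapping or adjacent windows into tracts.

    `starts` are SNP indices of matching windows in one haplotype.
    Returns inclusive (first_snp, last_snp) index pairs.
    """
    tracts = []
    for s in sorted(starts):
        end = s + window - 1
        if tracts and s <= tracts[-1][1] + 1:
            tracts[-1][1] = max(tracts[-1][1], end)
        else:
            tracts.append([s, end])
    return [tuple(t) for t in tracts]


def tract_lengths_bp(tracts, positions):
    """Physical length of each tract from the base-pair positions of its end SNPs."""
    return [positions[last] - positions[first] for first, last in tracts]


def generations_since_mixture(lengths_bp, rate_per_bp=1e-8):
    """Crude date: assumes a mean tract length of 1/t Morgans (an assumption)."""
    mean_morgans = sum(lengths_bp) / len(lengths_bp) * rate_per_bp
    return 1.0 / mean_morgans


positions = [1000, 2500, 4000, 9000, 12000, 20000, 30000, 41000, 45000, 60000]
tracts = merge_windows([0, 1, 2, 7], window=3)
lengths = tract_lengths_bp(tracts, positions)
print(tracts)   # prints [(0, 4), (7, 9)]
print(lengths)  # prints [11000, 19000]
print(generations_since_mixture(lengths))
```

    Because windows of a fixed SNP count cannot detect tracts shorter than the window, the mean observed length is biased upward and a naive estimate like this one will tend to come out too recent.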

    Several other observations should lend some confidence in particular candidate haplotypes. The more a candidate includes derived alleles that are not themselves common in Africa, the more convincing it will be. If it does represent a "deep root" -- that is, if no close relative of the Neandertal haplotype occurs in the African sample -- that also helps. The region with Neandertal identity shouldn't be too long. A candidate might be quite common -- a few Neandertal-derived alleles may have been positively selected in later populations. But most of them are likely to be rare -- so I should expect to see many of them in only one or two copies in the CEU sample.

    I'm obviously interested in whether different populations (for example, Europe and China) have the same Neandertal-derived haplotypes. I'll leave that off for now -- there's much too much in this post already.

    So to be clear, this procedure will find haplotypes that are likely to have come into non-African populations from Neandertals. No single test will confirm these; but a combination of factors may be compelling for individual haplotypes. We can identify which genes may be in or near an interval where a candidate haplotype is found, but in all likelihood we will not have any known functional polymorphisms in the SNP data. This procedure then will provide no evidence that a particular Neandertal-derived allele has any functional effect in any living people.

    Some results

    I'll be reporting an awful lot more about results over the next few days. My first series of comparisons was the X chromosome, for reasons that will become clear shortly. On the X, there are 396 intervals where a 10-SNP Neandertal haplotype is identical to one or more CEU phased haplotypes and to two or fewer haplotypes in the African HapMap samples.

    They vary in frequency in more or less the expected way -- a few of them are relatively common (10 or more copies out of the CEU sample, for example), but most have only one or two copies in CEU.

    These vary substantially in length, mostly because some areas have very low Neandertal coverage. A few are more than 100 kb in length; most are 30 kb or less.

    The haplotype with the strongest signature -- a 100-kb interval encompassing 26 SNPs in the Vindija 33.16 genome -- is found in more than 15 (and centrally, in 22) CEU individuals and in no African individuals. The interval spans part of the DMD gene (associated with Duchenne's muscular dystrophy). Conveniently, this is precisely the interval identified by Yotova and colleagues [3] as a site with Neandertal-derived alleles in non-African populations. They used comparisons at the sequence level, finding the Neandertal-derived variant at a frequency of 9% overall outside Africa. I have not yet confirmed that the SNP haplotype corresponds to this Neandertal-derived allele at the sequence level, but we should be able to manage that using public genomes. It's a nice confirmation that we're looking at the right kind of candidate loci.


    References

    Synopsis: 
    My research is outlining regions of human genomes that were derived from Neandertals. Here are some of the methods.
