Last year, this Neandertal genome came out. No doubt you've heard about it. So maybe by now you're wondering where the new science is that's being done on this genetic information.
We've been ramping up here in my lab for a few months, working with these data. My students have a couple of projects that we'll be keeping close to the vest. But for the most part I think we'll share stuff as we go along. This is all open access data, and there are some questions of fundamental interest that are actually pretty easy to resolve.
The initial Neandertal genome draft publication came with some analysis of the genome-wide similarity of the Neandertal draft genome and a few human genomes. A new review of the basic method of comparison has appeared in Molecular Biology and Evolution, by Eric Durand and colleagues. The basic idea is that a branching model between populations without gene flow predicts that two members of one population have equal amounts of sequence similarity to a third individual in another population. If that third individual turns out to be closer to one or the other of the first two, we can reject the hypothesis that those first two are part of a population that has branched without gene flow away from the third individual's population. When we bring an African and a European as the first two individuals, and a Neandertal as the third, we find that the European is in fact closer to the Neandertal. So we can infer gene flow from Neandertals into the ancestors of Europeans. This comparison is nearly equally significant when we compare an African and a Chinese individual, or an African and an individual from Papua New Guinea. Thus we can infer that Neandertals contributed genes to the ancestors generally of present-day non-Africans, not specifically present-day Europeans. The amount of gene flow that can explain the pattern of genetic similarities adds up to around 2.5% of the total ancestry of non-Africans today. Again, it's not a direct observation; it's a model that explains the greater similarity of the Neandertal genome to people outside Africa than within sub-Saharan Africa.
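To make the three-way comparison concrete, here is a minimal sketch of the site-pattern count behind it, often called the D statistic or ABBA-BABA test. It assumes each site has already been coded as ancestral (0) or derived (1) for one African, one European and one Neandertal sequence; the function name and the toy data are illustrative assumptions, not anything taken from the published analysis.

```python
from typing import Iterable, Tuple


def d_statistic(sites: Iterable[Tuple[int, int, int]]) -> float:
    """Return D = (ABBA - BABA) / (ABBA + BABA).

    Each site is (african, european, neandertal), coded 0 = ancestral,
    1 = derived. The coding step itself is assumed to be done already.
    """
    abba = 0  # European shares the derived allele with the Neandertal
    baba = 0  # African shares the derived allele with the Neandertal
    for afr, eur, nea in sites:
        if nea != 1:
            continue  # only sites where the Neandertal is derived are informative
        if afr == 0 and eur == 1:
            abba += 1
        elif afr == 1 and eur == 0:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Toy data (made up): three ABBA sites, one BABA site, three uninformative sites.
toy_sites = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 1),
             (1, 1, 1), (0, 1, 0), (0, 1, 1)]
print(d_statistic(toy_sites))
```

Without gene flow the two counts should be about equal, so D should sit near zero; a consistently positive value in this orientation is the "European is closer to the Neandertal" pattern. A real test also needs a standard error, which this sketch leaves out.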
As you can see, this leaves open a key question. We don't know whether genetic similarities between Neandertals and present non-Africans are the same in different areas outside Africa.
The whole-genome comparisons have great statistical power to test the hypothesis of gene flow in general. With a hundred thousand or so actual sequence differences between Neandertals and any given human genome, the method can potentially detect very small amounts of gene flow. What we're seeing in the Neandertal data is anything but small -- it amounts to greater non-African similarity to Neandertals at thousands and thousands of sites.
But comparison of three whole genomes gives us very little power to identify the specific loci affected by gene flow. If a French genome has three percent ancestry from Neandertals, we can predict that other genomes in France probably do also. That's a consequence of independent assortment -- we're not looking at people who actually have Neandertal grandparents, we're looking at a population that had Neandertal ancestors thousands of generations ago. So all French genomes are probably more-or-less alike in the Neandertal quotient. But will they have the same three percent of Neandertal-derived alleles? Almost certainly not: each Neandertal-derived locus would have to be fixed in France for them to be identical in all genomes. Much more likely, a much larger number of Neandertal-derived alleles exist at an average frequency of three percent. Such a distribution would predict that the average Neandertal-derived variant found in our first French genome has only a 3 percent probability of showing up in a second genome. Looking at one genome in one population will find only a small fraction of loci that have been affected by Neandertal gene flow.
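The argument about sharing between two genomes is easy to check with a toy simulation. This sketch assumes every Neandertal-derived allele sits at exactly 3 percent frequency and that loci are independent, which is a simplification of "an average frequency of three percent"; the function name, locus count and seed are arbitrary choices for illustration.

```python
import random


def simulate_sharing(n_loci: int = 200_000, freq: float = 0.03, seed: int = 1):
    """Draw two haploid genomes and compare their Neandertal-derived alleles.

    Returns (fraction of loci where the first genome carries the allele,
             fraction of those alleles that the second genome also carries).
    """
    rng = random.Random(seed)
    carried_by_first = 0
    shared = 0
    for _ in range(n_loci):
        first = rng.random() < freq   # first genome carries the allele here
        second = rng.random() < freq  # independent draw for the second genome
        if first:
            carried_by_first += 1
            if second:
                shared += 1
    return carried_by_first / n_loci, shared / carried_by_first


carried, shared_fraction = simulate_sharing()
print(f"first genome carries the allele at {carried:.3f} of loci")
print(f"second genome shares {shared_fraction:.3f} of those")
```

Both numbers come out near the allele frequency: each genome holds about the same quotient of Neandertal-derived alleles, but the two sets barely overlap.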
Hence, if we want to answer the question about different populations, we need to look at a reasonably large sample of individuals. We need to know whether a Neandertal-derived variant in France occurs at the same frequency in China, and vice-versa. Are there loci where a Neandertal allele occurs at 10 percent in France, but never in China? Does a full list of loci with one or more Neandertal-derived variants include any interesting functional genes? Answering these questions would tell us a lot about the demographic and adaptive conditions that led to our Neandertal heritage.
Enter the HapMap
You'd think that a genome-wide set of SNP genotypes would be useful for testing hypotheses of population history. The HapMap has more than 3 million SNP genotypes from hundreds of individuals from China, Japan, Utah, and Nigeria, and more than a million genotypes from nearly a thousand other individuals from other populations. In other words, it's the kind of sample that could tell us a lot about the frequencies of Neandertal-derived alleles if we could find them.
But the HapMap project didn't identify its set of genotypes to help us reconstruct population history. The aim was to find most common variants, and secondarily to add more variants in low-variation regions to allow linkage mapping of medically interesting phenotypes. SNP sites were disproportionately found in some populations (first, Europeans) more than others. These processes of SNP discovery led to ascertainment biases, in which the difference between samples depends not only on their histories, but also on where we chose to look.
Ascertainment bias is a real pain if we want to test the hypothesis of Neandertal genetic contribution to today's humans. Look at it this way: Suppose we find a rare SNP allele in Europeans, absent in Africans, but present in the Neandertal genome. Looks like a piece of support for Neandertal ancestry of Europeans. If those sites outnumber the sites where we find a rare allele in Africans shared with Neandertals, not in Europeans, then that would seem like the same scenario outlined above -- a case where one of the living populations carries more Neandertal similarities than the other. Evidence of gene flow, right?
Ascertainment bias leaves another possibility: Maybe we looked harder for rare variants in one of the living populations. If so, the lack of rare Neandertal-shared variants in the other population may be an accident of our SNP discovery procedure.
There are ways around this problem. For instance, if the Neandertal genome carries many derived alleles for SNPs shared with Europeans, it weighs strongly in favor of recent genetic exchanges instead of ancient incomplete lineage sorting. But this basic question of "which population has more Neandertal ancestry" may still be hard to resolve.
Haplotypes from Neandertals
Green and colleagues also presented a second approach for testing Neandertal ancestry. They used SNP data to identify regions of the genome where non-African populations appear to have a "deep root" to their genealogy, but Africans do not. These regions are rare across the genome; they focused on 100-kb intervals, finding only a dozen genome-wide that fit their criteria. But each of these is a case where non-Africans appear to have an ancient genealogical split between two haplotypes, all the SNPs lining up to distinguish one branch of the genealogy from another. If both are not represented in Africa, then presumably one of them came from some non-African ancient population. And indeed, they found ten of the deep branches within the Neandertal genome.
This approach makes use of the information that SNP data provide about linkage. A segment of a chromosome from a living human that is similar to a Neandertal segment may be explained either by recent ancestry from Neandertals or by incomplete lineage sorting from the ancient human-Neandertal common ancestors. But if that segment is long, it probably isn't from the ancient common ancestors of humans and Neandertals, because recombination should have broken up the linkage across that long interval. Hence, long haplotypes shared by living humans and Neandertals are best explained by recent mixture. If those long haplotypes are predominantly found in non-Africans but not Africans, it tends to confirm that they have come from recent population mixture with Neandertals.
But how long should these intervals be? This is an area where we can improve on the approach taken by Green and colleagues. A hundred kilobases is way too long to represent the average Neandertal-derived haplotype. The average rate of recombination across the genome is around one centimorgan per megabase -- meaning that an interval of one million base pairs has a one percent chance of recombination per generation. That's a 1/1000 chance of recombination per 100 kb per generation, meaning that half the linkage across 100 kb should be broken up in roughly 700 generations. For humans, half the linkage at that distance decays after only 18,000 years or so, except in regions of low recombination. If we go as far back as 100,000 years ago, half of the linkage decays across regions as short as 18 kilobases. That means if we look at windows 20 kb long for evidence of Neandertal-derived haplotypes, we are likely to miss a large fraction of them. Hundred-kilobase intervals will miss nearly all of them.
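The arithmetic in that paragraph can be reproduced in a few lines. This is a sketch under two stated assumptions: a uniform rate of 1 cM per megabase (the genome-wide average quoted above) and a 25-year generation time, which is my assumption and gives figures close to the rounded ones in the text. The function names are illustrative.

```python
import math

CM_PER_MB = 1.0            # genome-wide average recombination rate, from the text
YEARS_PER_GENERATION = 25  # assumption; not stated in the text


def recomb_prob_per_generation(length_kb: float) -> float:
    """Chance of a crossover inside an interval in one generation."""
    return (CM_PER_MB / 100.0) * (length_kb / 1000.0)


def half_life_generations(length_kb: float) -> float:
    """Generations until half the copies of an interval have recombined.

    Uses the approximation (1 - r)^t ~ exp(-r * t), which is fine for small r.
    """
    return math.log(2) / recomb_prob_per_generation(length_kb)


def length_kb_with_half_life(years: float) -> float:
    """Interval length whose linkage half-life equals the given time span."""
    generations = years / YEARS_PER_GENERATION
    r = math.log(2) / generations
    return r / (CM_PER_MB / 100.0) * 1000.0


gens = half_life_generations(100)
print(f"100 kb: half-life {gens:.0f} generations, "
      f"about {gens * YEARS_PER_GENERATION:.0f} years")
print(f"half-life of 100,000 years: {length_kb_with_half_life(100_000):.1f} kb")
```

Local recombination rates vary a great deal around the average, so these are order-of-magnitude guides, not predictions for any particular region.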
Bottom line, we want to look at intervals as short as we can. But if we look too short, we won't have much evidence to work with. The 3-million SNPs in the HapMap version 2 give us one site every kilobase on average. Ten kilobases will give us around 10 SNPs. A 10-SNP haplotype may sound impressive, but if most of those SNPs have a derived allele at low frequency (say, less than 10 percent), then it starts to become more likely that a given haplotype resembles the Neandertal genome just because they share ancestral SNP alleles. Ideally we'd like more SNPs, but in reality the Neandertal sequence draft is likely to lack several, so if we want 10 SNPs worth of comparison, we'll need to look at longer intervals.
And really, HapMap 2 is a small sample to try to find low-frequency haplotypes from Neandertals. By analogy with the method used by Green and colleagues, we can find haplotypes that are present in the CEU (European ancestry) sample, present in the Neandertal genome draft, but absent in the YRI (West African ancestry) sample. But HapMap 2 includes only 120 genomes from each of the YRI and CEU samples. If we have a variant in Europe at 1 percent, we're pretty likely to miss it. Worse, if we find a haplotype in Europe at 1 percent, we're really not able to reject the hypothesis that it's in Africa at the same frequency, even if no copies of it are in YRI. We can help fix this problem by looking at HapMap phase 3 samples, which include two more African populations, bringing the total sample up to more than 300 within Africa. But there are fewer SNPs in HapMap 3, limiting our comparisons to longer windows. One could even contemplate the HGDP sample as a way to add even more individuals to our comparative samples. But that sample has many fewer SNPs, so we would need really long intervals to test the hypothesis of Neandertal ancestry for particular haplotypes.
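The sample-size problem can be put in numbers with simple binomial sampling. In this sketch I treat "120 genomes" as 120 independently sampled chromosome copies, and count 600 copies for 300 African individuals; both are assumptions made for illustration (the X chromosome, for one, would give fewer copies), and the function names are hypothetical.

```python
def prob_zero_copies(freq: float, n_copies: int) -> float:
    """Chance of seeing no copies of a haplotype at this frequency."""
    return (1.0 - freq) ** n_copies


def max_freq_consistent_with_zero(n_copies: int, alpha: float = 0.05) -> float:
    """Highest frequency that still leaves a chance >= alpha of seeing no copies.

    Solves (1 - p)^n = alpha for p: a one-sided upper bound on the frequency
    of a haplotype that was not observed in the sample.
    """
    return 1.0 - alpha ** (1.0 / n_copies)


for n in (120, 600):
    print(f"n = {n}: P(no copies | freq 1%) = {prob_zero_copies(0.01, n):.3f}, "
          f"upper bound on frequency = {max_freq_consistent_with_zero(n):.4f}")
```

The second function is the useful one for the African comparison: if the upper bound for an unobserved haplotype is still above 1 percent, its absence from the sample cannot rule out a 1 percent frequency in Africa.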
By the end of this I'll surely be pining for sequence data. Of course for that we haven't long to wait. But I have an aim for which genotype data are at the moment the only feasible approach. So I'm a bit stuck: Using a bigger sample means using longer intervals, which means I'll miss more and more Neandertal-derived haplotypes. But we should thereby get reasonable power to find any common haplotypes derived from Neandertals.
Phasing and the haploid Neandertal
The HapMap 2 samples and some of the HapMap 3 samples were taken from pairs of parents, where a child was also genotyped. Those trios make it possible to determine which SNP alleles were linked on the parents' chromosomes, providing a natural "phase" for the haplotypes. For some other samples, the phase was inferred algorithmically, using assumptions about population history and knowledge about which haplotypes are present in the populations with trios. Phasing algorithms are not ideal, because the assumptions about population history (inferred in many cases from the data) may be false. But over the relatively short intervals we're considering here, phasing will probably not lead to false positives.
Neandertal draft genomes are themselves more of a problem. Each sampled individual is known from a large number of short reads, which (with some luck) can be aligned with the human genome map. The present data include many gaps. More important, there are only a very small number of places where the number of reads is high enough to determine whether a Neandertal individual was a homozygote or not. The Neandertal consensus sequences are built by taking the most frequent base from these reads aligned to any given site in the human genome. That means that the Neandertal "haplotype" across any set of SNP loci may well be a jumbled chimera of two different haplotypes carried by the Neandertal individual. For the current analyses, I have kept the Neandertal individuals separate -- so the haplotypes here were derived only from the Vindija 33.16 individual. If we use a consensus sequence taken from multiple individuals, we will have fewer gaps but potentially more jumble of different haplotypes.
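A toy version of the majority-rule consensus shows how the chimera arises. This is a sketch of the rule as described above, not the actual consensus-building pipeline; the two haplotypes and the reads are made up.

```python
from collections import Counter
from typing import List, Optional


def consensus_base(read_bases: List[str]) -> Optional[str]:
    """Most frequent base among the reads covering one site, or None if uncovered."""
    if not read_bases:
        return None
    return Counter(read_bases).most_common(1)[0][0]


# A made-up individual carrying two haplotypes across four SNP sites.
hap_a = "ACGT"
hap_b = "GCTA"

# Made-up reads per site: low coverage, sampling the two chromosomes unevenly.
reads_per_site = [
    ["A", "A", "G"],  # mostly chromosome A
    ["C"],            # both chromosomes agree here
    ["T", "T", "G"],  # mostly chromosome B
    ["T"],            # a single read, from chromosome A
]

consensus = "".join(consensus_base(reads) or "N" for reads in reads_per_site)
print(consensus, consensus in (hap_a, hap_b))
```

The consensus "haplotype" here matches neither chromosome the individual actually carried, so a comparison against phased human haplotypes could miss a true match.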
There's not much to be done about this problem. It should mostly cause us to miss true instances of Neandertal genetic ancestry, and we may be able to quantify the extent of this error in some high-coverage areas.
(UPDATE 2011-02-24): I should mention, my lab has found that the Neandertal consensus sequences themselves have issues; the consensus-building algorithm appears in many cases to have included the human reference genome SNP allele in the place of the allele found in the majority of Neandertal reads. We are not yet sure how extensive this phenomenon is across the genome, but we have found it recurrently. We hypothesize that this is because of the priors on accepting calls with low read quality; the reference sequence seems to heavily bias the algorithm even in the presence of multiple contrary reads. We will have to check SNP calls manually in candidate regions.
OK, so let's find the Neandertal regions!
The strategy is fairly clear. I'll take a 10-SNP window from the HapMap, determine the haplotype of the Vindija 33.16 genome, see if that Neandertal haplotype occurs in the CEU HapMap sample, and then see if it also occurs in the YRI, MKK and LWK samples. When I find a haplotype shared with the Neandertal in Europe but not in Africa, I'll take that as a candidate haplotype for Neandertal ancestry.
I probably want to be a little more permissive than that, actually. A Neandertal haplotype that is present in Europe, and present but rare in Africa may still be a good candidate. A Neandertal haplotype that does not match at all SNPs may also nonetheless be a good candidate, considering that the consensus is often merging two true haplotypes together. There's not much I can do about the consensus problem, because I don't have any way of figuring out the missing information except in rare cases with multiple sequence reads. But to address the first problem I can relax my criteria a bit with respect to variation inside Africa.
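The scan described above can be sketched as follows. Everything in it is an assumption for illustration: alleles coded as 0/1 characters, phased haplotypes stored as equal-length strings over the same ordered SNPs, "N" for SNPs missing from the Neandertal draft, and windows containing a missing SNP simply skipped. The two thresholds correspond to the two ways of being more permissive.

```python
from typing import List, Tuple

MISSING = "N"  # assumed marker for SNPs not covered in the Neandertal draft


def is_match(hap_window: str, nea_window: str, max_mismatches: int) -> bool:
    """True if the haplotype differs from the Neandertal at few enough SNPs."""
    mismatches = sum(1 for h, n in zip(hap_window, nea_window) if h != n)
    return mismatches <= max_mismatches


def scan_windows(neandertal: str, ceu_haps: List[str], african_haps: List[str],
                 window: int = 10, max_african_copies: int = 0,
                 max_mismatches: int = 0) -> List[Tuple[int, int, int]]:
    """Return (start, CEU copies, African copies) for each candidate window."""
    candidates = []
    for start in range(len(neandertal) - window + 1):
        nea = neandertal[start:start + window]
        if MISSING in nea:
            continue  # simplification: require full Neandertal coverage
        ceu = sum(is_match(h[start:start + window], nea, max_mismatches)
                  for h in ceu_haps)
        afr = sum(is_match(h[start:start + window], nea, max_mismatches)
                  for h in african_haps)
        if ceu >= 1 and afr <= max_african_copies:
            candidates.append((start, ceu, afr))
    return candidates


# Made-up data, with 4-SNP windows to keep the example short.
neandertal = "110100N1"
ceu_haps = ["11010001", "00110101", "11011111"]
african_haps = ["00110101", "01010011", "10010001"]

strict = scan_windows(neandertal, ceu_haps, african_haps, window=4)
relaxed = scan_windows(neandertal, ceu_haps, african_haps, window=4,
                       max_african_copies=2)
print(strict)
print(relaxed)
```

Raising `max_mismatches` makes room for consensus errors, but it loosens the match for African haplotypes as well, so it will not always add candidates.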
Sliding the window down the chromosome will allow me to find the length of Neandertal-identical haplotypes in each individual, which could lead to an estimate of linkage decay. Across the genome, this will yield an estimate of the time that population mixture with Neandertals took place.
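One simple way to turn haplotype lengths into a date, offered as a rough sketch and not as the method used here: under a single pulse of mixture and a uniform recombination rate, introgressed segments are often approximated as exponentially distributed with a mean of about 1/t Morgans after t generations. The function name, the 1 cM/Mb rate and the example lengths are assumptions.

```python
from typing import Sequence


def generations_since_mixture(tract_lengths_kb: Sequence[float],
                              cm_per_mb: float = 1.0) -> float:
    """Estimate t from the mean tract length, assuming mean length = 1/t Morgans."""
    if not tract_lengths_kb:
        raise ValueError("need at least one tract length")
    mean_kb = sum(tract_lengths_kb) / len(tract_lengths_kb)
    mean_morgans = (mean_kb / 1000.0) * cm_per_mb / 100.0
    return 1.0 / mean_morgans


# Made-up tract lengths in kilobases.
example_lengths_kb = [20.0, 35.0, 50.0, 95.0]
print(generations_since_mixture(example_lengths_kb))
```

Gaps in Neandertal coverage and sparse SNPs will truncate observed tracts, which would push an estimate like this toward older dates, so any real application needs to correct for that.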
Several other observations should lend some confidence in particular candidate haplotypes. The more a candidate includes derived alleles that are not themselves common in Africa, the more convincing it will be. If it does represent a "deep root" -- that is, if no close relative of the Neandertal haplotype occurs in the African sample, that also helps. The region with Neandertal identity shouldn't be too long. It might be quite common -- a few Neandertal-derived alleles may have been positively selected in later populations. But most of them are likely to be rare -- so I should expect to see many of them in only one or two copies in the CEU sample.
I'm obviously interested in whether different populations (for example, Europe and China) have the same Neandertal-derived haplotypes. I'll leave that off for now -- there's much too much in this post already.
So to be clear, this procedure will find haplotypes that are likely to have come into non-African populations from Neandertals. No single test will confirm these; but a combination of factors may be compelling for individual haplotypes. We can identify which genes may be in or near an interval where a candidate haplotype is found, but in all likelihood we will not have any known functional polymorphisms in the SNP data. This procedure then will provide no evidence that a particular Neandertal-derived allele has any functional effect in any living people.
I'll be reporting an awful lot more about results over the next few days. My first series of comparisons was the X chromosome, for reasons that will become clear shortly. On the X, there are 396 intervals where a 10-SNP Neandertal haplotype is identical to at least one phased CEU haplotype and to two or fewer haplotypes in the African HapMap samples.
They vary in frequency in more or less the expected way -- a few of them are relatively common (10 or more copies out of the CEU sample, for example), but most have only one or two copies in CEU.
These vary substantially in length, mostly because some areas have very low Neandertal coverage. A few are more than 100 kb in length; most are 30 kb or less.
The haplotype with the strongest signature -- a 100-kb interval encompassing 26 SNPs in the Vindija 33.16 genome -- is found in more than 15 (and centrally, in 22) CEU individuals and in no African individuals. The interval spans part of the DMD gene (associated with Duchenne's muscular dystrophy). Conveniently, this is precisely the interval identified by Yotova and colleagues as a site with Neandertal-derived alleles in non-African populations. They used comparisons at the sequence level, finding the Neandertal-derived variant at a frequency of 9% overall outside Africa. I have not yet confirmed that the SNP haplotype corresponds to this Neandertal-derived allele at the sequence level, but we should be able to manage that using public genomes. It's a nice confirmation that we're looking at the right kind of candidate loci.