The Neandertal genome FAQ, February 2009 edition

I was out of town last week when the Max-Planck Institute made its announcement about the completion of 1x coverage of the Neandertal genome. It was an exciting day for me. Already, I had scheduled a number of radio shows and a public lecture to commemorate Darwin Day. Several press interviews regarding the news of the Neandertal sequencing project added to the hectic nature of the day, so I didn't get a chance at the time to sit down and write my reactions.

So, nearly a week later, I've finally caught up. I've answered many questions about the Neandertal genome before, so I'm focusing these on the current announcement.

And now, some new questions arising this week:

Has the Neandertal genome now been reconstructed?

No. This announcement is a milestone, not an endpoint.

Much remains to put together an entire genome sequence. The ongoing work represents a massive technical achievement, and is well worth celebrating. But we are not yet at the point where we can talk about structural variants in the Neandertal genome compared to humans, length polymorphisms, or a number of other things. Plus, as noted below, only 63 percent of the nucleotides have been sequenced once -- leaving a lot of basic sequencing left to get even a single pass over the whole genome.

Some stories have used the term "decoded" -- that also would be a misstatement. We don't know the import of the variations that might so far have been found. That is, we cannot yet convert the information that Neandertal sequences provide to us about their genome into information about their phenotypes. Keeping that in mind, "decoding" the human genome is an ongoing process. With the Neandertals, we have barely begun.

I heard that this was a whole Neandertal genome, but then the fine print says that it's only 60 percent completed. What gives?

They set up an announcement when they knew they would be past sequencing 3 billion bases. And in fact they've reached 3.7 billion.

That would be more than the whole genome, if they could pick out exactly which parts they are sequencing. But the shotgun sequencing approach they are using means that some parts of the genome are represented several times in their 3.7 billion bases, while others are not represented at all.

It's sort of like painting your house. You could calculate how many gallons it would take for "full coverage" with a paintbrush, but if you shoot that many gallons out of a paint gun there are going to be a lot of gaps that didn't get painted.

For the Neandertal sequence right now, the gaps add up to around 36 percent of the whole genome. Which is an awful lot of missing data.
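The paint-gun effect is easy to see in a toy simulation. The sketch below is only an illustration under idealized assumptions that are mine, not the sequencing team's: reads of one fixed length land at uniformly random positions on a small circular "genome," with no contamination, no microbial DNA, and no mapping problems. Under those assumptions the expected fraction of positions hit at least once is 1 - e^(-c), where c is the average coverage. The function names, genome size, and read length are all made up for the example.

```python
import math
import random


def expected_covered_fraction(coverage):
    """Idealized (Poisson) fraction of positions hit at least once."""
    return 1.0 - math.exp(-coverage)


def simulate_covered_fraction(genome_len, read_len, coverage, seed=0):
    """Drop fixed-length reads at uniformly random starts on a circular toy genome."""
    rng = random.Random(seed)
    # Number of reads needed so that total bases read = coverage * genome length.
    n_reads = round(coverage * genome_len / read_len)
    covered = bytearray(genome_len)  # 0 = never hit, 1 = hit at least once
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        for offset in range(read_len):
            covered[(start + offset) % genome_len] = 1
    return sum(covered) / genome_len


if __name__ == "__main__":
    # Toy sizes only; a real genome is billions of bases long.
    for c in (0.5, 1.0, 2.0):
        sim = simulate_covered_fraction(200_000, 50, c)
        model = expected_covered_fraction(c)
        print(f"{c:.1f}x coverage: model {model:.3f}, simulated {sim:.3f}")
```

In this toy model, reading exactly one genome's worth of bases leaves a bit more than a third of the positions untouched, which is roughly the size of the gap in the current data. Real ancient DNA is messier than this, so treat it as intuition for why "1x" does not mean "complete," not as a reconstruction of the project's actual numbers.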

So why make an announcement now? I dunno. Darwin's birthday makes a good occasion? They could easily have published last year or the year before on many different genes, just as they published the whole mtDNA last year. It seems likely to me that they've been holding off announcing or publishing until they were sure they had worked out a solution to the contamination problems they were having.

I think they deserve to pop some champagne bottles and celebrate. When there is a public data release, we can all celebrate!

What about those contamination problems?

If you've been around a while, you'll remember that I thought the initial report of contamination was a bit overblown. Nevertheless, the possibility of substantial contamination, documented by comparisons between sequencing methods, stopped almost all work on the publicly available data. It was a serious problem, and the research groups responded seriously to the presence of contamination in the samples. Few details of this response were made public, but clearly there was a concern that the longer fragments coming out of the 454 machine didn't originate in the Neandertal sample.

According to the Max-Planck press release, they've taken a number of steps to eliminate contamination. I'll quote the relevant sections:

One essential element developed by Pääbo's group was the production of sequencing libraries under clean-room conditions to avoid contamination of experiments by human DNA. They also designed DNA sequence tags that carry unique identifiers and are attached to the ancient DNA molecules in the clean room. This makes it possible to avoid contamination from other sources of DNA during the sequencing procedure, which was a problem in the initial proof-of-principle experiments in 2006. They also used minute amounts of radioactively labeled DNA to identify and modify those steps in the sequencing procedure where losses occur. Together with other advances implemented during the project, these innovations drastically reduced the need for precious fossil material so that less than half a gram of bone was used to produce the draft sequence of 3 billion base pairs.
In order to reliably compare the Neandertal DNA sequences to those of humans and chimpanzees, the Leipzig group has performed detailed studies of where chemical damage tends to occur in the ancient DNA and how it causes errors in the DNA sequences. The researchers found that such errors occur most frequently towards the ends of molecules and that the vast majority of them are due to a particular modification of one of the bases in the DNA that occurs over time in fossil remains. They then applied this knowledge to identify which of the DNA fragments from the fossils come from the Neandertal genome and which from microorganisms that have colonized the bones during the thousands of years they lay buried in the caves. They have also developed novel and more sensitive computer algorithms to put the Neandertal DNA fragments in order and compare them to the human genome.

I'm satisfied that they've done everything possible to eliminate contaminants. The examination of the chain of events from extraction to the final sequence is especially important. In many ancient bones, the steps taken to sterilize and extract from deep within the bone somehow still don't eliminate contamination in the final sequence data. Most of that contamination must arise during the processing and sequencing steps, despite the oft-quoted "clean room" conditions in ancient DNA labs. So the methodological advances toward understanding the sources of contamination are very scientifically significant.

There's a hint in some of the earlier press coverage that the pace of sequencing has vastly sped up in the last few months. For example, in December, Ewen Callaway reported that the genome was halfway done:

Half the Neanderthal genome has been decoded and the rest should be sequenced by year's end, a scientist involved in the project told a human evolution conference last week.
Researchers will roll out a rough draft of the Neanderthal nuclear genome after their sequencers have read every letter in the genome on average once - "1x coverage" in genomics speak.

Callaway is a careful reporter, but we should keep in mind that the comments in the story might not quite have conveyed the full situation. Still, if we take that assessment at face value, we can speculate that the process of working out the contamination issue took a long time during which sequencing was relatively slow or paused. If they actually had only sequenced half the 3 billion bases by December, that's pretty fast work since then (a perception that was echoed in some press reports prior to the announcement).

The switch to the Illumina platform seems like an underreported aspect of the story. The press release claims that a billion reads were done on the Solexa, compared to only 100 million from 454 -- that also suggests a switch later in the process, since we know that they were using 454 initially and through early 2008. The press release doesn't explain why they moved from the 454 machine to Illumina. Maybe it's just efficiency of the current platform, but there must be a story there.

What was the most boring aspect of the announcement?

I was talking to a reporter on Tuesday before the press conference, and I said,

"They're no doubt going to give us a list of some genes, with well-known variations in living people, that they've genotyped in Neandertals. And, aside from FoxP2, which we already know about, and microcephalin, I don't know what those will be. I think it would be the most boring possible outcome if they told us that the lactase persistence allele wasn't there. Because there's no news there."

Well, I gave a big belly laugh when I saw the press release. Gee, Neandertals didn't have lactase persistence. Big surprise there! What did they think, they were secretly milking goats?

OK, I admit, that's overly snarky. I mean, what if they'd found the opposite? It would be contamination, of course. So finding the wild-type lactase allele is worth something.

But it's sort of like if your friend was looking through a telescope on Christmas Eve and caught the first-known glimpse of Santa and his reindeer. And you asked her, "What does he look like?" And she says, "He's wearing a red coat!"

It's like being trapped in a Laurel and Hardy routine. And I'm Hardy.

Does the Neandertal genome show that they were "distinct from us"?

Experts on Neandertal bone morphology can readily distinguish them from later Europeans, assuming that the correct parts of the skeleton have survived. So from that perspective, Neandertals were clearly a "distinct" population. They had a morphological configuration no longer found anywhere in the world, and not found in the Europeans who immediately followed them in Europe.

On the other hand, the bones of early Upper Paleolithic Europeans share some interesting similarities with the Neandertals. You wouldn't call the Oase 1 cranium a Neandertal. It lacks nearly all of the features that set Neandertals apart. But it has a mandibular foramen shaped like a small horizontal oval -- like a bit over half of Neandertals, and nearly a quarter of early Upper Paleolithic mandibles. This is a very rare morphology today, and it is rare elsewhere in the human fossil record, although it has been found in the very early Homo erectus sample from Dmanisi. There are two hypotheses for why this feature and others should be most common in two populations living in the same place in adjacent time periods: descent or parallel evolution.

Looking only at the morphology, we have only our personal limit of credulity to argue one way or the other. How many features does it take to be convinced that descent must explain some of the similarities? Sadly, the answer to this question is different for different researchers.

I think that the most reasonable explanation for the morphology is gene flow between Neandertal and other populations. But I have to say that others disagree.

Genetic evidence may be most useful because we are much more likely to agree on the score. A unique gene sequence is unlikely to arise twice in parallel, and in any event the probability of such parallelism can be calculated in real numbers, not shopworn guesses. With 3 billion base pairs to compare between our populations, we have a good chance of finding and quantifying even low levels of genetic exchanges.

However, these conclusions still depend on assumptions and models that not all anthropologists agree about. At the moment, the state of the science is such that the meaningful distinction is not whether Neandertals and humans may have interbred, but instead whether such interbreeding was common enough to be evolutionarily important, or to establish Neandertals as a "distinct" population. Since "important" and "distinct" do not have quantifiable meanings in evolutionary theory, you can see that we have a long way to go before paleoanthropology agrees on testable models of Neandertal population history.

I think the science will be lively for the next few years, as the focus goes away from details of morphological characters and toward details of evolutionary models. The morphology will still remain important -- particularly as the observable evidence of variation within ancient populations. It will take many years before we have a good picture of genetic variability within these samples. But questions of "distinctness," which depend on shared characters and levels of interbreeding, must be answered at the level of models, not features.

What about microcephalin?

According to the press conference, the human-derived allele of MCPH1 was not found in the Vindija sequence. Bruce Lahn and colleagues had suggested that this allele might have come into the recent human population from Neandertals, based on its present pattern of variability. This allele is quite divergent from the rest of human variation at the locus, it is common outside of Africa but rare inside of Africa, and it appears to have been under positive natural selection for around 30,000 years. I have an FAQ on MCPH1 and introgression, and I've published on the topic. If the human-derived allele is not in the Neandertal genome, that obviously weakens the argument for introgression of this gene from Neandertals.

We have interpreted this gene cautiously from the beginning. Neandertals are one likely source for such introgression, but not the only one. In my FAQ, I wrote this:

Well, the D haplogroup [of MCPH1] is common in many areas outside of Africa in addition to Europe. So it isn't possible to really specify in what archaic population it may have originated. There is some chance that it may be found in the Neandertal genome sequence, when that becomes available. In fact, that would be the ultimate test for many candidate introgressive alleles.
But there is a good chance that it won't be found in the Neandertal sequence. After all, Neandertals were probably pretty thin on the ground -- especially in Europe. A sampling of their genes would be sort of unlikely to yield a high proportion of archaic alleles that may have survived to the present day. So there is hope that we will find and document such alleles, but the best evidence for many of them may remain their current pattern of variation in living people.

I think those points are important. There were not many Neandertals, and it may be much more likely for present-day humans to have genetic variation that originated in South or West Asia, or even multiple regions of Africa (a hypothesis suggested for some other gene loci).

But I still think it very likely that out of the 20,000 genes in the human genome, some will have derived variants that were also present in the Neandertal genome. Human evolution over the last 50,000 or more years was driven by new variation, and multiple human populations would have been one of the largest potential reservoirs of adaptive variation for selection to work upon.

What is the most important aspect of this announcement?

Paleoanthropology is a science that generates huge public interest. But it gives very few chances for public participation. Those of us who are close to paleoanthropology know how much our science is driven by good ideas from many other fields. The pathways by which those insights enter our science tend to be highly constrained -- radiocarbon dating, scanning electron microscopy, isotopic analysis of enamel, and now genetics have all been brought into paleoanthropology by extremely skilled scientists from outside the field. I think that the Neandertal genome has the potential of breaking new ground.

One year from now, there will be high school students working with sequences from the Neandertal genome. Who knows what they will discover?

I just think that is tremendously exciting. For the first time, the primary data of paleoanthropology will be available to everyone.