metagenomics

Complete Neandertal mitochondrial sequence, and selection on human (not Neandertal) mtDNA

In the current Cell, the Max-Planck group, in coordination with 454 Life Sciences, report the sequence of a complete Neandertal mtDNA. I'm out of town right now, so I'm writing fairly quickly, and I haven't seen any of the reporting. Keeping that in mind, I wanted to set out a few of the interesting things about the paper.

I've been waiting a long time for this sequence to come out. I know they've had the basic data for a long time: because the mtDNA copy number is very high, the 454 process kicks out a lot of mitochondrial sequence. The reward for the wait is that Green and colleagues have done a very careful job of comparative analysis, with some very interesting results.

If I leave something obvious out, please forgive me, since I'm just dashing this off as quickly as I can.

Where we left off...

All previously reported sequences of Neandertal mtDNA have been fragments of the control region. The control region of the mtDNA (hypervariable regions I and II) is very helpful for working out phylogenetic relations among recent humans. True to its name, it varies a lot, and its high mutation rate allows a fine discrimination among lineages that have differentiated only within the recent past.

The high mutation rate of the hypervariable regions also means that closely related populations have accumulated many differences. That's very convenient for identifying Neandertal mtDNA, where only small fragments (up until recently) have been practical to obtain. A small fragment of the mtDNA control region is sufficient to assess whether a specimen is like other known Neandertal sequences or not. Up to now, this has been an important way of authenticating Neandertal DNA sequence results --- although it has the obvious drawback that it might falsely exclude some genuine sequences that really do look like the modern human form.

So far every Neandertal mtDNA sequence looks like a member of the same mtDNA clade. (More carefully, every specimen with good biological preservation that has produced DNA has yielded at least some mtDNA sequences that form a clade distinct from all recent humans. Others are presumed to be contamination -- which I have no reason to doubt.) No recent human -- out of the many thousands that have been sampled so far -- has produced a mtDNA control region sequence like any known Neandertal. The two populations, so far as we can tell, possessed distinct mtDNA clades.

Divergence time

A complete mtDNA sequence provides a lot of sites, which allows a more precise estimate of the divergence time between recent human and Neandertal mtDNA lineages. The paper reports this time as 660,000 years ago, with a confidence interval from 520,000 to 800,000 years ago. That range of dates substantially overlaps with the prior estimates of divergence time, and is a pretty good match to the initial estimate based on a single HVR1 sequence in 1997.

The availability of a complete sequence has also removed a remaining piece of ambiguity from earlier comparisons. Because the hypervariable regions are so variable, it has always been the case that comparisons of hundreds or thousands of recent humans have included some pairs of individuals who are really divergent in their control region sequences. The result: some people living today are more different from each other than Neandertals are from recent people.

Now, that particular fact is not meaningful in a cladistic sense. Neandertal sequences share derived mutations, as do recent humans. But the concept of a "range" of genetic divergence has confused comparisons. Comparing the control region alone, it may appear that Neandertals were not so very different from living humans, even if they have a few derived mutations that no longer exist. As long as some humans were also very different from each other, it remained possible that the tree had been wrongly reconstructed. An equally parsimonious tree (or even a more parsimonious one) might link the Neandertal clade with some modern human, even if not a recent European. When comparing humans to chimpanzees and more distantly related primates, the hypervariable regions are somewhat saturated with mutations, meaning that parallel mutations between different species are very common. This makes it even harder to reconstruct the tree of mtDNA relationships based on the hypervariable regions alone.

Comparing the complete mtDNA genomes of a Neandertal and many recent humans presents a very different picture. Humans are all more similar to each other, when comparing the complete mtDNA genome, than any human is to a Neandertal. And in fact the Neandertal sequence is three or more times as different, on average, from us as we are from each other. This change from the earlier picture is a purely statistical one: more sites, with a more regular mutation rate. But it makes for a clearer picture, and one that supports the phylogenetic model more strongly.

Selection on COX2?

Even though the control region is so helpful for analysis of recent humans, and easy identification of Neandertals, it's only a small fragment of the complete mtDNA. The mitochondrial genome is inherited as a single unit, so different mutations on a single mtDNA are co-inherited with each other. That means that the diversity of the noncoding control region is shaped by both genetic drift (due to demography) and selection. The selection includes purifying selection on coding sites across the entire mtDNA genome, and the possibility of positive selection on one or more ancient mutations.

I believe that positive selection on mtDNA in ancient humans has a lot of indirect support (and I wrote as much here). To give a brief list:

  • Mitochondrial haplotypes in living humans correlate with functional variation in disease, longevity, and performance -- all areas that have undergone recent biological shifts in humans.
  • Some mtDNA haplotypes in humans appear to have been under recent positive selection, as indicated by their geographic distributions.
  • Some mtDNA haplotypes have vastly changed in frequencies within the past few thousand years, as evidenced by ancient DNA samples.
  • Nuclear genes involved in mitochondrial function have been under recent positive selection.
  • MtDNA from Neandertals is completely absent today, despite the other evidence for genetic survival of that population. This combination is very unlikely if mtDNA was neutral.

So I think that positive selection is not only a reasonable hypothesis, it is extremely likely. But that is not to say that it has been demonstrated. Others might say that my final reason, that positive selection can explain the apparent contradiction between mtDNA and other data (such as skeletal comparisons and apparent nuclear introgression), is a case of wishful thinking. They might argue that all this other evidence of Neandertal-modern gene flow is an illusion, and not a problem to be explained.

I don't think they're right, but in the spirit of honest advertising, that's what they think.

It would be unreasonable for me to expect that a Neandertal mtDNA genome would provide strong evidence of positive selection on the human lineage. Finding such evidence would require repeated selected substitutions, probably within a single gene. Otherwise there would never be statistical evidence of positive selection. The available tests for positive selection in a two-genome (or in this case, two-clade) comparison are very weak.

Only a single selected mutation would be sufficient to explain the complete replacement of Neandertal mtDNA by an advantageous modern human type. No test of selection is powerful enough to refute neutrality based on a single selected site in a comparison of two mtDNA genomes. And repeated selection on a single gene just doesn't seem as likely as one or a few instances of selection, potentially on many mtDNA coding regions.

So imagine my surprise, when reading this paper, when I discovered that they found repeated substitutions on a single mtDNA gene in the human lineage, and statistical evidence of positive selection!

The gene is cytochrome oxidase subunit 2 (COX2). Using the chimpanzee mtDNA sequence as an outgroup, there were 18 human-specific and 20 Neandertal-specific nonsynonymous coding substitutions. Out of the 18 human-specific substitutions, 4 were in COX2. Only three synonymous substitutions occurred in humans for this gene (the synonymous-to-nonsynonymous ratio of 3:4 differs from the ratio for other mtDNA coding regions, 54:14). In contrast, Neandertals had no coding substitutions in this gene -- every COX2 difference between Neandertal and human sequences is inferred to have occurred in ancient humans. These data are unlikely unless COX2 was recurrently selected in ancient humans.

More evidence will be necessary to establish positive selection. The paper includes multiple comparisons of different genes, so a significant result for this one is necessarily weakened by the multiple-comparisons correction.
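To make the arithmetic concrete, here is a minimal sketch of one way to compare the COX2 counts with the rest of the coding genome: a one-sided Fisher's exact test on the 2x2 table of nonsynonymous and synonymous substitutions. This is not necessarily the test the paper used. The counts are the ones quoted above; the choice of test, and the number of comparisons used for the Bonferroni step (13, one per protein-coding mtDNA gene), are my assumptions.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of seeing `a` or more in the top-left cell
    when row and column totals are held fixed (a hypergeometric tail).
    """
    row1 = a + b          # substitutions in the gene of interest
    col1 = a + c          # all nonsynonymous substitutions
    n = a + b + c + d     # all substitutions
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)

# Human-lineage counts quoted in the text (nonsynonymous, synonymous)
cox2_nonsyn, cox2_syn = 4, 3
other_nonsyn, other_syn = 14, 54

p = fisher_one_sided(cox2_nonsyn, cox2_syn, other_nonsyn, other_syn)

# ASSUMPTION: one test per protein-coding mtDNA gene; the paper may differ
N_TESTS = 13
p_bonferroni = min(1.0, p * N_TESTS)

print(f"one-sided p = {p:.4f}, Bonferroni-adjusted p = {p_bonferroni:.4f}")
```

The point of the sketch is the shape of the calculation: with only seven substitutions in the gene, any raw p-value is modest, and multiplying it by the number of genes tested is exactly the weakening that a multiple-comparisons correction imposes.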

But in a very interesting part of the paper, the authors did a functional analysis of the human-specific changes in COX2. Functional analysis of coding sites has come a long way in the last few years. Last fall, we saw it applied to the Neandertal-specific mutation of the MC1R gene. It was the functional analysis that argued that the mutation likely resulted in a red hair phenotype. These functional analyses consider the position of a mutation within the protein sequence, the extent to which that part of the protein interacts with other proteins, and whether the coding changes are otherwise conserved in other species.

Here is the paper's conclusion about COX2:

Another interesting observation is that COX2 stands out among proteins encoded in the mitochondrial genome as having experienced four amino acid substitutions on the modern human mtDNA lineage. Further work is warranted to elucidate the functional consequences of these amino acid substitutions. However, all these substitutions are in regions of the protein that, based on the crystal structure, do not have any obvious function, and they are variable among primates. Hence, they may represent either minor adaptive advantages, perhaps of regulatory relevance, or have no significant functional consequences for mitochondrial function. Unless other evidence for their importance becomes available, we see no need to invoke positive selection to account for the evolution of COX2 on the human lineage (Green et al. 2008:423).

To me, a very persuasive finding is that each of the four human-specific mutations of COX2 is also found in some other primate species. In other words, where humans differ from chimpanzees and Neandertals (and generally, gorillas and orangutans), humans are like baboons or macaques. The authors of the paper read this finding as evidence that the changes have little functional importance. But I see this as a suggestion that these substitutions are functionally salient. Different primates have different energetic and dietary constraints, and it should be no surprise if they exhibit functional convergences in mtDNA. Humans evolved four separate sites, within the last half-million years, to be similar to some cercopithecoids and different from most other hominoids. Neandertals exhibited no evolution in this gene. This makes sense under a hypothesis of mtDNA selection in accordance with functional requirements, which we have good reason to believe were different in humans and Neandertals.

But as the authors say, we need more evidence about the function of these genes. I think the comparative evidence now supports the hypothesis of selection very strongly, and is consistent with the pattern of evidence from the nuclear genome and from the anatomy of early Upper Paleolithic Europeans.

Contamination

This paper advances our understanding of contamination within the Neandertal sequences. The authors acknowledge Wall and Kim's (2007) interpretation of a high contamination rate in the earlier reported nuclear genetic data off the 454 platform, and provide additional information to support a relatively high contamination rate:

Contamination with extant human DNA is the other dominant source of erroneous Neandertal sequences. Given the high coverage and the fact that the best estimate of the contamination rate here is 0.5% (with an upper 95% confidence limit of 0.87%), we do not expect contamination to affect the mtDNA sequence assembly to any appreciable level. Under the assumption that the Neandertal mtDNA sequence is reliable, it is a useful tool for gauging contamination when sequencing the Neandertal nuclear genome. Previously, assays to determine contamination within Neandertal fossil extracts were limited to the HVRI, which carry few positions where extant humans differ from Neandertals. By contrast, the complete Neandertal mtDNA now offers 133 such positions. This enables a reliable estimation of mtDNA contamination by analyzing sequence reads from 454 libraries, rather than by PCR-based assays of the DNA extracts. For example, when we do this in a small preliminary data set initially published from this fossil (Green et al., 2006), 10 of 10 sequences are classified as Neandertal. However, in further unpublished sequencing runs from that library, 8 out of 75 diagnostic sequences derive from extant human mtDNA, suggesting a contamination rate of ~11% (CI = 4.7%–20%). This is in agreement with the suggestion (Wall and Kim, 2007) that contamination occurred in that experiment. That library was constructed outside our cleanroom facility and before the introduction of the Neandertal-specific key, which is crucial for the detection of contamination by other 454 libraries, and was therefore not used for the subsequent Neandertal genome sequencing project (Briggs et al., 2007). However, with the help of the mtDNA presented here, such levels of contamination are now easily detectable from 454 sequencing runs (Green et al. 2008:424).

So the mtDNA from the same sequence library as the previously reported 1 Mb of Neandertal nuclear genome shows a high contamination rate. That's really disappointing, since it means we have no data to work with. We'll just have to wait.
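For readers wondering where an interval like the one quoted for 8 contaminant reads out of 75 comes from, here is a minimal sketch of an exact (Clopper-Pearson) binomial confidence interval, found by bisection. Whether the authors used this particular method is my assumption; the function names are mine as well.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve(f, target, increasing, iters=60):
    """Bisection on [0, 1] for f(p) == target, with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials."""
    # Lower bound: the p at which seeing k or more is just alpha/2 likely
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which seeing k or fewer is just alpha/2 likely
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper

# Counts quoted from the paper: 8 human reads among 75 diagnostic reads
lower, upper = exact_interval(8, 75)
print(f"point estimate {8 / 75:.1%}, 95% interval {lower:.1%} to {upper:.1%}")
```

With so few diagnostic reads the interval is necessarily wide, which is why the 133 diagnostic positions across the whole mtDNA matter: more informative reads per run means a tighter estimate.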

OK, that's all I have time to post; more later...

References:

Green RE and 24 others. 2008. A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134:416-426. doi:10.1016/j.cell.2008.06.021

Last week, a short article in Science by Rachel Mackelprang and Edward Rubin discussed some of the recent advances in ancient DNA extraction. Of most interest is the paragraph that discusses ways to probe for particular genes while avoiding some drawbacks of PCR amplification:

Microarray-based hybridization, coupled with high-throughput sequencing of recovered DNA, has recently been used to capture thousands of targets in parallel from modern DNA samples. With these strategies, a DNA sample is directly applied to an array of specifically designed oligonucleotide probes immobilized on a chip. Complementary fragments hybridize to the probes while the remaining nonbound DNA is washed away. The hybridized DNA can then be eluted from the chip and sequenced, resulting in enrichment of targeted genomic regions (11). Alternatively, chip-synthesized oligonucleotide probes have been released from the chip and used to capture molecules in solution (12). A purely solution-based method, where sets of probes are designed against a reference genome and used as a bait to "hook" corresponding sequences from a DNA pool (13), has been used to recover specific regions of nuclear DNA from Neandertal and cave bear genomic sequence libraries (1). These various capture approaches hold promise for economically investigating the same sequence in multiple different samples as well as examining multiple independent molecules of an allele isolated from a single sample.

The short review also mentions problems with contamination and some of the results that indicate contamination of the Neandertal sequences, which I've discussed before (Complete Neandertal DNA files).

Probing for the alien within

Laura MacConaill and Matthew Meyerson present a cool short review in Nature Genetics of metagenomics applications in pathogen discovery.

The basic principle is to extract DNA from a tumor or sore, do intensive sequencing of all the DNA in it, and use the computers to subtract out everything human. What's left after you subtract out the human DNA is any pathogen that might be in the sample:

The two recent studies combined computational subtraction with microreactor-based pyrosequencing to identify viral signatures associated with human disease. Feng et al. used high-throughput pyrosequencing and comparison to the human transcriptome to identify a viral sequence in a library of cDNAs generated from individuals with Merkel cell carcinoma, a rare but aggressive human skin cancer. The authors sequenced over 395,000 reads of 150-200 bp in length. After digital transcriptome subtraction, 2,395 sequences remained. Among these, conceptual translation of one sequence showed similarity to a polyomavirus. By cloning the complete viral genome and carrying out further analyses, the authors found that the Merkel cell polyomavirus sequence was present in eight of ten Merkel cell carcinomas.
A second group used the same high-throughput DNA sequencing technology to identify a previously undiscovered arenavirus that likely caused the deaths of three transplant recipients who all received organs from a single donor.
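The subtraction idea is simple enough to sketch in a few lines. What follows is a toy version, not the pipeline either study used: real work relies on proper alignment tools against the full human genome or transcriptome, whereas this just throws out any read that shares a k-mer with a host reference. The sequences, the function names, and the k-mer threshold are all invented for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract_host(reads, host_reference, k=8, max_shared=0):
    """Keep only reads sharing at most `max_shared` k-mers with the host."""
    host_index = kmers(host_reference, k)
    leftovers = []
    for read in reads:
        shared = len(kmers(read, k) & host_index)
        if shared <= max_shared:
            leftovers.append(read)
    return leftovers

# Toy data, invented for illustration only
host = "ACGTACGTTAGCCGATAGGCTTAACCGGTT"
reads = [
    "TACGTTAGCCGATAG",        # found in the host reference: subtracted
    "GGGGGCCCCCAAAAATTTTT",   # nothing like the host: kept as a candidate
]

candidates = subtract_host(reads, host)
print(candidates)
```

Whatever survives the filter is only a candidate. As in the Merkel cell study, the leftover reads still have to be compared against known pathogens, or translated and checked for similarity to known protein families.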

I don't know if sequencing will ever get so cheap that this will become a practical diagnostic method, but it really doesn't need to be. As soon as you suspect a pathogen, you can probe directly for that pathogen's DNA in a sample -- and there's no barrier to testing for hundreds of pathogens at once. Heck, there ought to be a SNP chip for it.

But this is a potentially important way of identifying new pathogens in unknown samples from scratch. The article mentions that the current cost of this kind of sequencing is around $10,000 per sample, and that is rapidly falling. For that cost, you get the sequence on your computer, even if you can't identify it yet, and who knows -- it might pop up two years later when somebody else finds it in some unexpected place.

References:

MacConaill L, Meyerson M. 2008. Adding pathogens by genomic subtraction. Nat Genet 40:380-382. doi:10.1038/ng0408-380


Poincaré pusillanimy

So Science named the Poincaré conjecture proof as the "breakthrough of the year." I got my year-end Discover a couple of weeks ago, and they said this:

If, in the year 2100, DISCOVER runs a feature on the top advances in science in the 21st century, the proof of the Poincaré conjecture is still likely to be the number-one story in mathematics.

I thought that was really funny, because they made it only number 8 for the year! "Sucks to be a mathematician," I thought!

The number two breakthrough, according to Science, was paleometagenomics (including the Neandertal genome), which managed a short mention in Discover's number 7 ranked story. I'm bringing this up because I predicted it in my 2006 New Year's predictions.

I'll be reviewing my predictions and making 2007 predictions next week -- right now it looks like I did pretty well on the solid ones, and downright poorly on the speculative ones.

How your metagenome makes you fat

This week's Nature is largely about the association of gut biota with body mass in humans, with two papers and a commentary on the subject. Both papers are from Jeffrey Gordon's lab, and to my mind they both establish a very important base for metagenomics in human biology.

It has long been known that the human gut flora can cause incredible problems when it goes wrong, but so far these problems (for example, symptomatic H. pylori, pathogenic E. coli, pathogenic Clostridium difficile) have been compartmentalized as the effects of individual pathogens. A metagenomic perspective views such health problems as imbalances in an ecosystem. Bad health outcomes might be induced by harmful invasives (such as hospital-acquired Clostridium difficile) or by long-lasting phylogeographic associations (some examples of H. pylori). In either case, if we want to control the disease, we will be best served to study its evolutionary origin -- which may owe as much to ecology as to epidemiology.

These papers are important because they show that "normal" variations in human biology -- that is, not necessarily pathological variants -- also are linked to the ecology of our metagenome. I think the introduction to the paper by Turnbaugh et al. (2006:1027) puts it well:

The human 'metagenome' is a composite of Homo sapiens genes and genes present in the genomes of the trillions of microbes that colonize our adult bodies. The latter genes are thought to outnumber the former by several orders of magnitude. 'Our' microbial genomes (the microbiome) encode metabolic capacities that we have not had to evolve wholly on our own but remain largely unexplored. These include degradation of otherwise indigestible components of our diet, and therefore may have an impact on our energy balance.

There is a complex set of interrelated observations between these two papers. They used metagenomic methods to assess the microbial population of the gut in obese and nonobese humans (that's the Ley paper). Then (the Turnbaugh paper) they looked at normal lab mice versus leptin-knockout lab mice who are genetically obese (the famous "fat" mice). They found that the microbial contrasts between obese and nonobese people were also shared by the obese and nonobese mice. But the obese and nonobese mice are difficult to compare, in terms of microbial function, because they don't have the same food intake. So finally, they took the microbial populations from the obese and nonobese mice and stuck them into germ-free mice, finding that the microbial community from the obese mice actually is more efficient at extracting calories from food -- the excreta have fewer calories remaining. And then (back to Ley), they examined humans who lost a lot of weight, and found that they had the gut microbes of nonobese people!

The power of metagenomics became evident when Ley and colleagues were able to show that the differences in gut ecology between obese and nonobese individuals were not simple "blooms or extinctions of specific bacterial species." Considering the rapid reproductive potential of microbes, such intermittent blooms would likely be a first hypothesis for differences between individuals; the metagenomic data instead show changes to a more complex balance of bacterial types.

This kind of alteration of a complex balance is what Bajzer and Seeley mean by "biological control systems" in their accompanying editorial. They mention that a very slight excess in caloric intake over expenditure can add up to large weight gains over time, so that relatively small differences in the efficiency of gut flora can make a large difference.
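The "small excess adds up" point is easy to check with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of roughly 7,700 kcal per kilogram of body fat, which is my assumption and not a figure from the editorial, and it ignores the fact that energy expenditure rises as body mass increases. The intake and excess values are hypothetical.

```python
# ASSUMPTION: rule-of-thumb energy content of body fat, not from the papers
KCAL_PER_KG_FAT = 7700

def yearly_gain_kg(daily_intake_kcal, excess_fraction, days=365):
    """Fat mass gained if a fixed fraction of intake goes unspent every day."""
    surplus_kcal = daily_intake_kcal * excess_fraction * days
    return surplus_kcal / KCAL_PER_KG_FAT

# Hypothetical person eating 2,000 kcal a day with a 1% surplus
gain = yearly_gain_kg(2000, 0.01)
print(f"{gain:.2f} kg per year")
```

Under those assumptions a surplus of only 20 kcal a day comes to roughly a kilogram a year, so a gut community that is even slightly better at harvesting calories could plausibly matter over a decade.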

This is all really interesting, but it doesn't solve any mysteries; it just raises new ones:

Another unknown is why and how the make-up of the microbiota is shifted by differences in body weight. Given that acquiring food from the environment can be both calorically expensive and potentially dangerous, it would seem to be most adaptive to extract as many calories from every bite of food as possible. Moreover, if caloric extraction does become more efficient, the regulatory system would dictate that the organism responds by reducing its caloric intake. If a host organism had the ability to change its microbiota so as to increase caloric extraction, it would seem most adaptive to do so when facing famine conditions and losing weight. However, the data indicate just the opposite - the microbiota seems to be more efficient in obese humans who already have the most stored energy, and shifts to being less efficient as the subjects lose weight (Bajzer and Seeley 2006:1010).

Considering that the obese mice studied here were specifically engineered to be leptin-deficient, it would seem that one likely hypothesis is that leptin serves as part of a feedback system altering the balance of the gut flora.

My hypothesis would be that obese people have more efficient gut flora because the gut flora of obese people have to be more efficient to compete for nutrients.

That's ecology again -- it would be a mistake to view the gut ecology without considering the host. I also think that time may be an important element here: one of the major factors determining digestive efficiency is gut transit time, and the effects of a change in microbial populations often include changes in the transit time of food through the digestive system. Since the human microbial samples were from stool, differences in transit time in the upper digestive tract may make an important difference to the abundance of certain microbial nutrients in the cecum or colon.

It sure seems like an interesting problem, and one that we increasingly have the tools to tackle. Human microbial anthropology might be a stretch, but who knows?

References:

Bajzer M, Seeley RJ. 2006. Obesity and gut flora. Nature 444:1009-1010.

Ley RE, Turnbaugh PJ, Klein S, Gordon JI. 2006. Microbial ecology: human gut microbes associated with obesity. Nature 444:1022-1023.

Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. 2006. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444:1027-1031.


Neandertal genome FAQ

With the release of the initial two papers describing chromosomal DNA sequences from a Neandertal, I thought I would put together some frequently asked questions and answers to them. I actually have been frequently asked most of these questions this week -- mostly by journalists -- so I think this is a good list.

I'll be following up over the next few weeks with additional details, particularly as some of our own work moves forward. I've left some loose ends dangling here deliberately -- sometimes for the sake of brevity, in other cases because they await further developments.

UPDATE (11/17/2006): I'm editing through this, making changes here and there to make things clearer. So as this progresses, it won't be identical to the initial version, although changes will be minor.

There are two papers in two journals, by two different teams of people. What's the difference?

Both teams used samples from the same specimen, Vindija (Vi) 80 -- so in principle, they are sequencing the same genome. The difference between the two comes from their methods of sequencing the DNA.

The Rubin group (Noonan et al. 2006) is using a metagenomics method based on the creation of a clone library from the ancient DNA. To make a clone library, DNA from a sample is cut with a restriction enzyme, which cuts the DNA at every place that it displays the same short sequence (usually 4- or 6-bp sequences, such as "ATTA"). The short fragments of DNA are mixed together and bound to vectors that can be maintained and replicated in cells. This is the "cloning" process, and the "library" consists of all the short fragments, which (hopefully) overlap each other so that they can be reconstructed.
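As a toy picture of what "cutting at every place that displays the same short sequence" means, here is a minimal sketch of an in-silico digest. It is a simplification: real enzymes cut at a fixed offset within their recognition site and often leave overhanging ends, and "ATTA" is just the illustrative site from the paragraph above, not a claim about any particular enzyme. The function name and the input sequence are invented.

```python
def digest(sequence, site):
    """Cut `sequence` immediately before every occurrence of `site`.

    Returns the list of fragments in order; joining them gives back
    the original sequence.
    """
    fragments = []
    start = 0
    # Start searching at 1 so a site at position 0 does not yield an empty fragment
    pos = sequence.find(site, 1)
    while pos != -1:
        fragments.append(sequence[start:pos])
        start = pos
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments

# Invented sequence with two copies of the illustrative site
pieces = digest("GGATTACCATTAGG", "ATTA")
print(pieces)
```

A shorter recognition site turns up more often by chance, so a 4-bp cutter produces many small fragments where a 6-bp cutter produces fewer, longer ones.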

People have made libraries for a long time. For example, the entire mRNA complement in a given tissue type may be made into a library of complementary DNA (cDNA). Once the library is made, it can be probed with short, labeled DNA sequences to assess whether a given gene is expressed in that tissue type. Or contrariwise, after cDNA from the library is sequenced, it can be used to design probes to find where in the genome it came from.

The unique aspect of the metagenomic approach is that all DNA sequences from a sample will be included in the library, potentially sequenced, and ultimately reconstructed with computers into separate genomes. Usually cloning is preceded by an amplification step (generally using PCR), which selects and amplifies DNA of particular interest for cloning. But metagenomic methods skip this amplification -- because they cannot predict in advance what they are looking for. One of the most important early applications of metagenomics has been to reconstruct the genomes of microbes that cannot be cultured. Even though these organisms are not amenable to keeping in laboratory colonies, their genomes can be reconstructed by sampling their environments -- for example, soil or pondwater.

Or fossils. For the Vindija 80 fossil, the extract includes only around 6 percent identifiable "primate" DNA sequences. Out of the roughly 20 percent that are identifiable at all, over half are microbial.

I suppose if you were interested in the long-term microbial decomposition of fossil bone, you could do your dissertation on those. For the rest of us, the final step is to let the computer spit out the humanlike sequences, which are assumed to be the Neandertal DNA plus some proportion of human contamination.

In contrast, the 454 group (Green et al. 2006) used a method called bead-based emulsion PCR. That is a mouthful, so it bears some explanation (for which I'm paraphrasing material from Margulies 2005 and Ronaghi 2001).

The "polymerase chain reaction," or PCR, is a method of replicating many copies of a DNA sequence from a single template. Usually to do PCR, you design a "primer," which is a short sequence of DNA that causes the target sequence to be preferentially replicated by the DNA polymerase. With a number of heat cycles and sufficient primer, you end up with a whole lot of copies of just the piece of DNA that you want.

This is, of course, exactly why standard PCR is so problematic for ancient sequences. There, you can't get exactly what you want, because it is broken into tiny bits and damaged. You would be happy to get anything. But if you amplify everything together in one giant vat, then the less damaged sequences will be the ones that amplify preferentially, and these are going to be worthless to you because they all represent contaminants of various kinds, like microbial DNA or modern human sequences.

The 454 method attaches all the tiny bits of sequence to tiny beads and separates these beads into water droplets suspended in oil. The droplets are the "emulsion" part, and by separating the beads in this way, the process can employ PCR while keeping all the tiny sequences separate from each other. Because they are kept separate, one good sequence can't swamp out all the others in the solution. The PCR products all stick to the bead so that after they come out of the emulsion the copies of different sequences are still separate.

After PCR, the DNA is broken down into single strands, still attached to their beads, and the beads are deposited on a fiber-optic slide assembly. The slide has tiny wells that are optically connected to a light-sensing CCD, which is essential for the "pyrosequencing" step. Nucleotides flow across the slide and into these wells one after another (T, A, C, then G). When the DNA polymerase connects one of these nucleotides to the single-strand DNA in a well, it releases a molecule of pyrophosphate (PPi).

That's when the magic happens. The solution also contains luciferase -- the enzyme that makes fireflies glow. With some additional chemistry, the PPi gives a burst of energy to the luciferase, which then emits a spark of light. The CCD picks up the light, which is a signal that the nucleotide was incorporated into the sequence.

Since nucleotides are added only every few seconds, a clever person with a notebook could reconstruct the sequence of the DNA fragment in each well. The real trick is that the fiber-optic slide contains well over a million wells, all being sequenced simultaneously. As the CCD picks up the series of flashes from every well, the system is tracking many megabases of DNA in every run.
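Here is what the "clever person with a notebook" would be doing, as a minimal sketch. It simulates the light signal per nucleotide flow for one well and then reads the sequence back off the signals. It is idealized: it tracks the strand being synthesized rather than the complementary template, and it treats the flash for a run of identical bases as an exact integer, when in practice long runs are hard to resolve. The flow order is the one given above; the function names are mine.

```python
FLOW_ORDER = "TACG"  # order in which nucleotides are flowed, per the text

def flowgram(read, order=FLOW_ORDER):
    """Idealized light signal for each successive nucleotide flow.

    Each entry is the number of bases incorporated during that flow:
    0 means no flash, 2 means a double-strength flash, and so on.
    """
    if set(read) - set(order):
        raise ValueError("read contains bases outside the flow order")
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = order[flow % len(order)]
        run = 0
        # The polymerase extends through every consecutive matching base
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals

def base_call(signals, order=FLOW_ORDER):
    """Reconstruct the sequence from per-flow signal strengths."""
    return "".join(order[i % len(order)] * n for i, n in enumerate(signals))

signals = flowgram("TTAGC")
print(signals)
print(base_call(signals))
```

Multiply that bookkeeping by every well on the slide and you have the output of a run; the hard parts in practice are the chemistry and the signal processing, not the logic.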

At present, this is the fastest method of DNA sequencing on the planet. It can construct the complete genome of a microbe in a couple of hours.

If the 454 sequencing method is so much faster, then why would anybody ever want to build clone libraries?

The claim is that the library approach is superior as a way to probe for specific genetic loci. For instance, here's a passage from p. 1071 of the Pennisi article:

[Rubin] envisions several libraries, each from a different Neandertal. Researchers would pull out the same fragment from each library to compare with each other and with living people. A pilot project has already demonstrated probes that ferret out specific target sequences, so the team needn't analyze the billions of bases shared by Neandertals and living humans, or among different Neandertals. "We will be able to identify and confirm sequence changes in more than one Neandertal without having to sequence several Neandertals to completion," Rubin says. "Seeing the same change in multiple Neandertals will give us confidence that we got [the sequence] right."

This sounds similar to the study earlier this year that found Mc1r variants in different mammoths, but in fact that study used direct PCR rather than cloning (I suppose because they have a heck of a lot more mammoth tissue to work with!).

It's not obvious to me that this is really that much of an advantage. I mean, it's certainly true that we really want to sample some genes (like MCPH1) from several different Neandertal fossils. But I don't see any point to drilling into fossils for this purpose without also sequencing their full genomes.

Now, somebody will say, "Well, sequencing the full genome of every fossil is just too expensive. We can limit the work to just a few genes much more cheaply, and we can use the same samples later to sequence other genes, or whole genomes."

Personally, I don't see the rush. These fossils were in the ground for 40,000 years, and they're not going anywhere. If we can sequence whole genomes cheaply in 10 or 20 years, and additionally have better means of dealing with contamination, I don't see why we shouldn't just wait. Training graduate students in metagenomics is not a good enough reason to work on these rare fossils.

One may say that the same samples will be sufficient for later sequencing of whole genomes, or other genes, or Neandertal athlete's foot fungus, or whatever, but in my experience it somehow never works out that way. Somebody is always coming back to grind up, dissolve, or laser ablate more bone.

In fact, if I were looking to make the next advance in metagenomics, I would take some of that mammoth flesh, mix in some elephant blood, and find ways to resolve the parts of the resulting mix. That would be something.

Are you saying you are against destructive sampling of these fossils?

Not at all. In fact, I think that genomics gives the most compelling reason ever for grinding up more bones.

There is just a huge quantity of information from DNA sequences; far more than from the morphology -- especially for samples like bone fragments or isolated teeth.

Heck, if the devil came to me and said I could have the full genome sequence of every fossil if I would agree to their destruction, I think that would be a good bargain!

But it's pretty clear that we're not in that situation. We can have our cake and eat it too -- and the longer we wait, the cheaper and less destructive this is likely to be. And frankly, just one Neandertal genome is going to give us plenty to work on for a long time.

But then, I was trained as a fossil guy, and I'm used to working with a few bits and pieces. It gives me a natural advantage!

They say there's no significant evidence of interbreeding. Yet you told us last week that there is significant evidence of interbreeding. What gives?

A few years ago I gave a talk where I laid out what I saw as the problems interpreting nuclear DNA sequences from Neandertals. Now, this was long before we had any reasonable prospect of getting such sequences, so it was purely based on knowledge about human genetic variation. As I saw it then, there were two problems:

  1. Human mtDNA is really variable, with greater than 1 percent sequence divergence between people, and much higher in some places. In contrast, human nuclear DNA has less than one base pair in a thousand different between copies. To get a reasonable picture of variation among people, you need long nuclear sequences so that you will find polymorphisms. But ancient DNA is broken into short little sequences that are very difficult to reconstruct. With mtDNA, this is less of a problem because it is clonal and a person basically has one sequence in many copies. But most nuclear DNA (all autosomal DNA) exists in two, possibly different copies. So reconstructing long enough sequences to study polymorphisms is very difficult.
  2. The coalescence age of human mtDNA is only a couple hundred thousand years, so sampling ancient humans is sort of likely to result in sequences that lie outside this range of variation -- and with Neandertals, that is precisely what happened. But nuclear loci have coalescence ages on the order of 600,000 to 2 million years or older. With these dates, the diversity among living people must significantly predate any divergence of archaic humans for most nuclear genetic loci. This means that Neandertals ought to have shared a high proportion of polymorphisms that are still variable in humans. Since we can expect that Neandertals will not be very genetically divergent for these nuclear genes, compared to the genetic differences among living people, we can conclude that no gene is likely to tell us very much about the phylogenetic relationships of an ancient Neandertal with living people.

These two problems are still stumbling blocks for interpreting Neandertal sequences. But the research teams found a very clever way to circumvent them, by using genomic approaches instead of genetic approaches.

If you've been scratching your head wondering exactly why "genomics" has a buzz, then this is a good example.

Because of projects like the HapMap and the chimpanzee genome project, we know a lot (not everything, but a lot) about human genetic polymorphisms and our genetic differences from chimpanzees. In fact, we have databases of human single nucleotide polymorphisms (SNPs), and human-chimpanzee comparisons. For each SNP, some humans have an ancestral nucleotide -- generally the one that chimpanzees have. Other humans have a derived nucleotide -- the one that appeared in some ancient human, and different from chimpanzees.

For the most part, derived SNP alleles are recent. A few of them are very old, and these tend to be found at high frequencies (because the person who originated them had lots of descendants in that time). But many more of them are recent, found in a relatively small number of people today, who descend from a common ancestor during the past couple hundred thousand years.

If Neandertals diverged from humans over 200,000 years ago, and they didn't interbreed after that time, then the Neandertal genome should have relatively few derived human SNPs. In contrast, if the two populations continued to interbreed after 200,000 years ago, they might share fairly many of these derived SNPs.

Hence, we have a potential test for Neandertal-human genetic interactions.
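The bookkeeping behind that test can be sketched as follows. This is only an illustration with invented data, not the procedure either team used: it assumes the chimpanzee base marks the ancestral state, and it sets aside any site where the chimpanzee base is not one of the two human alleles or where the Neandertal read matches neither. The function names and the example SNPs are hypothetical.

```python
def classify_allele(observed, chimp_allele, human_alleles):
    """Label one Neandertal base at a known human SNP.

    Returns 'ancestral', 'derived', or 'uninformative'.
    """
    if chimp_allele not in human_alleles or observed not in human_alleles:
        # Cannot polarize the SNP, or the read matches neither human allele
        # (possibly sequencing error or DNA damage).
        return "uninformative"
    return "ancestral" if observed == chimp_allele else "derived"


def derived_fraction(snps):
    """Fraction of informative SNPs at which the Neandertal is derived."""
    calls = [classify_allele(obs, chimp, humans) for obs, chimp, humans in snps]
    informative = [c for c in calls if c != "uninformative"]
    if not informative:
        return None
    return informative.count("derived") / len(informative)


# Invented example: (Neandertal base, chimpanzee base, human alleles)
example_snps = [
    ("A", "A", {"A", "G"}),  # ancestral
    ("G", "A", {"A", "G"}),  # derived
    ("T", "C", {"C", "T"}),  # derived
    ("C", "C", {"C", "T"}),  # ancestral
    ("G", "A", {"A", "C"}),  # matches neither human allele
]
print(derived_fraction(example_snps))  # 0.5
```

The interesting part is not the counting but the expectation you compare it with: how many derived alleles a simple split model predicts depends on how the SNPs were chosen in the first place.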

Noonan et al. (2006) looked for these derived SNPs and found very few of them. They concluded that there was no significant evidence of Neandertal-human interbreeding, although their statistical test couldn't rule out as much as 25 percent admixture (for reference, Plagnol and Wall 2006 estimated only 5 percent ancestry from all archaic humans, not only Neandertals).

Green et al. (2006) also looked for derived SNPs. They had a much bigger sample of DNA to work with, so they ought to have a stronger test. Here's what they wrote (p. 334):

Using the SNPs that overlap with our data from two large genome-wide data sets (HapMap, 786 SNPs and Perlegen, 318 SNPs), we find that the Neanderthal sample has the derived allele in 30% of all SNPs. This number is presumably an overestimate since the SNPs analysed were ascertained to be of high frequency in present-day humans and hence are more likely to be old. Nevertheless, this high level of derived alleles in the Neanderthal is incompatible with the simple population split model estimated in the previous section, given split times inferred from the fossil record. This may suggest gene flow between modern humans and Neanderthals. Given that the Neanderthal X chromosome shows a higher level of divergence than the autosomes (R.E.G., unpublished observation), gene flow may have occurred predominantly from modern human males into Neanderthals. More extensive sequencing of the Neanderthal genome is necessary to address this possibility.

If this observation holds (i.e., if it is not influenced by contamination, and the ascertainment function does indeed show this to be an excess of derived SNPs), then it is one of the strongest pieces of evidence for genetic intermixture of Neandertals and modern humans. Note that there are two avenues for this gene flow -- either from the ancient ancestors of modern humans into Neandertals, or out of Neandertals into early modern humans. I'm sure we will hear more about this when they have more sequence.

In the meantime, the other source of evidence about Neandertal-human genetic interaction is the genomic variation of living people. Last week's paper on MCPH1 (discussed here) is a good example of what that evidence looks like. The key feature is that if you troll through the genome, you begin to notice some loci with interesting genealogies. The interestingness is a combined signature of recent selection and ancient population structure.

Looking for genes like MCPH1 in the Neandertal genome is a no-brainer. We probably won't find a lot of them, because the Neandertals were a small subset of the ancient human population.

There is one further problem. We can recognize these interesting loci in living people because they lie on relatively long haplotypes with little recombination. The inference is that such an allele must have begun from a very low copy number around 30,000 years ago, presumably because it was introduced from some archaic population. But the SNPs that are presently linked to the selected site were probably polymorphic within the archaic population, not fixed on a long haplotype. Unless we know exactly which SNP is the selected site on a human allelic variant, we may have some trouble telling whether an archaic genome has the allele. And as I note below, a large proportion of SNPs are going to be missing from the draft Neandertal genome even when it reaches an average 1x coverage.

This just means that evidence from the genomics of living people and from the Neandertal genome won't mesh together seamlessly. There remains some complexity interpreting these relationships.

The divergence date of Neandertal and human sequences is estimated at around 520,000 years ago. What does that mean?

First, what it doesn't mean. It doesn't mean that the human and Neandertal populations diverged 520,000 years ago. I noted above that the estimate of the genetic divergence time comes from the proportion of chimpanzee-human differences for which the Neandertal shares the human allele. But of course, some living humans have the ancestral, chimpanzee-like allele for many polymorphisms, so this comparison of polymorphisms is not saying that Neandertals were like chimps. Instead, we are just disregarding the Neandertal-specific evolutionary events.

I'm sticking with the 520,000 year genetic divergence estimate from Green et al. (2006), instead of the older estimate from Noonan et al. (2006), because of the vastly larger sample in the Green paper. Still, most of the discussion does not hang too critically on the precise date; although the date changes the interpretation by degrees.

The really interesting observation is the Neandertal-human genome draft difference compared to the human-human difference. Here's a passage from p. 354 of Green et al. (2006):

We analysed the DNA sequences generated from a contemporary human using the same sequencing protocol as was used for the Neanderthal. Although ancient DNA is degraded and damaged, this comparison controls for many of the aspects of the analysis including sequencing and alignment methodology. In this case, 7.1% of the divergence along the human lineage is assigned to the time subsequent to the divergence of the two human sequences. The average divergence time between alleles within humans is thus 459,000 years with a 95% confidence interval between 419,000 and 498,000 years. As expected, this estimate of the average human diversity is less than the divergence seen between the human and the Neanderthal sequences, but constitutes a large fraction of it because much of the human sequence diversity is expected to predate the human-Neanderthal split. Neanderthal genetic differences to humans must therefore be interpreted within the context of human diversity.

They don't specify where this "contemporary human" was from. The draft human genome is a chimera made up of anonymous people from different populations. That means that wherever the "contemporary human" is from, it will be the same region as represented by some part of the draft genome, but not all. So the divergence between these two mystery sequences is likely to be greater than average within a single population, and less than average between different populations.

Keeping that in mind, the human-Neandertal difference is startlingly close to this human-human difference measurement. The Neandertal is only 10 percent more different from the draft human genome than these two human sequences are from each other.

It seems very likely that we will find pairs of living human populations where the average genetic divergence is older -- maybe much older -- than this human-Neandertal divergence. For instance, it seems almost certain that the great genetic variability among living African groups will exceed this human-Neandertal difference.

Some geneticists have noted that European and Asian populations seem to be a genetic "subset" of African populations, at least for many genetic loci. With these kinds of numbers, it looks like Neandertals may be a subset of living human diversity in the same sense. I've never much liked that formulation, because "subset" is never really an accurate description of the genetic relationships. But if the seat of living human diversity is Africa, adding Neandertals to the mix may not change that pattern at all.

As Green and colleagues note, most of the genetic divergence between humans and Neandertals, and between humans and other living humans, is actually much older than the divergence of these populations from each other.

At one limit (that is, assuming complete isolation of humans and Neandertals after some date), the population divergence time depends on the effective size of the population that was ancestral to living humans and Neandertals. It is basically not possible to obtain a good estimate of this ancestral effective population size from the current Neandertal data -- mainly because good estimates depend on heterogeneity in divergence times among loci, which we can't infer for the short Neandertal sequences.

Both papers assume that this ancestral effective population size was small -- even smaller than the long-term human effective population size of around 10,000 individuals. A smaller effective size for the human-Neandertal ancestral population is fairly unlikely, though, since it must have been distributed across large parts of Europe and Africa at a minimum. More likely, the effective size was close to 10,000, just as in humans, since the human effective size is inferred to have been that small over at least the past million years.

If you're reading the term "effective population size" for the first time, don't worry. It doesn't mean "population size", and it has mainly a technical genetic meaning. It is sort of important that the Neandertal sequence supports this particular effective size over the long term, but it will take another post to explain why.

As noted above, the populations may never have been isolated. The derived SNP evidence might suggest that there was never any population divergence, or at least no long period of complete isolation, between humans and Neandertals. We'll have to wait and see.

Why does this bone have such a low level of contamination compared to other Neandertals?

I should start by pointing out that "contamination" here means "modern human sequence". All fossil bones are loaded with exogenous DNA, like bacterial and fungal genomes that invaded after the animal died. From a certain point of view, those exogenous genes are contaminants -- we are generally not interested in their sequences, and sorting them out from the endogenous Neandertal DNA is a real nuisance. But because we have a reference genome from humans to compare with the sequences from the ancient bone, we can sort out these bacterial and other exogenous sequences. So although they do "contaminate" the bone, they don't distort our picture of the sequence.

The real problem is that there are contaminating sequences from recent humans in the ancient bones. These sequences come from excavators, anthropologists who studied the bones, museum personnel, graduate students who cleaned and prepared the bones for sequencing, other samples from the labs doing the work, and who knows where else.

I have been asked many times why they can't eliminate this contamination. For example, why can't they just clean the bone, or take samples from deep inside the bone, or take samples from deep inside of teeth, or use a clean room, yada yada yada.

The answer is that they do wash the bones, and they do eliminate the outer surface, and they do take samples from deep inside of bones, and they do work in a clean room, with ultraviolet lights and positive air pressure so that DNA can't get sucked into the room, and rubber gloves and bunny suits, and the whole nine yards. And the bones are still contaminated, deep inside them.

Now, you may imagine anthropologists picking their noses with the bones, and using them as chopsticks, and putting them up to their ears to hear them breathing, and all manner of other things. The truth is, I have no idea how the contamination gets in there, and neither does anybody else. It's just there, and apparently we can't avoid it.

The extraction team looked at lots of Neandertal specimens, with one question in mind: How much human contamination does this bone have? To answer this question, they amplified mtDNA sequences, and assessed what proportion of the amplified sequences were Neandertal-like and what proportion were human-like. Vindija 80 stood out as having a very low proportion of human-like sequences -- less than 2 percent. So they inferred that there was little contamination of the sample by recent human DNA, and are working under the assumption that the nuclear genome is contaminated in a similar low proportion.

As for why this particular bone has such low contamination, well, nobody really knows that either. Svante Pääbo speculates that it is because Vi 80 was originally identified as fauna and hasn't been handled much. He might well be right. Which would bring us back to the nose-picking chopstick bone theory, I suppose.

If Vindija 80 was put in a box with fauna, it can't be very diagnostic. This high preservation seems very unusual. How do they know it was a Neandertal?

The radiocarbon date is 38,310 +/- 2130, and they found very high preservation of a Neandertal-like mtDNA sequence. If you think that fails to answer the question, well...

How can they deal with the damage to ancient DNA sequences?

One of the things that has become clear about ancient DNA research is that DNA from ancient fossils undergoes various kinds of damage. The most obvious is the fragmentation of the DNA into very small pieces, a problem that both the sequencing approaches have been designed to circumvent.

But a more serious problem is that some bases become degraded over time in ways that cause the sequencing methods to misidentify them. For example, cytosine (the "C" base) can be chemically modified over time into a base called uracil, which sequencing methods misidentify as a thymine (the "T" base).

There seems to be no way to tell which base pair changes are diagenetic (i.e. DNA damage-induced) and which are genuine Neandertal changes.

So, the teams took a radical approach: just ignore all the changes that are possibly damage. Instead of analyzing Neandertal-specific changes, they decided to assess the status of human polymorphisms and human-chimpanzee differences in the Neandertal sequence. This method is how they estimated the Neandertal-human genetic divergence time, for example -- because the Neandertals have approximately 96 percent similarity with humans for human-chimpanzee genetic differences, it is possible to infer that their genes diverged from the average human gene only 4 percent of the evolutionary time separating humans and chimpanzees. The research teams assumed that humans and chimpanzees are separated by 13 million years of evolution -- this includes the time on both the human and chimpanzee lineages since their common ancestor, assumed to be 6.5 million years ago. These dates and genetic differences produce an estimate of around 520,000 years ago for human-Neandertal genetic divergences.
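The arithmetic is simple enough to write out. The snippet below uses only the rounded figures from the paragraph above (a 6.5-million-year split and a 4 percent difference), so it is a back-of-the-envelope check rather than a reproduction of the papers' calculations. The constant and function names are mine.

```python
HUMAN_CHIMP_SPLIT_YEARS = 6.5e6  # assumed age of the common ancestor
# Time on both lineages since the common ancestor: 13 million years.
TOTAL_BRANCH_YEARS = 2 * HUMAN_CHIMP_SPLIT_YEARS


def genetic_divergence_years(fraction_not_shared,
                             total_branch_years=TOTAL_BRANCH_YEARS):
    """Scale the human-chimpanzee separation by the share of human-chimpanzee
    differences at which the Neandertal does not match the human base."""
    return fraction_not_shared * total_branch_years


# 96 percent shared, so 4 percent not shared.
estimate = genetic_divergence_years(1 - 0.96)
print(f"{estimate:,.0f} years")  # 520,000 years
```

Changing the assumed split date rescales the answer in direct proportion, which is why the estimate moves by degrees with the calibration rather than hanging on the Neandertal data alone.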

In the long run, it should be possible to sequence the genome with multiple coverage, which would allow damage to be resolved. With many copies, the damage to any individual DNA sequence will be unique, while changes that are evident in multiple copies are probably real.
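Here is a minimal sketch of that idea, assuming the reads have already been aligned so that each position is just a pile of observed bases. The depth and agreement thresholds are arbitrary choices of mine, not values from either paper.

```python
from collections import Counter


def consensus_base(observed_bases, min_depth=3, min_fraction=0.75):
    """Call one position from the bases seen in overlapping reads.

    Returns 'N' when there are too few reads or too little agreement.
    """
    if len(observed_bases) < min_depth:
        return "N"
    base, count = Counter(observed_bases).most_common(1)[0]
    if count / len(observed_bases) >= min_fraction:
        return base
    return "N"


def consensus(columns, **thresholds):
    """Call every position; each column is a string of observed bases."""
    return "".join(consensus_base(col, **thresholds) for col in columns)


# Invented pileup: the lone T in the second column looks like damage
# (a C read as T), while the third column is consistently T.
columns = ["CCCC", "CTCC", "TTTT", "CT", "GGAG"]
print(consensus(columns))  # CCTNG
```

A simple majority rule like this would also blank out genuine heterozygous sites, where the two copies in the bone really do differ and the reads split roughly evenly, so it addresses damage but not the diploid problem.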

But we are quite a ways from the long run, so for the time being we have to deal with DNA damage. For individual genes, it may be possible to reason exactly what effects changes would have and thereby arrive at a conclusion about which changes are diagenetic. For instance, only a minority of such changes will affect coding regions, and some of those will be synonymous changes, so only a small proportion will make amino acid changes, and if there are only a couple of these per gene the resulting protein structure may be able to be analyzed. So from a functional perspective, it should be possible to work with damaged sequence.

The main problem is from the statistical perspective (i.e., assuming neutrality), and here I think the teams have taken a very reasonable approach by just throwing the changes out.

Will they really be able to sequence the full Neandertal genome in two years?

I got a lot of questions from journalists on this point. I really see no reason to doubt it -- they know their average sequence yield from a given amount of extract, and the proportion of that yield that is actually Neandertal DNA.

The main caveat is a statistical one: 3 billion base pairs of sequence is -- on average -- one full coverage of the genome, but in practice some loci will be sequenced many times, while a fairly large proportion (a bit over 30 percent) won't be sequenced at all.

A billion missing bases may not seem like a big deal, but there is a catch: the short average fragment size means that the missing patches will be distributed throughout every gene. Since the average gene covers a region of a few kilobases, complete gene sequences will be pretty rare -- most will have gaps in them amounting to around 30 percent of their length.

Or to put it another way, a bit more than 30 percent of informative SNPs in humans will not be represented in the first Neandertal genome draft.
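The "bit over 30 percent" figure falls out of a standard idealization: if reads land uniformly at random across the genome, the number of reads covering any given base is approximately Poisson, and the chance that a base is never covered at average coverage c is e^(-c). The sketch below assumes exactly that. Real ancient DNA will not be so evenly distributed, so treat the numbers as a rough guide.

```python
import math

GENOME_SIZE_BP = 3_000_000_000  # round figure used in the text


def fraction_uncovered(mean_coverage):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-mean_coverage)


for coverage in (1, 2, 10):
    missing = fraction_uncovered(coverage)
    print(f"{coverage}x: {missing:.4%} uncovered, "
          f"about {missing * GENOME_SIZE_BP:,.0f} bases")
```

At 1x the uncovered fraction is e^(-1), close to 37 percent, or a little over a billion bases. At 10x the same formula leaves only a few thousandths of a percent uncovered.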

A second issue is that the genome of Vindija 80 is not haploid -- there are two copies of most everything in that bone. Some of these copies were polymorphisms in Neandertals, and if these are reconstructed into a single sequence, there will be mixed-up haplotypes. This means that it will be difficult, if not impossible, to assess whether there were functional multi-SNP differences between the human and Neandertal sequences of particular genes.

Anyway, that's probably getting beyond ourselves. No doubt somebody will think of some way to improve these problems; and it will eventually become cheap enough to do 10x coverage instead of 1x coverage.

They're already making plans to clone Neandertal super-soldiers, aren't they?

Maybe unsurprisingly, this question about Neandertal cloning is the one most journalists so far have wanted to ask me. I'm sure they're asking everybody, hoping that somebody will slip a really pithy quote for them.

Since I have clones here at home, I can't bring myself to get too worked up about it. A Neandertal clone army would definitely be an improvement over a Neandertal Jar-Jar.

Personally, I have another problematic scenario in mind, which I am developing elsewhere.

References:

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444:330-336.

Lambert DM, Millar CD. 2006. Evolutionary biology: Ancient genomics is born. Nature 444:275-276.

Margulies M and 55 others. 2005. Genome sequencing in microfabricated high-density picolitre containers. Nature 437:376-380.

Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, Rubin EM. 2006. Sequencing and analysis of Neanderthal genomic DNA. Science 314:1113-1118.

Pennisi E. 2006. The dawn of stone age genomics. Science 314:1068-1071.

Römpler H and 8 others. 2006. Nuclear gene indicates coat-color polymorphism in mammoths.

Ronaghi M. 2001. Pyrosequencing sheds light on DNA sequencing. Genome Res 11:3-11.

Schloss PD, Handelsman J. 2003. Biotechnological prospects from metagenomics. Current Opinion in Biotechnology 14:303-310.

DOE genomics

Linked on Evolgen, I found this post from Nobel Intent that gives a quick summary of reasons the U.S. Department of Energy is in the genomics business. It's a good rundown, including radiation research into mutations and research into new biofuels. It might also mention the interest in finding microbial agents to clean up chemicals of various kinds. As the Neandertal metagenomics stuff starts coming online, some folks might be interested in the history of DOE involvement in genomics, and this is a good place to start.

Quote of the day

Metagenomics maven Eddy Rubin, on grinding up some more Neandertals, in Wired:

I need to get more bone ... I'll go to Russia with a pillowcase and an envelope full of euros and meet with guys who have big shoulder pads. Whatever it takes.

Human Genome Project afterglow

I was reading The Scientist because RPM sent me to this article, titled "The Human Genome Project +5".

And yet the last five years, in Olson's view, have been "a period of a great grinding of gears, kind of shifting of gears." In the terms of the science historian Thomas Kuhn, it's been "a period of consolidation and more normal science." Others, such as Sydney Brenner of the Salk Institute, the Nobel Prize-winning pioneer of the worm, Caenorhabditis elegans, go further, worrying that the genome sequence and the growing lists of sequences and proteins and protein interactions and functional elements don't get very deep into such core problems of biology as the operations of the cell, of development from egg to adult, or the problem of consciousness. "We've become very geno-centric," says Brenner. "The cell must become the focus."

I would say this is pretty much correct -- there has been a long period of normal science in genetics lately, with new findings pretty much following one after another. There have been no revolutions coming out of the HGP.

But I think this scale of examination is a bit misleading. The HGP opened the deep end of the data pool, and we are still swimming in the toddler tank.

Consider what is happening in terms of new data:

One of the most dramatic efforts to push genomics into the realm of complex, multi-genic diseases is the five-year, $138 million haplotype map (HapMap) project, involving samples donated by Japanese, Han Chinese, Yoruba, and Americans of European descent. The project takes advantage of the fact that the millions of single nucleotide polymorphisms (SNPs) found in at least one percent of humans tend to pass between generations in blocks of DNA called haplotypes. The project announced its Phase 1 analysis in October 2005, and said that the analysis of Phase 2, already completed, would be published in 2006. Despite successes, such as using HapMap data to pinpoint a gene for macular degeneration, there remains controversy over HapMap's reach into domains such as rearrangements like deletions and reversals, or the numerous rare mutations that may be involved in diseases.
The minor variations are of central interest to Bentley of Solexa, who has specialized in rare variations. The HapMap, he says, has limitations, capturing only common variations in three target populations, missing the rare mutations. But it may provide a quick way to find more disease genes. Still, in three to five years, he says, the new sequencing machines should open the option of going after virtually all the many genes involved in a disease like diabetes. To be sure, the multiple sequences of patients and "controls" will have to square with what HapMap has found. "Everything that a HapMap captures should also be captured by a technology that aims to do better." Bentley, an early proponent, calls the HapMap "a real benchmark."

There seems to be a "Moore's law" for genome sequencing:

The workhorses of the 2001 human drafts have kept doubling their throughput about every 22 months over 15 years. In September, 454 reported that, in a single run, its system did a shotgun sequence and assembly of the microbe, Mycoplasma genitalium, in four hours. Claire Fraser's team at the Institute for Genomic Research took three months to work out Mycoplasma's sequence in 1995.

And there are gene expression microarrays and microRNA assays, as described in the article. For people who want to know about gene activity at every stage of life, in every type of cell, and in response to every external stimulus, the tools are in place to figure those things out.

As for myself, I think the accumulating data will have some revolutionary effects. These won't be in genetics itself -- I think the paradigms in place now in terms of gene interactions and regulation are very powerful. No doubt some new twists in gene sequence and function will be found, but I would guess that the current picture will expand rather than being overthrown.

But for other fields, I think genetics has some revolutionary power. Obviously genomic medicine has the potential to radically change the way we approach chronic conditions. And metagenomics is already changing the way biologists study microbial communities in all kinds of environments. It wouldn't surprise me if scientists working in places like the Foja Mountains work with DNA tag samples before they do traditional taxonomy on new species.

What will happen to anthropology as a result of the HapMap? There are surprises in store...


Mozart and mammoth metagenomic manipulation

OK, I just think the Mozart skull DNA extraction is creepy. Not because identifying dead skulls is creepy in itself -- hey, I like forensic anthropology a lot more than the random person on the street.

No, I think it's creepy because of the mammoths. I got ahold of the mammoth DNA paper by Poinar and colleagues a couple of weeks ago; it's on Science Express.

Can I just say, Science Express is super-lame? I mean, a subscription wall inside a subscription wall!

The paper, on the other hand, is decidedly not lame. Here is the abstract:

We sequenced 28 million base pairs of DNA in a metagenomics approach using a woolly mammoth (Mammuthus primigenius) sample from Siberia. Thanks to exceptional sample preservation and use of a novel emulsion polymerase chain reaction and pyrosequencing technique, 13 million base pairs (45.4%) of the sequencing reads were identified as mammoth DNA. Sequence identity between our data and African elephant (Loxodonta africana) was 98.55%, consistent with a paleontologically based divergence date of 5 to 6 million years. The sample includes a surprisingly small diversity of environmental DNAs. The high percentage of endogenous DNA recoverable from this single mammoth would allow for completion of its genome, unleashing the field of paleogenomics.

Of course, they were helped a lot by the unique preservation in the sample, which was found in optimal cold conditions at the shore of Lake Taimyr. That probably cut down substantially on extraneous microbial and fungal DNA.

But the metagenomic approach makes these kinds of contaminants mostly irrelevant. In metagenomics, researchers sequence every last piece of DNA in a sample, and then figure out what all the pieces are by comparing them to genome databases. What you get is illustrated by this pie chart:

Proportion of DNA sequence from different sources in the mammoth sample of Poinar et al. (2006).

There are two beautiful things about this graph. One is that, although there happens to be a lot of mammoth DNA in the sample (about 45 percent), there doesn't have to be. The fact is, it doesn't really matter how much of the original stuff is there or how much junk there is; if there is any minimal level of DNA preservation from the original beast, you are going to be able to find it.

The other beautiful thing is that the ability to recognize sequence is determined not by your own work on a fossil, but by the completeness of genome databases. This means that unknown sequences just sitting on your computer after an extraction gradually, inexorably, will be identified when science gets around to sequencing the organism they came from. The 18.42 percent "unidentified" in the graph will slowly reduce over time. Now, almost none of that will be mammoth-relevant information, but it's still pretty cool.
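To make the database point concrete, here is a toy sketch in Python. It is not the pipeline Poinar and colleagues used; the sequences, the labels, and the k-mer matching rule are all invented for illustration, and real work relies on proper alignment tools against full genome databases. The idea is just that a read sitting as "unidentified" gets a label the moment its source organism shows up in the database.

```python
from collections import Counter

def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, database, k=8, min_shared=3):
    """Label a read with the reference sharing the most k-mers with it.

    Reads sharing fewer than min_shared k-mers with every reference
    are labelled 'unidentified'. Both thresholds are arbitrary toy values.
    """
    read_kmers = kmers(read, k)
    best_label, best_hits = "unidentified", 0
    for label, reference in database.items():
        hits = len(read_kmers & kmers(reference, k))
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label if best_hits >= min_shared else "unidentified"

def composition(reads, database):
    """Count how many reads fall under each label (the 'pie chart')."""
    return Counter(classify(read, database) for read in reads)

# Made-up reference sequences standing in for genome database entries.
ELEPHANT = "ACGTTGCAAGGCTTAACCGGATATCGCGTTAGCAATGGCC"
BACTERIUM = "TTTTGGGGCCCCAAAATGTGTGACACACGTGCATATGCGC"
FUNGUS = "GATTACAGATTACACCCTGAGGTCATCGGATCCTTAGGAA"

READS = [
    # Mammoth-like read: elephant sequence with one base changed.
    ELEPHANT[5:15] + "T" + ELEPHANT[16:25],
    BACTERIUM[10:30],
    FUNGUS[0:20],
]

# Today's database has no fungus entry, so the fungal read is unidentified.
database = {"elephant": ELEPHANT, "bacterium": BACTERIUM}
before = composition(READS, database)

# Later, someone sequences the fungus; the same stored read gets a label.
database["fungus"] = FUNGUS
after = composition(READS, database)

print(dict(before))
print(dict(after))
```

Nothing about the reads changes between the two runs; only the database does.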

There are two problems. One is, if the DNA preservation is poor, you are going to have to grind through an awfully large amount of bone to get any kind of good genome coverage. In this case, a small sample of mammoth bone was sufficient to sequence 13 million base pairs of mammoth DNA. But there might or might not be anything interesting in those 13 million base pairs. It is certainly possible to sequence more from more samples, and that is the point: if preservation were not as good as in this particular sample, you would have to mill major mammoth mandible to get a full genome sequence.

For mammoths, I don't see that as much of a problem. Remember the Explorers' Club, after all. I imagine a large woodchipper in some DNA lab standing ready to chomp the frosty mammoth meat.

For hominids, that will be a bit more troubling. Will we be willing to put an entire skull in the blender for a complete Neandertal genome? Or if Neandertals are well-enough preserved and we are willing to settle for less-than-full genome coverage, what about more ancient or more marginally preserved fossils, like an Atapuerca femur? Does a genome have more scientific value than a fossil object itself, if we can preserve its anatomical detail with microCT or other techniques?

Then there's the other problem: degradation. How good is the sequence? Even in the exceptionally well-preserved mammoth sample, there was substantial evidence for degradation of sequence, with around twice the number of expected C -> T transitions compared to elephant and a third or so more G -> A transitions. That's an awful lot of potential noise for anyone looking at gene function and evolution. I'm guessing what will have to be done is to simply ignore certain classes of mutations that are likely to derive from postdepositional diagenesis (that is, DNA rot). Even so, some diagenetic changes will remain hard to figure out.

The best approach may be to simply grind up more bone, making sure that each genome section is covered by multiple copies. The multiple copies allow for error correction, since it is relatively unlikely that any single diagenetic change will occur in multiple copies of a gene. The really, really good news is that given enough sample, we are very likely to get accurate genome sequences from ancient humans.
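Here is a minimal Python sketch of that error-correction idea, assuming the reads have already been aligned to the same stretch of genome. The reads, the `min_depth` cutoff, and the plain majority vote are my own illustrative choices, not anything taken from the mammoth paper.

```python
from collections import Counter

def consensus(aligned_reads, min_depth=3):
    """Call the majority base at each column of pre-aligned reads.

    Columns covered by fewer than min_depth reads are reported as 'N',
    because a single copy cannot tell real sequence from damage.
    """
    length = max(len(read) for read in aligned_reads)
    called = []
    for i in range(length):
        # Collect the bases covering this column, skipping gap characters.
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        if len(column) < min_depth:
            called.append("N")
        else:
            called.append(Counter(column).most_common(1)[0][0])
    return "".join(called)

# Four hypothetical copies of the same region. Two carry isolated
# C -> T changes of the kind expected from damage.
reads = [
    "ACGTTCGA",
    "ACGTTTGA",  # C -> T at index 5
    "ACGTTCGA",
    "ATGTTCGA",  # C -> T at index 1
]

print(consensus(reads))
```

Each damaged position is outvoted by the undamaged copies, which is exactly why deeper coverage (and so more bone) buys accuracy.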

But the whole thing raises a fairly hairy problem concerning fossil humans. It's like that commercial with the owl and the Tootsie Pop -- how many samples does it take to get the genome? CHOMP!

So what about Mozart?

Something we can do to a Neandertal, we can certainly do to bones from any historical figure. The Mozart genome, the King Tut genome, the Lincoln genome, the John Wilkes Booth genome -- we can have them all!

Today, you can have your Y chromosome sent away to find out if you are a descendant of Genghis Khan. Tomorrow, you'll be able to compare every one of your genes to Mozart. In all likelihood, some genetic variants will be associated with musical talent. The obvious next Austrian TV special will be the Mozart genotypes for any music-related genes. The less obvious step will be screening your young Juilliard candidate for genetic similarity to Mozart.

There's no way Mozart can cash in on the process. But what about living celebrities, or athletes? Subscribe to iGenes and you can find out whether your kid's genes might give him the chops for the NBA (with proper work and training, of course) or whether he should start hitting the links instead.

That's what I find creepy. And there are an awful lot of composers buried in well-known locations that could be dug up for genetic comparisons.

References:

Poinar HN et al. 2006. Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science (online early) doi:10.1126/science.1123360.

Cave bear genomics

A new article on the epub area of Science, by James Noonan (Lawrence Berkeley National Laboratory) and colleagues, describes the recovery of nuclear DNA sequences from cave bear remains. Here's the abstract:

Despite the greater information content of genomic DNA, ancient DNA studies have largely been limited to amplification of mitochondrial sequences. We describe metagenomic libraries constructed using unamplified DNA extracted from skeletal remains of two 40,000-year-old extinct cave bears. Analysis of ~1 Mb of sequence from each library showed that, despite significant microbial contamination, 5.8% and 1.1% of clones contain cave bear inserts, yielding 26,861 base pairs of cave bear genome sequence. Comparison of cave bear and modern bear sequences revealed the evolutionary relationship of these lineages. The metagenomic approach employed here establishes the feasibility of ancient DNA genome sequencing programs.

In a previous post, I covered the recent announcement that this group -- a collaboration between Max-Planck Evolutionary Anthropology and Lawrence Berkeley lab -- plans to recover genomic sequences from Neandertals. The cave bear paper gives a clear hint about how it will be done.

A Nature news article covers the bear research:

The standard practice for sequencing genes involves making numerous copies of the initial sample through a process called a polymerase chain reaction, or PCR. Subjecting ancient DNA to this does not produce good results because PCR picks up and duplicates the sequences of modern animals more efficiently. This means that bits of contaminating DNA often drown out samples from the prehistoric animal.
"The prevailing idea was that this was impossible," says James Noonan of the Lawrence Berkeley National Laboratory in California, who is lead author of the paper that appears in Science this week.
To overcome this challenge, Noonan and his colleagues decided to skip the replicating step and directly sequence the tiny amount of DNA extracted from two Austrian cave-bear bones that are more than 40,000 years old. To make sure each portion of DNA was really from the bears rather than a contaminating source, they compared each sequence produced with the genome of the dog, a modern relative of the bear.
The technologies needed to examine such tiny amounts of DNA directly, along with the reference genome from the dog, have become available to scientists only recently.
The team determined that nearly 6% of the sequences analysed from one of their animal samples belonged to ancient bear: an unexpectedly large amount. The rest of the DNA probably came from soil microbes or the palaeontologists handling the bones, the team says.

Metagenomics

The technique they are using, called metagenomics, is borrowed from environmental science. The principle is that you take a sample of organic material and look for evidence of the organisms within it by separating out all the DNA and cloning it.

This is in contrast to PCR, where you look for a specific piece of DNA from one location in the genome by designing primers that will amplify that piece preferentially. With metagenomics, you don't start out knowing what you are looking for.

Metagenomics is useful to environmental scientists, drug researchers, and others because it allows the study of DNA from organisms without being able to culture the organisms in the laboratory. You are taking DNA from the samples and inserting it into bacterial colonies using a vector, resulting in a "metagenomic library." This library consists of DNA fragments from whatever organisms were in the sample, possibly including hundreds of species. If you've heard of the idea of creating a "bar code" of DNA that could identify organisms taken from ocean water or soil samples, this is the science that is behind that idea. You don't know what you're extracting from, and you'd like a way to standardize samples so you can say.

For the cave bears, what has been done is the extraction of DNA from the sample and cloning into a metagenomic library, consisting of bacterial DNA, fungal DNA, human DNA, and some cave bear DNA. Then the lab sequences the cloned fragments to find out what they are. The ones that look bear-like, they assume are endogenous. Hence, a limitless source of cave bear genetic material.

Of course, in the case of the bears, the lab has little worry that living bears in the laboratory have handled and contaminated the remains (although I have seen cases in labs where such strange contaminations have happened...). For Neandertals, the possibility of human contamination is ever-present. That this technique skips the PCR step is very important in limiting contamination (since modern DNA amplifies much more readily than ancient DNA) but it far from eliminates the problem. The two cave bear extracts preserve a substantial amount of human sequence -- in one case a third as much human contaminant as original cave bear. It will be very hard to exclude this contamination from consideration in a Neandertal extract, which is very likely to share much of its genome with humans even without contamination.

Why did they compare with dogs? Because there is a dog genome project, but not a bear one. This is a computational comparison, not a wet one. For Neandertals, the comparison will be the same: hunting through the human genome to find segments that correspond to the Neandertal extracts.

Looking for Neandertal genomic DNA

This is new stuff, to a point, but not all that new. The original extraction of Neandertal mtDNA in 1997 used bacterial cloning to reconstruct the fragments. The history indicates that Pääbo's lab has not trusted PCR amplification in Neandertal-aged remains from the beginning, and certainly for good reason considering the very high chance of preferential amplification of contaminants.

But the metagenomics approach adds a new twist. If you aren't looking specifically for one genomic region when you extract DNA from the sample and clone it, then the results are going to be a scatter from across the genome. In this case, Neandertal genomics may really be like Forrest Gump's box of chocolates: you never know what you're going to get. With a sufficiently large sample, you could in principle find any region of the genome. But it's not obvious how much extract a sufficiently large sample would take. For the bears, around 1 megabase was sequenced from each of the two libraries, yielding around 27 kilobases of cave bear DNA in total. With more effort, a larger quantity might be obtained, but of course this would require the destruction of larger samples of bone.
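For a rough feel of the scale, here is a back-of-envelope Python sketch. It takes the abstract's figures (about 1 Mb from each of two libraries, 26,861 bp of cave bear sequence) at face value. The 3-billion-base genome size is my assumption, picked as a round mammal-scale number, and the coverage formula assumes fragments land uniformly at random across the genome, which real extracts may not do.

```python
import math

def raw_bases_needed(genome_size, coverage, endogenous_fraction):
    """Total bases to sequence so the endogenous part reaches the target coverage."""
    return genome_size * coverage / endogenous_fraction

def expected_fraction_covered(coverage):
    """Expected share of the genome hit at least once, for uniformly random fragments."""
    return 1.0 - math.exp(-coverage)

TOTAL_SEQUENCED = 2_000_000   # ~1 Mb from each of two libraries (from the abstract)
CAVE_BEAR_BASES = 26_861      # cave bear sequence recovered (from the abstract)
GENOME_SIZE = 3_000_000_000   # assumed round figure, not from the paper

endogenous_fraction = CAVE_BEAR_BASES / TOTAL_SEQUENCED
needed = raw_bases_needed(GENOME_SIZE, 1.0, endogenous_fraction)

print(f"endogenous fraction: {endogenous_fraction:.4f}")
print(f"raw bases for 1x coverage: {needed:.3e}")
print(f"genome touched at 1x: {expected_fraction_covered(1.0):.0%}")
```

Under these assumptions, even one-fold coverage takes vastly more raw sequence than the 2 Mb reported here, and one-fold coverage still leaves a good part of the genome untouched.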

Twenty-seven kilobases is a potentially interesting amount. It is large enough to give a good chance of finding genetic variants in the Neandertal sequence. Humans vary in around 1 nucleotide for every thousand, so 27 kb is a nice chunk of potential differences.

But if only one out of a thousand base pairs is different between humans, the amount of DNA degradation over time might overwhelm the actual number of changes. There is some evidence from ancient mtDNA sequences for diagenetic damage to the preserved sequences resulting in sequence changes. These are known to be diagenetic because some of them apparently occur at predictable hotspots, but the rate of this damage is not yet known, and it appears to differ between different specimens. Nuclear DNA may be more stable than mitochondrial DNA, because it is packaged by proteins into a firmer structure, but I wouldn't make any bets on it. Even so, this process of diagenetic change has the potential to be much greater than the actual rate of evolutionary differences. So it will be a terrible problem to interpret the genetic differences.
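The arithmetic behind that worry is simple enough to sketch in Python. The one-in-a-thousand figure and the 26,861 bp come from the text above. The damage rates are purely hypothetical placeholders, since the real rate is unknown and seems to vary by specimen.

```python
def expected_sites(sequence_length, per_base_rate):
    """Expected number of affected positions at a given per-base rate."""
    return sequence_length * per_base_rate

def real_share(sequence_length, difference_rate, damage_rate):
    """Expected fraction of observed changes that are real rather than damage."""
    real = expected_sites(sequence_length, difference_rate)
    damage = expected_sites(sequence_length, damage_rate)
    return real / (real + damage)

SEQUENCE_LENGTH = 26_861      # cave bear bases recovered (from the abstract)
DIFFERENCE_RATE = 1 / 1000    # about one variable site per thousand bases

real_differences = expected_sites(SEQUENCE_LENGTH, DIFFERENCE_RATE)
print(f"expected real differences: {real_differences:.1f}")

# Hypothetical damage rates, chosen only to show how the balance shifts.
for damage_rate in (1e-4, 1e-3, 1e-2):
    share = real_share(SEQUENCE_LENGTH, DIFFERENCE_RATE, damage_rate)
    print(f"damage rate {damage_rate:g}: {share:.0%} of changes are real")
```

Once the damage rate matches the real rate of difference, half of what you see is rot; if it is ten times higher, nearly all of it is.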

Noonan et al. (2005:3) observe this problem in the cave bear sequence:

The substitution rate we estimated for cave bear is higher than that in any other bear lineage. On the basis of results from PCR-amplified ancient mitochondrial DNAs, cytosines in ancient DNA can undergo deamination to uracil, which results in an excess of G to A and C to T (GC-AT) transitions (22). The inflated substitution rate in cave bear is likely due to an excess of such events, since many of the substitutions assigned to the cave bear lineage are GC-AT transitions (Fig. 3A). These presumably damage-induced substitutions complicate phylogenetic reconstruction and the identification of functional sequence differences between extinct and modern species.

They argue that the diagenetic changes may be excluded if they occur in a subset of the clones, as they apparently do in this case. They merely leave out the clones with high rates of GC-AT transitions, and their results look more normal. This helps to reduce the problem, if the changes are concentrated in certain clones, but it cannot eliminate it.
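A sketch of that kind of filter in Python, under loud assumptions: the sequences are invented, the clones are taken to be already aligned base-for-base to a modern reference, and the 10 percent cutoff is arbitrary. The paper assigns substitutions to lineages before counting them; this toy version simply treats any reference C that reads as T, or reference G that reads as A, as damage-like.

```python
DAMAGE_LIKE = {("C", "T"), ("G", "A")}  # (reference base, clone base)

def transition_counts(clone_seq, reference_seq):
    """Return (GC->AT transitions, total mismatches) for two aligned sequences."""
    gc_at = mismatches = 0
    for ref_base, clone_base in zip(reference_seq, clone_seq):
        if ref_base != clone_base:
            mismatches += 1
            if (ref_base, clone_base) in DAMAGE_LIKE:
                gc_at += 1
    return gc_at, mismatches

def keep_clone(clone_seq, reference_seq, max_gc_at_rate=0.1):
    """Keep a clone only if its GC->AT rate is at or below the (arbitrary) cutoff."""
    gc_at, _ = transition_counts(clone_seq, reference_seq)
    return gc_at / len(reference_seq) <= max_gc_at_rate

reference = "ACGTCCGGACGTCCGGACGT"
clones = {
    "clean": "ACGTCCGGACGTCAGGACGT",    # one C -> A change, not damage-like
    "damaged": "ATGTTCGAACGTTCGGACGT",  # four GC -> AT changes
}

kept = [name for name, seq in clones.items() if keep_clone(seq, reference)]
print(kept)
```

A real C -> T difference sitting in an otherwise clean clone passes straight through a filter like this, and a lightly damaged clone does too, which is why the approach reduces the problem without eliminating it.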

This might be easier if we knew we were looking for particular variants at certain genomic locations. For example, if the lab went looking for the FoxP2 gene, they could expect to find variation at the one or two amino acid changing substitutions that have occurred in humans compared to chimpanzees. The odds of diagenetic changes at these positions would be relatively low compared to the known odds of finding a genetic substitution there. But the metagenomic approach may not give the opportunity to focus in on changes that are known to be likely polymorphisms. We may have to just take what we can get.

In any event, it should be interesting to see these results come out. I am afraid that we will see phylograms showing the relationship of some Neandertals compared to other living human populations. That would be a mistake, since living people are not related as branches on a tree; and there is no necessary reason to suppose that Neandertals were either. But I guess that's my job to point out when the time comes.

References:

Noonan JP, et al. 2005. Genomic sequencing of Pleistocene cave bears. Science Express. doi:10.1126/science.1113485.
