genomics

LA Times: "UC Berkeley adjusts freshman orientation's gene-testing program."

Where "adjusts" means the state's Public Health Department blocked them from reporting any test results to individual students, so they took the 1000 saliva samples and made a big blind pool to have something to talk about during their orientation program.

The thing that bothers me now:

The state Senate Education Committee on Wednesday defeated a bill, sponsored by Assemblyman Chris Norby (R-Fullerton), that would have restricted UC's and Cal State's ability to seek students' DNA.

One wonders what else the bill contained. Will DNA samples be required for university admission in the future?

My previous skeptical entries: "Berkeley DNA comments", "Berkeley DNA tests revisited", "UC-Berkeley genetic tests for freshmen".

(via 80beats)

Daniel MacArthur is reporting on today's big showdown between Congress and genomics testing companies: "A sad day for personal genomics."

Several companies got hit with a sting by the Government Accountability Office; MacArthur has the audio with captions. My favorite is the agent who was posing as a girl trying to "surprise" her fiancé with a DNA test. A company rep who said they had ways to "pretty much repair DNA damage" is a snake who should be stomped on.

There's no denying it: that tape is pure gold for the critics of the DTC testing industry. In the first and third clips, a couple of poorly-trained call centre operators at otherwise reputable companies nonchalantly produce the stake that will now be driven into the heart of the DTC industry, over and over again.

Genotyping is cheap nowadays, but not cheap enough. Crooks see easy marks willing to shovel out $400 at a time; more honest companies can't turn away too many $400 prospects or else their IPO may tank. Call center employees work with a script, and somebody writes the script and tells them when it's OK to improvise.

UPDATE (2010-07-23): More from Dan Vorhaus: "From Gulf Oil to Snake Oil: Congress Takes Aim at DTC Genetic Testing." Also, a response from genomics tester 23andMe: "GAO Studies Science Non-Scientifically." Maybe it's a "war on science"...

Berkeley DNA tests revisited

I wrote about the UC Berkeley genetic testing of incoming freshmen earlier this spring. The summer is halfway over and the saliva kits have been sent. Now Scientific American has a long and balanced article on the contrasting approaches to genetic testing at Berkeley and an upper-level seminar at Stanford: "Exposing the Student Body: Stanford Joins U.C. Berkeley in Controversial Genetic Testing of Students".

This is an article worth reading by anyone interested in personalized genomics or bioethics. I wouldn't have expected that university classes would be such an early battleground for genetic information, privacy rights, and junk science. But nothing about either program is unprecedented. I wrote in 2005 about genetic testing associated with a course at Penn State. As I noted then, I have a lot of concerns about applying these genetic tests to students. They can have an educational effect, but not always a beneficial one.

The UC-Berkeley program actually provides vastly less information than the ancestry testing that has been applied to students in courses in the past. That's my main objection -- it's an awful lot of trouble for essentially no scientific value. I mean, they might as well just do blood types!

There's a lot in the article about the thinking of the main decision makers. I'll share these two paragraphs:

In fact, after Salari originally proposed the class last fall, a Stanford task force of about 30 basic scientists, clinical scientists, genetic professors, genetics counselors, bioethicists, legal counselors and students spent several months working through the various ethical issues and establishing safeguards to protect students. In contrast, the organizers of Berkeley's project incurred criticism because they spent hardly any time considering the potential reaction to their new orientation program.

Kimberly Tallbear, a professor of science, technology and environmental policy at Berkeley, explains that neither [Dean] Mark Schlissel nor any of the project's other organizers consulted with Berkeley's bioethics community. "Schlissel said several times they were surprised about the controversy," Tallbear says. "I said to him, 'Well doesn't that tell you that you needed input from us? Because we could have told you about the controversy and debate.'"

The article also discusses the "research study" aspect -- participants will be asked to sign an informed consent form and data will be kept. It may seem like the three genotypes provided to the students would not be very interesting as research topics. But it's not too hard to imagine psychology grad students in three years becoming very interested in research projects involving a high-risk population for binge drinking and known ALDH2 genotypes. Berkeley freshmen may be enrolling now in the first phase of a long-term research study on alcohol and sexual assault.

Sergey Brin and genetic research

While I was out of town, Wired ran a long article about Google cofounder Sergey Brin and his quest to find the genetic causes of Parkinson's disease. There is much of interest here. The piece gives an account of present-day genomic research from a unique point of view.

Brin is a smart person, with a family history of Parkinson's and knowledge that he carries a risk allele. So he is directing a lot of money and attention toward new ways of approaching gene-disease associations. He is one of the major financial backers of the direct-to-consumer genomics company 23andMe, and the husband of founder Anne Wojcicki. Google, of course, has prospered by making unconventional uses of data. That's an approach that many are starting to apply to science:

Increasingly, though, scientists—especially those with a background in computing and information theory—are starting to wonder if that model could be inverted. Why not start with tons of data, a deluge of information, and then wade in, searching for patterns and correlations?

This is what Jim Gray, the late Microsoft researcher and computer scientist, called the fourth paradigm of science, the inevitable evolution away from hypothesis and toward patterns. Gray predicted that an “exaflood” of data would overwhelm scientists in all disciplines, unless they reconceived their notion of the scientific process and applied massive computing tools to engage with the data. “The world of science has changed,” Gray said in a 2007 speech—from now on, the data would come first.

I think that "fourth paradigm" probably overdignifies the approach, which looks like a regression to a naive positivism. As described in the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, the idea is rather more -- a unification of theory and massive amounts of data. Data really do speak for themselves, say Fourth Paradigmers, but they speak quietly with a lot of noise drowning them out. So if you collect vast amounts of data, you have a chance to sort out the whispers of real associations from all the junk.

The article gives a vivid example:

Langston offers a case in point. Last October, the New England Journal of Medicine published the results of a massive worldwide study that explored a possible association between people with Gaucher’s disease—a genetic condition where too much fatty substances build up in the internal organs—and a risk for Parkinson’s. The study, run under the auspices of the National Institutes of Health, hewed to the highest standards and involved considerable resources and time. After years of work, it concluded that people with Parkinson’s were five times more likely to carry a Gaucher mutation.

Langston decided to see whether the 23andMe Research Initiative might be able to shed some insight on the correlation, so he rang up 23andMe’s Eriksson, and asked him to run a search. In a few minutes, Eriksson was able to identify 350 people who had the mutation responsible for Gaucher’s. A few clicks more and he was able to calculate that they were five times more likely to have Parkinson’s disease, a result practically identical to the NEJM study. All told, it took about 20 minutes. “It would’ve taken years to learn that in traditional epidemiology,” Langston says. “Even though we’re in the Wright brothers early days with this stuff, to get a result so strongly and so quickly is remarkable.”

But there are a few stumbling blocks. Unknown associations are relatively weak. Most of the phenotypes are polygenic. With heritabilities less than one, there remain unknown environmental causes of most phenotypes, which are not captured in genetic data and which may interact with different genes.

At present, I don't think anyone in genetics is really operating on a "Fourth Paradigm" level. The massive datasets are building, but few authors are working with existing population genetic theory in ways that would enhance the pattern-matching exercise. If you look through papers describing genome-wide association studies, there are a lot of bivariate statistics, and some multivariate descriptive statistics (like principal components analysis). About all the theory is case-control statistical design. Look through a paper on genetic variation and you're likely to see a STRUCTURE analysis and some coalescent simulations.

The article puts this well:

'We have no grand unified theory,' says Nicholas Eriksson, a 23andMe scientist. 'We have a lot of data.'

Genetic data today are a huge contrast from the past, both in sample size and in coverage. There is a lot of new low-hanging fruit. In the future, the easy stuff will be gone and theory will become more and more important. It remains unclear to me how much progress on health may be made by pattern-matching alone, and how much will require new theoretical advances. Given the problems explaining heritability so far, it may be that we'll need new theory sooner rather than later.

Genetic data are slowly being joined by environment data of various kinds. The article contextualizes the study of environmental variables by telling the story of the initial discovery and long use of aspirin. Only after the drug had been common in the population for a long time did researchers slowly come to realize that long-term use has health interactions of its own.

The second coming of aspirin is considered one of the triumphs of contemporary medical research. But to Brin, who spoke of the drug in a talk at the Parkinson’s Institute last August, the story offers a different sort of lesson—one drawn from that period after the drug was introduced but before the link to heart disease was established. During those decades, Brin notes, surely “many millions or hundreds of millions of people who took aspirin had a variety of subsequent health benefits.” But the association with aspirin was overlooked, because nobody was watching the patients. “All that data was lost,” Brin said.

The answer is simple: Collect all the data and see what percolates out of them. Heck, probably Google already has enough data about everybody based on their web searches, if they could just connect those to the 23andMe database. If you have a few years of web searches, I wonder just how much that tells you about a person's other phenotypes?

Remember (as the article points out), Google is the company that can predict flu outbreaks faster than the CDC.

Still, aside from the obvious technological progress, I'm a little more sober about the prospects of making rapid health improvements. Consider:

This approach—huge data sets and open questions—isn’t unknown in traditional epidemiology. Some of the greatest insights in medicine have emerged from enormous prospective projects like the Framingham Heart Study, which has followed 15,000 citizens of one Massachusetts town for more than 60 years, learning about everything from smoking risks to cholesterol to happiness. Since 1976, the Nurses Health Study has tracked more than 120,000 women, uncovering risks for cancer and heart disease. These studies were—and remain—rigorous, productive, fascinating, even lifesaving. They also take decades and demand hundreds of millions of dollars and hundreds of researchers. The 23andMe Parkinson’s community, by contrast, requires fewer resources and demands far less manpower. Yet it has the potential to yield just as much insight as a Framingham or a Nurses Health. It automates science, making it something that just … happens. To that end, later this month 23andMe will publish several new associations that arose out of their main database, which now includes 50,000 individuals, that hint at the power of this new scientific method.

Today's sequencing techniques make it much cheaper to do some things that used to be very expensive. But we've done a lot of gold-plated medical studies, and have more coming soon. The most important barrier to progress is not the lack of money; it is the difficulty of altering biological systems without adverse complications.

Gene number in humans the old-fashioned way

While doing some other research, I ran across a remarkable short paper by James Spuhler, "On the number of genes in man," printed in Science in 1948.

We've been hearing for the last ten years how the low gene count in humans -- only 20,000 or so genes -- is "surprising" to scientists who had previously imagined that humans would have many more genes than this.

So here's the next to the last line of Spuhler's article:

On the basis of these speculations there are then some 19,890-30,420 gene loci in man.

He actually estimated the total gene number in two ways. The first, based on estimates of chromosome length in Drosophila and humans, coupled with Bridges' estimate of fruit fly gene number (5000), led to an estimate of 42,000 genes in humans. This means of estimation was probably closer to those that later suggested a high gene number in humans.

Spuhler's second means of estimating gene number was a lot more interesting. He observed that among human pregnancies, more males than females are lost to miscarriages. Spuhler assumed that a high proportion of these fetal losses were caused by X-linked lethal mutations, and used that as a means of working out the total lethal mutation rate on the X chromosome.

Haldane had given an estimate of the mutation rate to X-linked hemophilia, based on its novel occurrence in pedigrees in the population. Taking this estimate for one locus, Spuhler could estimate the number of loci on the X. And then, the length of the X as measured cytologically could be used to estimate the total number of genes in the human genome.

His estimate on this basis, roughly between 20,000 and 30,000, is much like what we think today.
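Both of Spuhler's routes are simple chains of arithmetic, and a minimal sketch may make the logic easier to follow. The first function scales the fly gene count by a chromosome length ratio; the only numbers taken from the post are Bridges' 5000 fly genes and the 42,000 result, from which the implied ratio is backed out. The second function follows the X-linked route. Its function names are mine, and the three example inputs are placeholders chosen for illustration, not the values Spuhler or Haldane actually used.

```python
def genes_by_length_scaling(fly_genes, human_to_fly_length_ratio):
    """First route: scale the fly gene count by relative chromosome length."""
    return fly_genes * human_to_fly_length_ratio


def genes_by_x_linked_lethals(x_lethal_rate, per_locus_rate, x_fraction_of_genome):
    """Second route: X-linked lethal rate -> loci on the X -> whole genome.

    x_lethal_rate        total lethal mutation rate summed over the X
    per_locus_rate       mutation rate at a single locus (hemophilia in the post)
    x_fraction_of_genome share of total chromosome length that is X
    """
    if per_locus_rate <= 0 or not 0 < x_fraction_of_genome <= 1:
        raise ValueError("rates and fractions must be positive")
    loci_on_x = x_lethal_rate / per_locus_rate
    return loci_on_x / x_fraction_of_genome


# Figures from the post: 5000 fly genes scaled up to 42,000 human genes.
implied_ratio = 42_000 / 5_000
print(f"implied length ratio: {implied_ratio:.1f}")

# PLACEHOLDER inputs, for illustration only (not Spuhler's or Haldane's values).
example = genes_by_x_linked_lethals(
    x_lethal_rate=0.025,
    per_locus_rate=2e-5,
    x_fraction_of_genome=0.05,
)
print(f"example estimate: {example:,.0f} loci")
```

The second route multiplies three uncertain quantities together, so the final count is only as good as the weakest of them. That is why a correction to the assumed lethal rate moves the answer so much.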

On the other hand, Spuhler's numbers were imprecise. Later, Frota-Pessoa revisited Spuhler's estimate. Frota-Pessoa found the means of estimation very attractive because they did not rely on extrapolation from other animals. However, there are other causes of fetal loss than lethal mutations, and we must recognize that the conception ratio is not 50/50, so that the proportion of male to female fetal losses can't estimate the X-linked lethal rate without some correction. Frota-Pessoa arrived at an estimate of human gene number roughly a third of Spuhler's range: only 5900 to 11,700 genes.

That estimate also gives the lie to the idea that geneticists always expected a very high gene count in humans. What's remarkable to me is that the entire means of estimation required no knowledge of gene sequences or DNA; the estimates required only epidemiology coupled with cytological estimates of chromosome lengths.

References:

Frota-Pessoa O. 1961. On the number of gene loci and the total mutation rate in man. Am Naturalist 95:217-222.

Spuhler JN. 1948. On the number of genes in man. Science 108:279-280.

It's been a busy week for DNA news. In the DNA arrest database example, Congress seems to have no problem with more testing. In the case of personal genomics companies, Congress seems ready to move toward more government control of the industry -- announcing hearings in the wake of FDA inquiries into direct-to-consumer genomics testing.

Most of the best reporting is being done outside the mainstream press, particularly by Dan Vorhaus of Genomics Law Report ("Breaking: Congress to Investigate DTC Genetic Testing", "FDA Puts the Brakes on Pathway-Walgreens Pairing; What’s Next for DTC?", "Of Drugstores and Devices: Parsing the FDA’s Evolving DTC 'Policy'").

Some have expected government action in this area for several years now, so the current moves by Congress and the FDA are not surprising. But it's not clear why the Pathway Genomics Walgreens announcement set off their alarms. Here's Vorhaus:

So what is it, exactly, about the Pathway/Walgreens partnership that prompted the FDA to act so quickly and publicly? Would the FDA’s response have been different if Pathway had partnered with Wal-Mart? With Amazon.com? And if we get all the way to Amazon.com, how different is this from what Pathway was already doing: selling its test directly to consumers through a publicly accessible website?

GenomeWeb also has good coverage of the developing story: "First Walgreens, Now House Calls: The Increasingly Bizarre Predicament of DTC Genetic Testing". Daniel MacArthur has a summary post listing the developments and providing some commentary: "Where to next for personal genomics?"

I'm not sure which tags to apply to this story. I'm torn between "colossally-bad-ideas" and "university-auditions-for-big-brother".

Berkeley asks freshmen for DNA samples

Instead of the usual required summer-reading book, this year’s incoming freshmen at the University of California, Berkeley, will get something quite different: a cotton swab on which they can, if they choose, send in a DNA sample.

This is so unbelievable that I looked all over the web for news stories to confirm it isn't just a late April Fools. What conceivable educational value do they think is going to come out of this?

The university said it would analyze the samples, from inside students’ cheeks, for three genes that help regulate the ability to metabolize alcohol, lactose and folates.

Those genes were chosen not because they indicate serious health risks but because students with certain genetic markers may be able to lead healthier lives by drinking less, avoiding dairy products or eating more leafy green vegetables.

WTF?!

Hey, Berkeley! Great plan! I'm sure that your lactose intolerant students will be shocked to discover that they're lactose intolerant! OMG! That explains the milkshakes! Likewise, I'm sure that the health impacts of alcohol consumption will get your 18-year-old freshmen to booze less on the weekends! And that folate metabolism test, well, that will get them used to supplements, won't it?

I mean, seriously. Nutrigenomics is a legitimate field of investigation, but testing individuals for genes that relate to nutritional requirements has become the smelly armpit of "personalized genomics". Companies selling "personalized diet plans" or "nutritional supplements" based on supposed genetic testing have become a problem and subject of recurrent FTC investigations. There is no credible science that supports such supplements or plans, outside known nutritional deficiencies.

In fact, there is no credible science that supports the idea that knowing your lactase persistence genotype, alcohol metabolic genotypes, or "folate" metabolic genotypes will improve health.

This information is useless. It's a total waste of money. It gives a highly misleading picture of genetics.

The most probable outcome is to condition 18-year-olds to accept government-sponsored genotyping. So to make it complete, the program comes with a lack of adequate privacy safeguards. The proposal has students using "bar codes" to access their data on a public website.

Yeah, great! That's about as "anonymous" as your drink order at a coffee shop.

Using the Neandertal genome to uncover human evolutionary history

Before the Neandertal genome release last week, I was reading (thanks to a correspondent) an essay that James Noonan wrote for the current Genome Research. The piece, titled "Neanderthal genomics and the evolution of modern humans," is well worth reading. It's a snapshot of what we might reasonably have anticipated would come out of the efforts to sequence Neandertal genomes, without the punchline -- no recognition that we would ultimately turn out to have Neandertal genes.

It will take a while for paleoanthropologists to come to any kind of informed opinion about the importance of the current genome results. The quotes I've gathered from various newspaper sources include a pretty wide range of silly ideas. Maybe some of mine fall in that category. But generally I try to be informed by both archaeology and genetics, and I find that tends to avoid some of the silliest statements.

Note however, there is really no excuse at all for archaeologists saying silly things about the archaeological record.

Noonan's point of view is that of a mainstream geneticist, and is clearly stated. It represents a widespread school of thought about Neandertal genetics, but (understandably) is mostly uninformed by the archaeological record. For example,

The primary motivation behind generating a Neanderthal reference genome is to determine how distinct modern humans really are from all earlier versions of humanity. We are the only remaining human species, and thus we do not know if Neanderthals or our other extinct relatives shared our capacity for invention, abstract reasoning, or language. We have had to speculate on these matters based on the bones, the settlements, and the artifacts Neanderthals left behind. The question of modern human and Neanderthal biological similarity is particularly compelling given the recent common ancestry of both species: Based on both genomic and mitochondrial sequence comparisons, the lineages leading to modern humans and Neanderthals likely diverged in Africa ∼300,000–700,000 yr ago (Krings et al. 1997; Serre et al. 2004; Green et al. 2006, 2008; Noonan et al. 2006). This genetic evidence has become folded into a narrative of modern human and Neanderthal evolutionary history that continues to frame comparative studies of both species. In its simplest form, the modern human and Neanderthal lineages continued on parallel evolutionary tracks subsequent to their divergence, with the descendants of one branch migrating to Europe and giving rise to Neanderthals, and the other branch remaining in Africa and eventually producing us (White et al. 2003; Mellars 2004; Hublin 2009; Tattersall 2009). The modern human colonization of Europe ∼40,000 yr ago potentially brought both lineages back into widespread contact (Mellars 2004).

Given their very recent common ancestry, how much did the species have in common at this point? Were modern humans and Neanderthals capable of interbreeding, and, if so, did it happen to any appreciable extent? Or were the species so different that no meaningful exchange of information could occur?

Well, you know my answer to those questions.

I quoted this part because I think the earlier part of the passage deserves comment. Will the genetics tell us more about the cognitive relations of Neandertals and their contemporaries? Maybe eventually, but for the time being there is a tremendous void in our understanding of functional genetics. We really know nothing about the relationship of genetic variants to the "capacity for invention, abstract reasoning or language."

Compare the situation to "personalized genomics." If we sequence somebody's genome and find new variants, for the most part we have no way of predicting what they do. And even when a variant has functionally apparent properties -- for example, a stop codon -- there still may be no practical way to test the hypothesis that it influences a given phenotype.

The archaeological record is actually pertinent to cognition in a way that the genetic evidence isn't yet. That doesn't mean we have many answers -- we're still groping in the dark. But if I want to know about the evolution of human cognition, the archaeology is a much better place to start.

What we know about the archaeology seems very clear: Most of the things that later MSA Africans did, Neandertals also did. There were differences, which may have been important -- but those differences don't exceed the variation of material culture in later human populations.

That doesn't rule out that Neandertals may have been cognitively different from us in some important ways. But when we look at the complexity of the material record within Africa, I think it is fair to say that Neandertal behavior fits comfortably within the continuum represented by MSA people. "Behavioral modernity" is broadly shared, and doesn't clearly track lines of biological differences. Rachel Caspari and Sang-Hee Lee's work on mortality differences is another concrete illustration of the ways that material culture and behavior do not track with anatomy in these populations.

In the short term, the most important influence of understanding the Neandertal genome will be what it tells us about phylogenetics and demographic history. That is what got all the attention last week, and will continue to occupy many of us in the next few months.

Even though the news of interbreeding is fascinating, working out the phylogenetic relationships of Pleistocene humans is only a first step towards understanding their evolutionary history. Noonan focuses on strategies for uncovering which genetic changes were important to recent human and Neandertal phenotypic evolution. In this respect, the essay could serve as an introduction to the two papers released in Science last week. It explains a bit about why the Neandertal genome is useful for uncovering functional changes in the human genome, and what may prove useful to drive this inquiry further. For example, from near the end of the essay:

These studies illustrate a general strategy toward an understanding of biological differences between modern humans and Neanderthals, in which the first step is the reverse genetic analysis of genes and gene regulatory elements showing human-specific or Neanderthal-specific sequence changes. In this approach, changes in basic molecular functions, such as enhancer activity, protein-DNA interactions, or receptor-ligand binding affinity are identified in synthetic assays. The phenotypic consequences of these molecular changes can then be assessed in mouse models: A recent study describing the introduction of a "humanized" version of FOXP2 into the mouse genome by gene targeting is one early example (Enard et al. 2009). The data from such studies, combined with a growing body of information on human gene function, the effects of genetic variation on human phenotypes, and comprehensive efforts to functionally annotate the human genome, would provide the foundation for more sophisticated hypotheses concerning the biological similarity of modern humans and Neanderthals than can be generated from the paleoanthropological record alone.

Now, in light of last week's data release, we know some things about these general topics. The evolution of human-specific changes in conserved regions, for example, apparently mostly preceded the human-Neandertal common ancestor. There are few amino acid changes in recent (post-Neandertal) evolution that have become fixed worldwide -- the new studies counted only 88. There are only 212 estimated selective sweeps not present in the Neandertal genome.

Those are manageable numbers.

Of course, we shouldn't underestimate how hard it will be to untangle the interactions among these human-specific changes. It may require testing not each change one by one, but many possible combinations of the changes, since we don't necessarily know their order. And it is not only the fixed changes that are important to morphological and behavioral evolution; polymorphisms will also be important. Among those polymorphisms will be later, strongly selected changes that may substantially modify the "fixed" substitutions -- in a few cases, may even reverse them.

But this isn't a hopeless prospect anymore; it's a practical research program. The genetic changes that are nearly fixed in living people but absent in Neandertals represent one of the earliest -- possibly the first -- instances of geographic isolation and selection in Homo sapiens. They are one aspect of a pattern that has become increasingly important in later human populations, as the pace of adaptation has accelerated beyond the ability of gene flow to disperse adaptive alleles. Reconstructing this history will tell us about the shared evolutionary dynamics of humans and Neandertals, and the ecological particularities that may have made both populations phenotypically different.

References:

Noonan JP. 2010. Neanderthal genomics and the evolution of modern humans. Genome Res 20:547-553. doi:10.1101/gr.076000.108

So now you'll be able to buy a genetic test at your local drugstore:

[T]he plan being announced Tuesday by Pathway Genomics of San Diego to sell its Insight test at about 6,000 of Walgreens' 7,500 stores represents the boldest move yet to bring the power of modern molecular medicine to the mass market.

"It's the first widespread retail availability of genetic tests that are directed specifically at health issues," said Joan A. Scott, director of the Genetics and Public Policy Center at Johns Hopkins University.

The company is apparently marketing a range of different tests, priced at different levels. It's sort of ludicrous since, in those quantities for that number of markers, all of them could be done with a single chip for less than the price of the cheapest test. Well, I suppose I'm not a target of their marketing.

The article (in the Washington Post) indicates that the FDA is not cool with this idea, and is already "evaluating similar tests."

NEANDERTALS LIVE!

I, for one, welcome my Neandertal ancestry.

It may not sound like a lot -- between 1 and 4 percent. But that's the equivalent of one great-great-great grandparent's DNA contribution. In the case of the Neandertal contribution, more than 1500 generations ago, it's an enduring legacy of an ancient group of people, spread across many lines of the genealogies of living people. Beyond their genealogical interest, Neandertal genes might have made a big difference to our evolutionary potential.
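The great-great-great grandparent comparison can be checked with one line of arithmetic. Here is a minimal sketch, assuming the textbook expectation that each generation halves the autosomal share from any single ancestor; real inheritance varies around this because of recombination, and the function name is mine.

```python
def expected_autosomal_fraction(generations_back: int) -> float:
    """Expected share of autosomal DNA from one ancestor who sits
    `generations_back` generations up the pedigree (parent = 1)."""
    if generations_back < 0:
        raise ValueError("generations_back must be non-negative")
    return 0.5 ** generations_back


# parent = 1, grandparent = 2, great-grandparent = 3,
# great-great = 4, great-great-great = 5
share = expected_autosomal_fraction(5)
print(f"{share:.4%}")  # 3.1250%
```

One ancestor five generations back contributes 1/32 in expectation, which falls inside the 1 to 4 percent range.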

In case you wonder what the heck I'm talking about, here's the story: Two new papers in Science describe the full draft sequence of the Neandertal genome, and perform additional analyses to understand the pattern of adaptive evolution in the population ancestral to living people.

Richard Green and colleagues report on the genome, demonstrating very convincingly that present-day people have Neandertal ancestors. It is not entirely obvious when and where the gene flow between Neandertals and other ancient populations happened -- whether it was associated with the dispersal of most of our ancestry from Africa, or whether it may have been earlier. The gene flow was not limited to Europe, and evidence for Neandertal ancestry occurs in East Asian and Australasian populations.

The paper is full of other good stuff, including some evidence about which gene regions changed under selection in the ancestral human population.

Meanwhile, the second paper by Burbano and colleagues applies new microarray techniques to assess how much of the human legacy of amino acid changes has arisen in the latest, post-Neandertal period of our evolution.

So there's a lot about the pattern of evolution and gene flow leading to living people, and a lot about adaptive and functional evolution. That makes a lot for me to cover -- and while I have the papers a little early, time is short. Let's see how much I can help clarify what's in this new research.

If you had to sum up in a few words, what does this mean for paleoanthropology?

These scientists have given an immense gift to humanity.

I've been comparing it to the pictures of Earth that came back from Apollo 8. The Neandertal genome gives us a picture of ourselves, from the outside looking in. We can see, and now learn about, the essential genetic changes that make us human -- the things that made our emergence as a global species possible.

And in doing so, they've taken a forgotten group of people -- whom even most anthropologists had given up on -- and they've restored them to their rightful place in our heritage.

Beyond that, they've taken all of their data and deposited it in a public database, so that the rest of us can inspect them, replicate results, and learn new things from them. High school kids can download this stuff and do science fair projects on Neandertal genomics.

This is what anthropology ought to be.

What did they sequence?

The Max Planck group obtained most of their genomic sequence from three specimens from Vindija -- Vi33.16, Vi33.25, and Vi33.26. These are all postcranial fragments with minimal anatomical information. Green and colleagues were able to establish that the three bones represent different women, and that Vi33.16 and Vi33.26 may represent maternal relatives.

From these skeletons they got 5.3 billion bases of sequence. All this from an amount of bone powder about equal in mass to an aspirin pill.

Amazing. I mean, I know the folks at Max Planck are reading this. It's inspiring to see what they've been able to do. These are three pieces of barely diagnostic hominin bone, and they've obtained literally hundreds of times more information than we have ever gotten from the fossil record of Neandertals.

I'll describe the analyses of genetic similarity with humans in more detail below. As a brief summary, of those positions where the human genome differs from chimpanzees, Neandertals have the chimpanzee version around 12.7 percent of the time -- meaning that across the genome, a Neandertal and a human will share a genetic ancestor an average of around 800,000 years ago. This is a couple hundred thousand years higher than the same number if we compare two humans to each other. The higher age of genetic common ancestors reflects partial isolation between the Neandertal population and the African populations that gave rise to most of our current genetic variation.
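The step from 12.7 percent to a date is proportional scaling against the human-chimpanzee divergence time. Here is a rough sketch, assuming a divergence of 6.5 million years. That figure is my assumption for illustration, and the result scales linearly with whatever date you prefer.

```python
def divergence_time_years(fraction_ancestral: float,
                          human_chimp_divergence_years: float) -> float:
    """Scale the human-chimpanzee divergence time by the fraction of
    human-lineage changes where the Neandertal keeps the chimp version."""
    return fraction_ancestral * human_chimp_divergence_years


ASSUMED_HUMAN_CHIMP_YEARS = 6.5e6  # assumption for illustration only
estimate = divergence_time_years(0.127, ASSUMED_HUMAN_CHIMP_YEARS)
print(f"{estimate:,.0f} years")
```

With that assumed divergence date, the estimate lands in the neighborhood of 800,000 years.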

The team were able to identify 111 candidate duplications, almost all of which have some evidence of copy number variation in humans or other primates. They tentatively show that Neandertals have a bit more copy number variation than present-day humans, and identify a few loci with substantially higher copy numbers in one group or the other.

A substantial part of the paper is dedicated to finding evidence of positive selection on the human lineage after the emergence of Neandertals. The idea is to look for fixed selective sweeps -- regions where humans are likely to have SNPs absent in Neandertals and a relatively shallow gene tree. They identify 212 regions like this -- as I discuss below, a surprisingly low number.

The second paper, by Hernán Burbano and colleagues, describes the application of a targeted microarray to probe Neandertal genetic samples for protein-coding variants that separate humans from chimpanzees. They identify 88 amino acid substitutions that seem fixed in the known sample of living humans, but not present in the Neandertal sequence. Those 88 are not necessarily all functionally important, although this list will include a number of "structural" genetic changes that make a difference to proteins expressed worldwide today. There is much to come in analyzing the categories and genes represented in both lists, which may tell us very interesting things about our Late Pleistocene evolution.

What is the evidence for interbreeding?

From their initial work sequencing the nuclear genome in Neandertals, the Max Planck group has followed a clever strategy: Don't look at the Neandertal sequence to see what humans share, look at human variation to see which version the Neandertal sequence has.

The strategy is smart because it helps to obviate some major problems with ancient DNA -- you don't have all the parts, and the parts you do have probably contain a lot of sequencing errors of various kinds. By looking first at sites that vary within humans (or, in some comparisons, between humans and chimpanzees), we can focus on a very simple question -- did the Neandertal have one version, or the other?

Applied to human variation today, there are several ways we might use a Neandertal genome to test the hypothesis of no interbreeding. Green and colleagues focus on two complementary approaches.

1. If Neandertals contributed no genes to living populations, then they should be equally related to all living people, no matter where in the world those people live.

Green and colleagues show that the Neandertal genome is closer to some humans than others. People whose ancestry lies outside Africa are significantly more like Neandertals than are people who live in Africa today. In this study, the authors include whole genomes from people in France, China and Papua New Guinea outside Africa, and Yoruba and San inside Africa. The Africans are not as close to the Neandertal as any of the non-Africans.

That doesn't mean that non-Africans derive most of their genes from Neandertals -- in fact, as I describe below, the proportion is quite small. Living people are more like each other -- even non-Africans and Africans -- than any of them are like Neandertals.

The point is that despite this great similarity of living people, we have genetic variants that we share with the Neandertal genome, and that proportion is a lot higher outside Africa than inside it. The natural conclusion is the Neandertals contributed more genes to non-Africans than to Africans.

One thing is for sure: You can't explain this observation under the hypothesis that a small, African population expanded out of Africa without interbreeding with Neandertals along the way.
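One common way to formalize this kind of comparison is an ABBA-BABA count, usually summarized as a D statistic. The sketch below is a toy version under my own assumptions: each site is a tuple of alleles (human 1, human 2, Neandertal, chimpanzee), the chimpanzee allele is treated as ancestral, and the example sites are made up purely to show the mechanics. They are not real data.

```python
from typing import Iterable, Tuple

Site = Tuple[str, str, str, str]  # (human1, human2, neandertal, chimp)


def d_statistic(sites: Iterable[Site]) -> float:
    """D = (ABBA - BABA) / (ABBA + BABA).

    A site counts only if the Neandertal allele differs from chimp
    (derived) and exactly one of the two humans carries that derived
    allele while the other carries the chimp allele. Every other
    pattern is ignored.
    D > 0: human2 shares more derived alleles with the Neandertal.
    D near 0: no excess in either direction.
    """
    abba = baba = 0
    for h1, h2, nean, chimp in sites:
        if nean == chimp or h1 == h2:
            continue  # uninformative site
        if h2 == nean and h1 == chimp:
            abba += 1
        elif h1 == nean and h2 == chimp:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Made-up toy sites: human1 stands in for an African genome,
# human2 for a non-African genome.
toy_sites = [
    ("A", "G", "G", "A"),  # ABBA: human2 matches the Neandertal
    ("A", "G", "G", "A"),  # ABBA
    ("C", "T", "T", "C"),  # ABBA
    ("G", "A", "G", "A"),  # BABA: human1 matches the Neandertal
    ("A", "A", "G", "A"),  # humans agree: uninformative
]
print(d_statistic(toy_sites))
```

If Neandertals contributed nothing to living people, D should hover around zero for any pair of humans. A consistently positive value whenever human2 is the non-African genome is the pattern described above.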

2. Look at the genes most likely to represent ancient population structure, the ones with deep roots outside Africa.

This is an idea that we came up with to look for genes in living humans that might have come in from Neandertals or other ancient populations (for example, we described it in our 2008 review). Look for the parts of the genome with the deepest genealogical roots outside of Africa. Those are candidates for Neandertal gene flow -- a high chance that one of the two sides of that deep root was present outside of Africa for hundreds of thousands of years.

Green and colleagues took this idea to the next level. They found parts of the genome where non-Africans have a deep root and Africans don't. Then they looked at the Neandertal sequence. Out of the 12 regions they identified with deep roots outside Africa, they found that the Neandertals had the deep, non-African specific version in 10 of those.

I mean, there's really not any other way you can explain this. We got those genes from Neandertals. Every one of those loci is a region where some people have a Neandertal-derived allele, and others don't. Those particular 10 loci are a small fraction of the overall Neandertal-derived element of our heritage -- because they used Perlegen SNPs to find them, they ended up with regions that are fairly long (100 kb or more in length). Those are probably all really interesting, but there will be more of them when we can reliably identify smaller segments with deep genealogies.

Could the results have been caused by contamination?

Green and colleagues are utterly convincing about the level of contamination in their sequence. They have employed several independent checks, all of which arrive at the same conclusion: The modern human contamination in almost all their comparisons is limited to significantly less than one percent -- and for autosomal sequence they can give a tight estimate of 0.7 percent contaminating sequence.

The methods that Green and colleagues used to test for a Neandertal contribution to non-African populations are not likely to be strongly influenced by contamination. The probe for deep roots in particular is extremely unlikely to be influenced by contamination in the Neandertal sequence.

The very low contamination rate, and methods that should be robust to some contamination, mean that we can be very confident in their result.

How much Neandertal ancestry do we have?

The Neandertal contribution does not make up a major proportion of any population, even outside of Africa. Green and colleagues apply a population model that involves isolation between ancestral Neandertal and African populations, a dispersal from Africa into Eurasia, and subsequent mixture with the Neandertals. Under this model, the estimated fraction of Neandertal ancestry for non-African populations today is between 1 and 4 percent.

Now, let's put on our skeptics' hats. Is this the right model?

If Neandertal and African populations had not been isolated, then the amount of mixture after an out-of-Africa dispersal would be lower. On the other hand, the dispersing African population would already be part Neandertal, because of genetic mixture. The proportion of ancestry from ancestral Neandertals would be around the same amount, it would just be distributed across a longer time.

They did not examine the question of how much of the genome came in from Neandertals because of selection. The estimate they have, between 1 and 4 percent, is so high that this is not just a few genes introgressing in from Neandertals -- it is a big fraction of the neutral, non-coding part of the genome. So selection doesn't explain the similarity, nor can parallelism -- the similarity is genome-wide, not just coding or functional changes, and not as far as we know clustered into regions that might have hitchhiked with adaptive alleles.

But there's clearly a lot more to do, characterizing the functional implications of some regions, testing for selection, and finding Neandertal variants that might have reached very high frequencies in later populations. To the extent that selection has influenced the pattern, it will also throw off the simple population model. But it doesn't throw off the fraction of Neandertal ancestry -- if it's three percent, it doesn't matter whether it was selected or neutral, it's still three percent.

So the bottom line is, the fraction is going to be about right, regardless of the mechanism by which the genetic mixture happened.

Can we please take off our skeptics' hats? They're getting in the way of my Neandertal victory dance.

No. All the cool paleoanthropologists wear hats.

What about population structure within Africa? Could that explain the apparent Neandertal contribution?

We've known about the occasional deep-rooted genealogies outside Africa for a long time (and Jeff Wall's work, as an example among others, has explained that pattern as archaic human mixture into non-Africans). They've been talking about something like five percent of the human genome coming from admixture with ancient groups outside of Africa. So this shouldn't come as a shock.

Until now, though, it has been possible for some people to wave these results away. We didn't really know that any of those deep roots were in archaic humans, and after all, who's to say that they aren't variants that originated in Africa and have since been lost there, or that we haven't found them yet? African variation is great, and if you imagine that some variation might have once existed in northeastern Africa and was subsequently lost within African populations, that might look like admixture with archaic humans outside of Africa.

This line of argument is now special pleading. Why would we posit a cryptic mystery population in Africa, which happens to look genetically identical to Neandertals, but has subsequently disappeared? A big fraction of deep genealogies outside Africa really are in Neandertals. By far the simplest explanation is that today's non-Africans got them from ancient non-Africans. This is no surprise -- that's where the data have been pointing now for five years.

Yet Africans are a lot more diverse than other populations, and this diversity itself does reflect the dynamics of the ancient African population. The Neandertals aren't so different from that pattern that now still exists within Africa -- they're extending the notion that "modern" is something that's been evolving for a long time. I expect we'll be able to come to a better understanding of ancient population interactions within Africa, by understanding the parts of the genome that have come from Neandertals outside of Africa.

Could the gene flow be due to ancient interactions between West Asia and Africa?

Green and colleagues suggest that at most a few genes from modern humans ended up in Neandertals.

That is, although they find lots of evidence of old-looking genes in us that are shared with the Neandertal genome, they find few cases of new-looking genes in us that are shared with that genome.

That might suggest several things about interactions between Africa and West Asia and Europe during the Middle to Late Pleistocene. For example, if there had been high gene flow from Africa into West Asia after the first appearance of a distinct Neandertal population, maybe 200,000 to 400,000 years ago, we might expect to find some new-looking genes in humans that Neandertals also got.

On the other hand, the data are from European Neandertals, who are at the end of a fairly long chain of populations from Northeast Africa. If gene flow had been ongoing into the Levant or further into West Asia during the last 200,000 years, it's not obvious how many of these genes would have made it into Europe. The rapid mitochondrial DNA coalescence of Neandertals does suggest substantial mobility in the population across Central Asia to Western Europe. But maybe that apparent dynamism had a boost from mtDNA selection.

So just on the data, I don't think we know yet whether this is gene flow in the Levant 200,000 or 100,000 years ago, or whether it's genes coming from West Asian Neandertals into dispersing Africans after 100,000 years ago. I expect all are likely. I have some ideas how to test some of these things, and we will get started immediately.

The lack of apparent mixture of "modern" genes into Neandertals -- what does it mean?

It means that a model of one-way gene flow from Neandertals into us can explain the pattern of genetic similarity.

The authors explain this as a function of population expansion. The expanding population (us) picks up some Neandertal genes that expand in numbers, while the contracting population (Neandertals) doesn't have a chance to pick up as many genes because it is declining in numbers. That model seems plausible, particularly in comparison with historical cases of population contact.

On the other hand, the three Neandertals from which most of the genome sequence was derived all date to before 40,000 years ago. There weren't any modern humans around for them to have interacted with around Vindija at that time. So should we be surprised that they don't have genes of modern humans?

A more interesting question was posed to me by a very sharp journalist: What would we expect the result to have been if they had sequenced a Near Eastern Neandertal, like Amud, for example?

The answer seems obvious -- the admixture fraction should have been higher. That population, which is the most likely to have been the source of mixture, must have been somewhat genetically different from the European Neandertals. Any extent of genetic differentiation between them would make the European Neandertals look less like non-Africans today than the Near Eastern ones.

I'll have more to say about these Near Eastern Neandertals in the next few days.

But wait a minute. I thought the mitochondrial DNA proved that Neandertals are extinct!

Selection. Selection. Selection.

I've been saying it for years. I've published it. Will you learn to listen to me, already?

The mtDNA of Neandertals is gone because it conferred some disadvantage. There are many reasons to suspect this -- the Neandertal variation is itself apparently recently derived; the human variation is clearly in disequilibrium, especially outside Africa; the mtDNA genes affect functions that differ greatly in Neandertal and recent populations, including energetics, longevity, and brain; there are clear signs of mtDNA selection in many recent human populations.

Mitochondrial DNA is useful for a lot of reasons, but nobody should ever have relied on it alone as evidence of Neandertal population dynamics.

Is it really true that there is no variation in Neandertal ancestry outside Africa?

The comparisons in the paper are highly convincing because of the sheer amount of sequence taken from the sampled individuals. A single gene locus from an individual may be unrepresentative of the person's population, but averaged across the whole genome, the difference between two people from distant populations is very, very close to the difference between the two populations.

But they sampled very few individuals. So we are left with a question -- do we really know we've sampled variation outside Africa enough to make regional estimates of Neandertal gene flow?

I think we could do better with more genomes. For example, when it comes to finding deep genealogies, we need to be able to find shorter regions than the ones used by Green and colleagues. That will expand the sample of candidate loci, and will catch some Neandertal-derived genes that we're missing now. Moreover, if gene flow was really around 1-4 percent, many SNPs that came in from Neandertals will be rare enough to be missing from the big SNP genotyping samples. We may find some variants with whole-genome sequencing on larger samples that will be worth examining.

But most important, we'll be able to develop strategies based on this success to find ancient population structure involving groups where we don't yet have the DNA -- like populations of South and East Asia. Some of those may give us the chance to test those methods soon, as for the Denisova individual.

Is this multiregional evolution, or just out-of-Africa with some leakage of earlier Eurasian genes?

Out-of-Africa movement was a major mechanism of recent human evolution. The genetic ancestry of living people is multiregional.

I see no contradiction between those statements. From now on, we are all multiregionalists trying to explain the out-of-Africa pattern.

There was clearly a dispersal of African genes into the rest of the world during the Late Pleistocene, sometime between 50,000 and 100,000 years ago. Living people everywhere on Earth derive more than 90 percent of their genes from African populations who lived 100,000 years ago. That much is plain.

(Why did I not write "more than 96 percent?" See below.)

These genetic observations require some kind of out-of-Africa event. This event was not limited to a few genes, and selection of a few genes even with substantial hitchhiking of surrounding genome cannot account for the pattern. There must have been some kind of demographic expansion including African-derived populations and preferentially excluding the genes of Eurasian populations like the Neandertals. Selection on a gene network might have mediated the expansion, as suggested by Eswaran (2002). Or the expansion might have been culturally or technologically mediated, as many other people have suggested.

Those are hypotheses about mechanisms. How did it come to be that living people trace the overwhelming majority of their ancestry to Africa within the last 100,000 years? These explanations may answer that question.

The present study shows that Neandertals were at a minimum partially isolated from their contemporaries in Africa, and that the genetic divergence between those populations was larger than the genetic differences between European, Asian, and African populations today.

Yet those Neandertals are among our ancestors. Late Pleistocene humans had multiregional origins, and the evolution of the Neandertals was itself a case of relatively recent population dispersal from Africa or West Asia. Human and Neandertal genes mostly derive from common genetic ancestors between 400,000 and a million years ago -- much, much later than the initial habitation of Eurasia 1.8 million years ago.

But 1-4 percent is so minor, can it be an important part of our evolution?

There are three things you have to ask about the fraction of Neandertal ancestry.

1. How much gene flow would it take to guarantee that anything adaptive in the Neandertal population survived into later people?

The answer to that question is simple -- it takes a few dozen matings to get most adaptive genes into our population. If there was a lot of interference with the genetic background, it might take more -- just to make sure that the advantageous alleles had a chance to be de-linked from the genetic background.

If Neandertals are one percent of the ancestry of non-Africans, we can be very sure that any gene in a Neandertal that had adaptive value in the later population is here now. That means they were important in an evolutionary sense.

2. What fraction of the human population 50,000 years ago were Neandertals?

This is very important -- when it comes to neutral genetic loci, the essential question is how much the Neandertals may be underrepresented today relative to their numbers in the past. Is three percent too low? It seems very unlikely that the fraction of Neandertals compared to the rest of humans was as high as 10 percent -- we know that Africa already had a large population 50,000 years ago, and everything we know about Neandertals suggests a very low population density, an effective size much smaller than 10,000 individuals. Were five percent of the people on Earth 50,000 years ago Neandertals?

We don't really know the answers, but now we have a chance to test hypotheses about ancient population size and expansion in Neandertals. My point at the moment is only this: If today Neandertal genes make up only one percent of the gene pool of the 5 billion people outside Africa, that's the genetic equivalent of 50 million Neandertals.

In relative terms, their contribution to our population may be a reduction from their fraction of the Late Pleistocene population. Not that great a reduction, not a massive crash to zero. A reduction in the wake of the out-of-Africa movement, possibly from five percent to three.

3. How many Neandertals are living today?

You might think the answer to this is obviously zero. But in genetic terms, we can ask, how many times has the average Neandertal-derived gene been replicated in our present gene pool? Those aren't Neandertal individuals -- that is, a forensic anthropologist wouldn't classify them as Neandertals. They're the genetic equivalent.

The answer to this is also simple: In absolute terms, the Neandertals are here around us, yawping from the rooftops.

There are more than five billion people living outside of Africa today. If they are one percent Neandertal, that's the genetic equivalent of fifty million Neandertals walking the Earth around us.

Does that sound minor? If I told you that your average gene would be replicated into fifty million copies in the future, would you be satisfied? Maybe your ambition is greater, but I think the Neandertals have done very well for themselves.
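The arithmetic behind the fifty million figure, as a quick sketch. The population is the round number used above, and the two fractions are the ends of the 1 to 4 percent estimate.

```python
def genetic_equivalent(population: int, ancestry_fraction: float) -> float:
    """Number of whole genomes' worth of ancestry that a given
    fraction represents when summed across a population."""
    return population * ancestry_fraction


NON_AFRICAN_POPULATION = 5_000_000_000  # round figure from the text

for fraction in (0.01, 0.04):
    equivalent = genetic_equivalent(NON_AFRICAN_POPULATION, fraction)
    print(f"{fraction:.0%} -> {equivalent / 1e6:.0f} million")
```

At the top of the estimated range, the same calculation gives four times as many.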

Does this mean that Neandertals belong in our species, Homo sapiens?

Yes.

Interbreeding with fertile offspring in nature. That's the biological species concept.

Now, some paleontologists might still disagree -- maintaining that species are units that can be distinguished morphologically, or by one or more derived features, or any number of other definitions. That's fine with me, as long as they're clear. But understand: It does define all non-Africans today as an interspecific hybrid population.

So maybe they want to rethink that one?

If Eurasians got less than 4 percent from Neandertals, doesn't that mean that they got more than 96 percent from Africa?

I look at the 1-4 percent estimate as a minimum, for several reasons. As I'll note below, this estimate mainly refers to the excess Neandertal ancestry outside Africa, which means there may be some additional amount that both recent African and non-African populations share.

But more important, Neandertals weren't the only people living in Eurasia 100,000 years ago. China didn't have Neandertals, nor did Southeast Asia and Java. India was full of hominins, which might or might not have shared substantial genetic similarity with Neandertals. They're close enough to the known Neandertal range to speculate that they may have been close, but the only available fossil, the Middle Pleistocene Narmada skull, is not very informative. Any of these populations might have been genetically different from Neandertals, and might have also contributed genes to present-day human populations -- genes that wouldn't show up by scanning the Neandertal genome.

The recent genetic sequencing of the Denisova pinky (a.k.a. the X-woman) from the Altai Mountains reminds us that these populations outside of Africa may have been quite a bit closer to us, genetically, than we might have expected from the 1.8-million-year record of humans outside Africa. These populations were dynamic in ways that many paleoanthropologists haven't yet appreciated.

Do living Africans have Neandertal ancestry, too?

I think that the present study doesn't have the power to answer this question, at least with the design that the authors used. The fact that living Africans are less genetically similar to the Neandertals is extremely important evidence of the Neandertals' genetic contribution to populations outside Africa. But it doesn't bear on how much back-migration into Africa may have happened.

We know that the answer is nonzero, because Africa has received immigrants from other parts of the world during historic times. The same genetic patterns that reflect population contacts up and down the East African coast, and across the Sahara into West Africa, show the possible conduits for the flow of Neandertal-derived genes into African populations.

But how much genetic dispersal into Africa happened in LSA or late MSA times? Mitochondrial and Y chromosome distributions in Northeast Africa suggest there has been some. Nevertheless, Africa would have been a very difficult place to return to, for humans who had begun adapting to different ecological and disease environments.

I think that some Neandertal genes might have made it back into Africa, even in ancient times, but I wouldn't be surprised if that number was small.

The big shoe left to drop is the extent of population differentiation within Africa during MSA times. So far we've seen hints that these populations might have been nearly as differentiated from each other as they were from Neandertals, with substantial gene flow homogenizing them in the last 30,000 years. This paper includes an additional Bushman genome, after the four published earlier this year. Comparing that new genome to the Neandertals, its modal difference from the human reference (Hg18) genome is between the other humans and the Neandertal. Not quite halfway between, but nearly so. There's a lot of genomic variation within Africa, and exploring the population history that explains that variation may turn up some surprises.

What about recent selection?

One of the really exciting aspects of this work is that both Green and colleagues and Burbano and colleagues look for things that all humans today share but Neandertals lack.

You might call these "the genes that make us modern," although functionally we have little idea what any of them do.

Both papers show one thing that is extremely interesting: There aren't very many such genetic changes.

Burbano and colleagues put together a microarray including all the amino acid changes inferred to have happened on the human lineage. They used this to genotype the Neandertal DNA, and show that out of more than 10,000 amino acid changes that happened in human evolution, only 88 of them are shared by humans today but not present in the Neandertals.

That's amazingly few.

Green and colleagues did a similar exercise, except they went looking for "selective sweeps" in the ancestors of today's humans. These are regions of the genome that have an unusually low amount of incomplete lineage sorting with Neandertals, and therefore represent shallow genealogies for all living people. They identify 212 regions that seem to be new selected genes present in humans and not in Neandertals. This number is probably fairly close to the real number of selected changes in the ancestry of modern humans, because it includes non-coding changes that might have been selected.

Again, that's really a small number. We have roughly 200,000-300,000 years for these to have occurred on the human lineage -- after the inferred population divergence with Neandertals, but early enough that one of these selected genes could reach fixation in the expanding and dispersing human population. That makes roughly one selected substitution per 1000 years.

Which is more or less the rate that we infer by comparing humans and chimpanzees. What this means is simple: The origin of modern humans was nothing special, in adaptive terms. To the extent that we can see adaptive genetic changes, they happened at the basic long-term rate that they happened during the rest of our evolution.
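The one-per-1000-years figure is just the time window divided by the count. Here is a minimal sketch using the 212 candidate regions and the 200,000 to 300,000 year window. Treating each region as exactly one selected substitution is a simplification on my part.

```python
def years_per_substitution(n_substitutions: int, window_years: float) -> float:
    """Average waiting time between selected substitutions."""
    return window_years / n_substitutions


N_REGIONS = 212  # candidate sweep regions from Green and colleagues

for window in (200_000, 300_000):
    wait = years_per_substitution(N_REGIONS, window)
    print(f"{window:,} years: one per {wait:,.0f} years")
```

Both ends of the window give a waiting time on the order of a thousand years, which is all the argument needs.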

Now from my perspective, this means something even more interesting. In our earlier work, we inferred a recent acceleration of human evolution from living human populations. That is a measure of the number of new selected mutations that have arisen very recently, within the last 40,000 years. And most of those happened within the past 10,000 years.

In that short time period, more than a couple thousand selected changes arose in the different human populations we surveyed. We demonstrated that this was a genuine acceleration, because it is much higher than the rate that could have occurred across human evolution, from the human-chimpanzee ancestor.

What we now know is that this is a genuine acceleration compared to the evolution of modern humans, within the last couple hundred thousand years.

Our recent evolution, after the dispersal of human populations across the world, was much faster than the evolution of Late Pleistocene populations. In adaptive terms, it is really true -- we're more different from early "modern" humans today, than they were from Neandertals. Possibly many times more different.

More?

That's what I have time for now, if I want to get this posted. There is much, much more to say on the topic, and you can bet it will be all Neandertals all the time here for the foreseeable future.

References:

Green RE and many others. 2010. A draft sequence of the Neandertal genome. Science (in press) doi:10.1126/science.1188021

Burbano HA and many others. 2010. Targeted investigation of the Neandertal genome by array-based sequence capture. Science (in press) doi:10.1126/science.1188046

Carl Zimmer describes his experience as a master of ceremonies (with Robert Krulwich) at the Genomes, Environments, Traits conference ("A day among the genomes"). The conference, organized by George Church, got together on one stage almost everyone who has publicly made known their whole genome.

David Dobbs was in the audience and describes the show: "Genomes, cool conferences, and what the hell to tell people about behavioral genes". He also describes some of the backchannel talk that focused on the more concrete element of trying to predict things from genomes -- including behavioral variation:

As I'm quite interested in [behavior and mood], I couldn't help but notice that they didn't come up a lot in the formal discussions. But when I talked to people on the side, including some of those who had their genomes run, they usually confirmed my impression that people take a particularly keen interest in genes related to things like mental health or behavior -- depression, bipolar, hyperactivity, aggression. "Oh God yes," one person told me. "Unless you're really worried about cancer or something, that's the first thing people look at. 'Do I have the crazy gene?'" Yet by my read, neither the industry nor the research community quite knows what to tell people to do with that information -- even as we move closer to making it cheaply available.

Daniel MacArthur writes a thoughtful summary of a new study of the DNA of Stephen Quake: "What can you learn from a whole genome sequence?"

That means that the real benefit of whole-genome sequencing over other assays - the uncovering of truly novel or rare genetic variants - has much less of an impact than it should, because in most cases it's impossible to assign function to such variants. Indeed, it's striking in this study that the really compelling, actionable findings - the increased risk of myocardial infarction and metabolic diseases, and the drug metabolism effects - come largely from common variants, most of which would be captured by chip-based assays such as that used by 23andMe.

Courtesy of Jon Cohen in Science ("The Chimpanzee Genome Project's Seedy Origins"), a detail that I hadn't heard before:

To begin, [Pieter] de Jong asked Yerkes for a sample of chimp sperm, and researchers there chose Clint—not because he was a hardy male representative of Pan troglodytes or had some other meaningful attribute. Clint, it turns out, became the genome chimp because he was particularly fond of providing sperm samples.

Apparently it all started with Evan Eichler, who needed to make a bacterial artificial chromosome with chimpanzee X chromosome sequence.

NIH genetic test registry

The National Institutes of Health directorate this week announced the creation of a new database for tracking and providing public information about commercial gene tests:

The National Institutes of Health announced today that it is creating a public database that researchers, consumers, health care providers, and others can search for information submitted voluntarily by genetic test providers. The Genetic Testing Registry (GTR) aims to enhance access to information about the availability, validity, and usefulness of genetic tests.

Currently, more than 1,600 genetic tests are available to patients and consumers, but there is no single public resource that provides detailed information about them. GTR is intended to fill that gap.

It is hard to tell much from the press release, but I think it foreshadows two significant aspects of the registry. First, the NIH seems to be entering the realm of quality control:

GTR genetic test data will be integrated with information in other NIH/NCBI genetic, scientific, and medical databases to facilitate the research process. This integration will allow scientists to make, more easily and effectively, the kinds of connections that ultimately lead to discoveries and scientific advances.

This would enable NIH to provide an independent summary of whether test markers correspond to clinical studies. Second, the registry seems to be encouraging active engagement of companies in the process:

During the development process, NIH will engage with stakeholders — such as genetic test developers, test kit manufacturers, health care providers, patients, and researchers — for their insights on the best way to collect and display test information. In addition, other federal agencies, including the Food and Drug Administration and the Centers for Medicare and Medicaid Services, will be consulted.

I'm not sure what this means for the possible regulation of tests in the future. The engagement of the FDA at this point may presage greater involvement of the agency in genomic testing. The involvement of Medicare in the database seems more important, as the federal government will likely become the largest purchaser of genetic testing in the near future.

In relation to the Medicare/Medicaid involvement, I discussed candidate Obama's record on gene testing in 2008 ("Good only for entertainment value... and, of course, the government"). At that time, the main concern was standardization of records:

Providing diagnostic value for SNP screens or genome sequences will take a massive effort at standardizing information about joint gene-phenotype associations. Direct-to-consumer gene testing companies presently differentiate themselves based on the different information they provide to their customers. That approach works as long as there is little of value in the results -- the companies today are succeeding or failing on the basis of the communities of customers they are building, with the stories of customers providing the best advertisements. That's the nutrition supplement market.

But that approach will start to fail if genetic tests start to allow serious risk mitigation in health maintenance. If two companies provide divergent information to customers, in a way that impacts the customers' interactions with their physicians, I expect that the outcome will be some massive lawsuits and further federal regulation. If the government becomes the health care purchaser -- and with Medicare it already is the largest -- we can expect to see early federal intervention in this market, focused upon standardizing genetic information provided to physicians.

The creation of an NIH registry may reflect growing surveillance of the different interpretive results from these tests, with an eye toward future government purchasing protocols.

The new registry announcement is discussed in more detail by Dan Vorhaus at Genomics Law Report: "Evaluating the NIH’s New Genetic Testing Registry." He gives some background relating to the 2008 report “U.S. System of Oversight of Genetic Testing” (PDF), commissioned by the Secretary’s Advisory Committee on Genetics, Health, and Society (SACGHS) during the Bush Administration. As Vorhaus points out:

Although the SACGHS report acknowledged that “short-term voluntary approaches” to test registration might be appropriate, it also clearly indicated that the fundamental objective was the creation of a permanent and mandatory test registry.

This important distinction has not been lost on others. In a press release celebrating the GTR, the advocacy group Genetic Alliance (whose founder, Sharon Terry, has been one of the most outspoken advocates for a mandatory registry, including making the case several months ago in this very space) applauded the NIH’s announcement while simultaneously looking forward “to the registry becoming mandatory so that we are all apprised of the quality and availability of genetic testing across the nation.” (links in original)

Probably the most important element is the involvement (for now, at the level of consultation) of Medicare. Companies that want a piece of that market will be more or less compelled to join the registry:

Depending on the degree to which purchasers of genetic tests come to rely on the GTR, inclusion in the GTR may well become a de facto requirement for any commercial genetic test provider, even if it is not converted into a legal requirement.

In the place of "purchasers" read "Medicare". And of course insurance companies have similar incentives to require tests that participate in the registry.

(via Collective Imagination blog, and Genetic Future)

A low human mutation rate may throw everything out of whack

Last week, a paper looking for the genetic causes of Miller syndrome reported the whole genomes of four members of a single family: two siblings with the disorder and their two parents without. The idea was that they would simply compare the affected and unaffected genomes. They would then find candidate loci that might account for Miller syndrome in the affected siblings. By exploiting some other sources of information, they found what they were looking for. Daniel MacArthur covered the story in his post, "Disease hunting with whole genome sequences: the good news, and the bad news".

I got interested in another aspect of the story. With whole-genome sequences of parents and offspring, it becomes possible to directly determine the rate of mutations in each generation. The paper by Roach and colleagues did just that -- they counted 28 in the 2.3 billion bases of sequence they included in their comparison. That makes a per-site mutation rate of 1.1 x 10^-8 per generation.

Which is a pretty interesting number. You see, it's less than half what it ought to be:

[O]ur estimated human mutation rate is lower than previous estimates, the most widely cited of which is 2.5 x 10^-8 per generation (10) based on three parameters: a human-chimpanzee nucleotide divergence per site (Kt) of 0.013, a species divergence time of five million years ago, and an ancestral effective population size of 10,000. More recent estimates indicate a nucleotide divergence of 0.012 (9), species divergence time between six and seven million years ago (11–15), and ancestral effective population size between 40,000 and 148,000 (16–19). With these parameter ranges and a generation length of 15 to 25 years, the mutation rate estimate is between 7.6 x 10^-9 and 2.2 x 10^-8 per generation, which is consistent with our intergenerational estimate of 1.1 x 10^-8. Our estimate is within one standard deviation (SD) of an earlier estimate of 1.7 x 10^-8 (SD: 9 x 10^-9) based on 20 disease-causing loci (20). The rate we report is for autosomes, and should be several-fold lower than that of the Y chromosome, as in the male germline more cell divisions occur per generation. Though our rate differs approximately as expected from the recently reported estimate of 3.0 x 10^-8 (95% CI: 8.9 x 10^-9 – 7.0 x 10^-8) for the Y chromosome, the error rates make this difference not significant (21).

You can see the obvious implication: If this mutation rate is accurate, then the average human-chimpanzee gene divergence has to be up around 11 million years ago. That can be accommodated with a 7-million-year-old species divergence only if we assume a very large ancestral population -- on the order of 50,000 or higher. Or, the ancestral effective size could be lower -- but that would make the species divergence substantially older -- 9 million years or more.

There is a second implication. Most studies of human genetic variation have assumed that 5-million-year-old human-chimpanzee divergence and the high associated rate of mutations. If the true rate is less than half that, then the coalescence times of human genes are more than double most estimates. That would include our estimates of human-Neandertal genetic differences.
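The arithmetic behind these two implications can be sketched in a few lines of Python. The divergence of 0.012 and the two mutation rates come from the passage quoted above. The 20-year generation length (the middle of the quoted 15-to-25-year range) and the rule that gene lineages coalesce on average 2Ne generations before the species split are simplifying assumptions for illustration, not the calculation used by Roach and colleagues.

```python
# Back-of-envelope check of the divergence arithmetic (illustrative only).
K = 0.012         # human-chimpanzee divergence per site (from the quoted passage)
MU_NEW = 1.1e-8   # per-site mutation rate per generation (the new family estimate)
MU_OLD = 2.5e-8   # older, widely cited rate (from the quoted passage)
GEN_YEARS = 20    # ASSUMPTION: generation length, mid-range of the quoted 15-25 years


def gene_divergence_years(k, mu, gen_years):
    """Average age of the common ancestor of a human and a chimpanzee gene copy.

    Mutations accumulate along both lineages, so k = 2 * mu * (T / gen_years).
    """
    return k * gen_years / (2.0 * mu)


def species_split_years(t_gene, ne, gen_years):
    """ASSUMPTION: lineages coalesce on average 2*Ne generations before the split."""
    return t_gene - 2.0 * ne * gen_years


def ne_for_split(t_gene, t_split, gen_years):
    """Ancestral effective size needed to reconcile gene divergence with a split date."""
    return (t_gene - t_split) / (2.0 * gen_years)


t_gene = gene_divergence_years(K, MU_NEW, GEN_YEARS)
split_at_50k = species_split_years(t_gene, 50_000, GEN_YEARS)
ne_for_7my = ne_for_split(t_gene, 7e6, GEN_YEARS)
rate_ratio = MU_OLD / MU_NEW  # factor by which mutation-scaled dates stretch

print(f"gene divergence:            {t_gene / 1e6:.1f} million years")
print(f"species split at Ne=50,000: {split_at_50k / 1e6:.1f} million years")
print(f"Ne needed for a 7 My split: {ne_for_7my:,.0f}")
print(f"old rate / new rate:        {rate_ratio:.2f}")
```

Under these assumptions the gene divergence comes out near 10.9 million years, an ancestral size of 50,000 puts the species split near 8.9 million years, and a 7-million-year split needs an ancestral size close to 98,000. Dates calibrated on the older rate stretch by a factor of about 2.3. All of these numbers shift with the assumed generation length, so treat them as rough.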

Well, that's a fine pickle.

I'm not quite ready to believe the very low rate estimate. The analysis in this paper uncovered tens of thousands of false positives, and had to filter through those to arrive at 28 true mutations. The filtering involved resequencing all the positives to determine which were true and which were false, but maybe there's room in there for a substantial number of false negatives, too.

If this low estimate were true of the human-chimpanzee divergence, it would imply vastly higher ages for other primate divergences, or a much lower rate on the human lineage specifically. So that allows another check on the process.

But generally, I'll be looking at whole-genome family comparisons with great interest, because they will give us a much more precise understanding of the rate of mutations and recombinations across the genome.

References:

Roach JC and 14 others. 2010. Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Science (early online) doi:10.1126/science.1186802

Genetics and archaeology, 2

I've just received the book, Climate Change in Prehistory: The End of the Reign of Chaos, by William Burroughs. I'll be reading it and reviewing it during the next couple of weeks.

For the time being, I found a short passage of the book's introduction that helped me to put into words something I've been thinking about this week.

Before this passage, Burroughs has described the sources of new evidence about climate and its effects on humans in the past. One of these areas is genetics, in particular the emergence of mtDNA and Y chromosome haplotypes as markers relevant to ancient migrations. The other is Greenland and Antarctic ice cores, which by 2004 had allowed coarse-scale temperature reconstructions over the last 800,000 years or so.

After these, he discusses archaeology -- what we might usually consider to be the most direct source of information about humans in the past. But as Burroughs describes the situation, the relevance of archaeology is somehow fundamentally more difficult to describe:

It is often easier to write with confidence on fast-developing and relatively new areas of research, such as climate change and genetic mapping, than to review the implications of such new developments for a mature discipline like archaeology. Because the latter consists of an immensely complicated edifice that has been built up over a long time by the painstaking accumulation of fragmentary evidence from a vast array of sources, it is hard to define those aspects of the subject that are most affected by results obtained in a completely different discipline. Furthermore, when it comes to many aspects of prehistory, the field is full of controversy, into which the new data are not easily introduced. As a consequence, there is an inevitable tendency to gloss over these pitfalls and rely on secondary or even tertiary literature to provide an accessible backdrop against which new developments can be more easily projected (Burroughs 2005:10).

I think this is a revealing quote. From the standpoint of someone describing an emerging science, as Burroughs is doing in the book, there must be intense frustration. It seems so simple when you compare climate data and genetic data. Humans underwent some catastrophic population declines in the past, and there were big climate fluctuations. What could be simpler? But then, you get to the archaeological record where nothing is simple at all.

Imagine the author had written the paragraph above as an exercise in self-reflection. Either of two things might logically follow:

1. ... and therefore the simple conclusions of the immature sciences may be wrong.

or

2. ... and therefore those wishy-washy archaeologists had better get their act together.

I won't prejudge which of these Burroughs comes to -- for that, I'll need to review the rest of the book. But you can see the temptation to arrive at the second -- the supposedly "mature" science is hopelessly mired in meaningless debates. The new sciences of genetics and climate change will finally bring simplicity and allow a new revolution of archaeological insight.

I'd like to write a few words in favor of maturity.

What marks a "mature" discipline is the emergence of informed critiques focused on the limits of methods of analysis. When archaeology was immature, before the 1950s or so, almost all archaeologists were simple (some say "naive") positivists. They excavated and found the traces of ancient people, just as today's archaeologists do. And what they found was what there must have been. Find a handaxe, you know people made handaxes; find a temple, you know they worshipped gods of some kind. Dig in a mound, find a grave, you know that the people had rituals associated with death that required substantial non-subsistence directed labor.

Of course, today's archaeologists tend to be positivists, too. There's no sense twiddling around with hypotheses that will never be testable. The religion of Neandertals? Well, it's one thing to speculate about it, but the fact is that it's devilishly hard to test hypotheses about religion from the material remains of any pre-monumental culture. In the absence of information, we may as well stick to the facts.

But there's a deeper sense in which archaeologists have a much more complicated view of their evidence. Archaeology has gone through many periods where different researchers developed and applied distinctive analytical techniques. These techniques have often been incommensurable. Sometimes they settle debates. For example, the systematic study of skeletal element representation and cutmark taphonomy has gone far toward testing (and verifying) the occurrence of hunting in some Early Pleistocene contexts. The hunting versus scavenging debate still goes on, with renewed emphasis on active or confrontational scavenging. But knowledge advanced by means of analytical critique.

These kinds of internal critique have fueled many of the great debates in archaeology. For example, the technical standardization promoted by François Bordes enabled a new kind of systematic comparison of assemblages with each other. But those new data gave rise to several vociferous differences of interpretation. Where Bordes had favored a cultural interpretation of site differences, Lewis Binford critiqued the emerging pattern along functional lines. Later Harold Dibble and others critiqued the stability of artifact types, noting the emergence of some categories as side effects of the reduction sequence. These critiques did not lead to any quick resolutions, but they allowed archaeologists to deepen our understanding of the cognitive and functional circumstances of artifact production and transmission. They taught us the limits of comparison by showing the weakness of particular artifact types as markers of cultures.

In human genetics, we have the assumption that particular haplotypes are markers of populations. Critiques of that assumption go back more than fifteen years, but I think it fair to say that they have not taken hold. It's worth asking, "Why not?" Why does a tradition of effective critique emerge in some areas of science but not others?

A large part of the answer is the culture of practice in human evolutionary genetics. Let me give an example. Last week, I had my students read a selection of review papers published this month in Current Biology. I mentioned those papers here a couple of weeks ago ("Genes and archaeology"). These papers are reviews of the basic findings of genetics as applied to the last 50,000 years of evolution in most of the major regions of the world.

Toward the end of our session, I asked, "What methods did you find unifying this set of papers?" That is, what basic methodology do they have in common?

The students really couldn't find any shared methodology, beyond a few issues strongly connected to the data. For example, there was a shared reliance in most of the papers on the two uniparentally inherited gene systems -- mtDNA and the Y chromosome. Several of the papers came down to issues regarding the exact mtDNA chronology, and none of them seemed to deal seriously with the discrepancies between mtDNA and Y chromosome timescales. But when it came to methods of analysis -- how do we go from genotypes and haplotypes to some knowledge that populations had a particular history -- the papers had no systematic way of answering those questions.

The demographic models developed to test hypotheses about human evolution are different in almost every study of human genetic variation. Since our evolutionary history has been complicated, simple mathematical models won't often be very effective tests of events in our evolution. So we need to apply simulation modeling of various kinds.

The necessary computer programs tend to be written by graduate students and postdocs. Principal investigators -- the scientists in charge of the lab -- are rarely directly involved in this kind of work in human genetics, although there are exceptions. The development of distinctive simulation methods in many different labs raises important issues about replicability and code quality -- some students document their code well and have extensive backgrounds in computer programming, but most do not. This situation is terrible from the standpoint of developing a shared analytical methodology -- when the students leave the lab, or when the dataset changes, the next group of students and postdocs usually ends up developing new methods.

Some groups work with standardized simulation code that has published documentation. But the students and postdocs apply distinctive parameters that rarely match those used by other research groups. That is, the programs may be standard, but the parameters are idiosyncratic. Maybe they choose parameters that provide the best fit to a particular dataset. Or maybe they choose them through a set of discussions at the laboratory level. In any event, when the data change, and when the students and postdocs change, the models change.

That means the results of different studies may be incommensurable, even if they look the same. A reviewer who just reads the conclusions of such analyses may think that they are all consistent with the same story -- even though the simulations in one paper actually may contradict the results of other papers. Papers appear unified at the level of conclusions, but not by virtue of having a shared system of methods.
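A toy illustration of how this can happen, in Python. It assumes the standard neutral expectation that pairwise nucleotide diversity is about 4 x Ne x mu and that two autosomal copies coalesce on average 2Ne generations ago. The two "labs" and all of their parameter values are hypothetical, chosen only to show that different parameter sets can reproduce the same summary statistic while implying different histories.

```python
def expected_diversity(ne, mu):
    """Expected pairwise nucleotide diversity under the standard neutral model."""
    return 4.0 * ne * mu


def coalescence_years(ne, gen_years):
    """Expected age of the common ancestor of two autosomal copies (2*Ne generations)."""
    return 2.0 * ne * gen_years


# HYPOTHETICAL parameter choices for two research groups fitting the same data.
lab_a = {"ne": 10_000.0, "mu": 2.5e-8, "gen_years": 20}
lab_b = {"ne": 10_000.0 * 2.5e-8 / 1.1e-8, "mu": 1.1e-8, "gen_years": 20}

pi_a = expected_diversity(lab_a["ne"], lab_a["mu"])
pi_b = expected_diversity(lab_b["ne"], lab_b["mu"])
t_a = coalescence_years(lab_a["ne"], lab_a["gen_years"])
t_b = coalescence_years(lab_b["ne"], lab_b["gen_years"])

print(f"lab A: diversity {pi_a:.4f}, mean coalescence {t_a:,.0f} years")
print(f"lab B: diversity {pi_b:.4f}, mean coalescence {t_b:,.0f} years")
```

Both hypothetical groups match a diversity of 0.001, yet their implied coalescence times differ by more than a factor of two. A reader who sees only "the model fits the data" in each paper would not notice the disagreement.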

Now, what does archaeology have to do with this? Well, in the case of human evolution, we have an archaeological record. It would be sensible for archaeologists to contribute to the project of genetic modeling and simulation methods -- that way, we would be testing models that could be critiqued on the basis of archaeological reality as well as genetics. But the students and postdocs who develop simulation models in genetics don't know archaeology. And most of the archaeologists don't know genetics -- so they discuss models only at the level of conclusions, not at the level of parameters.

The tradition in archaeology for the last fifty years has supported the development of robust critiques. Likewise, the tradition in evolutionary genetics has supported such developments -- witness the rise of neutral theory, the "selfish gene" revolution, the innovation of evolutionary game theory. Each of these involved the discovery of weaknesses in old population models, based in part on a growing program of empirical research on natural populations and mathematical models.

I don't want to push this comparison beyond reason. There is a point of overcaution -- of superfluous critique that can impede progress. Archaeologists have beached themselves on the shoals of such critiques many times.

But human evolutionary genetics remains immature. We should be cautious about the details of population models, and we should try to identify lines of critique that will improve them. Some critiques have begun to emerge, and I will be highlighting those over the next several weeks in my course. In addition, I'll be discussing some lines of inquiry based on open access datasets that will illustrate problems in recent human evolution, along with some potentially productive approaches for solving them.

Actress Glenn Close joins the ranks of the genomed; Daniel MacArthur discusses the celebrity genomics trend.

He covers in greater detail the James Lupski genome story, in which the geneticist sequences his own genome to find out what causes his own genetic disorder, Charcot-Marie-Tooth disease. Beside that success story, he places a second study this week that had a lot more trouble -- a case in which complete genome sequencing of four members of a family could not by itself find the causative variant for two siblings' Miller syndrome.

The basic problem here is that we're still extremely bad at differentiating between mutations causing serious disease and perfectly benign polymorphisms - each of us have genomes littered with genetic variants that look like nasty mutations but have little or no effect on health. In fact, Lupski's genome illustrates this nicely: one of the mutations causing his disease is a premature stop codon that disrupts the function of a gene - but his genome also contains an additional 120 stop codons disrupting other genes, presumably without severe health effects.

So all of us are walking around with hundreds of gene-disrupting variants, and finding the single causative gene amongst all that noise is seriously challenging.

We've been talking about stop codons and pseudogenes a lot here in the Hawks lab this week.

Remember Genome 10K? Well, here's a new study by Michel Milinkovitch and colleagues that points out the deficiencies of comparative data from 1X genomes:

2× genomes - depth does matter

Here, using recently-developed comparative genomic application systems, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing eukaryote genome evolution through gene duplication. We demonstrate that, when performing inference of genome content evolution, low-coverage genomes generate not only a massive number of false gene losses, but also striking artifacts in gene duplication inference, especially at the most recent common ancestor of low-coverage genomes. We show that the artifactual gains are caused by the low coverage of genome sequence per se rather than by the increased taxon sampling in a biased portion of the species tree.

They conclude that a diversity of 1X genomes may not be as useful as a smaller number of genomes at higher coverage. Wide coverage is good for testing conserved loci, but deep coverage will be necessary for many other kinds of comparisons.
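One way to see why depth matters, as a rough Python sketch. It assumes reads land uniformly at random, so that per-base depth follows a Poisson distribution (the classic Lander-Waterman idealization). That model is an assumption for illustration here, not the method used by Milinkovitch and colleagues, and the function names are made up for the example.

```python
import math


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by zero reads at a given mean coverage."""
    return math.exp(-coverage)


def fraction_depth_at_least_two(coverage):
    """Expected fraction of bases covered by two or more reads (Poisson model)."""
    return 1.0 - math.exp(-coverage) * (1.0 + coverage)


# Mean coverages to compare: 1X, 2X, and a deeper 6X genome.
missing = {c: fraction_uncovered(c) for c in (1, 2, 6)}
confirmed = {c: fraction_depth_at_least_two(c) for c in (1, 2, 6)}

for c in (1, 2, 6):
    print(f"{c}X: {missing[c]:.1%} of bases unsequenced, "
          f"{confirmed[c]:.1%} seen at least twice")
```

Under this idealized model about 37% of bases go unsequenced at 1X and about 14% at 2X, so a gene that is simply missing from a low-coverage assembly can easily be mistaken for a gene loss.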

References:

Milinkovitch MC, Helaers R, Depiereux E, Tzika AC, Gabaldón T. 2010. 2X genomes -- depth does matter. Genome Biology 11:R16. doi:10.1186/gb-2010-11-2-r16

From Razib: "Creative destruction in the personal genomics industry?"

I’m hearing about rumblings at 23andMe, and not in a good way.

Last year: "23andMe co-founder Linda Avey leaves."

Is there a common coding variant of FOXP2 in southern Africa?

Today I was looking through the online data files for the South African genome. Those online files are available from the Data Libraries entry of the Galaxy bioinformatics tool website.

I noted last week that some of the most interesting data -- in particular, the genotypes for new SNPs -- are not yet available to download ("Online toolkits -- the good and the frustrating"). But in the meantime there are some very interesting things there. In particular, the sequencing team has made available a list of amino-acid-coding mutations present in one or more of the five individuals (four Bushmen and Desmond Tutu) for whom the team obtained exome sequence.

If you look at the summary information for this list, it gives the position of each amino-acid-coding mutation against the human reference genome (hg18), along with the position and identity of the amino acid change. It then gives a "prediction" of whether the mutation is damaging to gene function.

This kind of prediction can be very misleading. The categories of effects include "tolerated" and "damaging", but these are based on whether the site tends to be conserved in other mammal lineages, and whether the new amino acid is very different in affinity (and possible conformation) compared to the reference. There's no "beneficial" -- even though some fraction of these polymorphisms are probably retained because of selection on the mutant allele.

I say that because one of the five individuals (TK1) has an amino-acid-coding mutation in FOXP2.

Yeah, that surprised me when I found it.

As you'll remember, the coding sequence of FOXP2 is pretty strongly conserved in other mammals. Two amino-acid-coding substitutions in humans separate us from other primates, and an additional one separates primates from the mouse (Enard et al. 2002). This area of the genome looks like it has undergone a recent sweep in human populations, with relatively little variation and a strong excess of rare mutations surrounding the gene. Coop and colleagues (2008) gave a point estimate of the time of a sweep in humans as 42,000 years ago, which I wrote about at the time ("FOXP2 is really recent, it really did introgress (if it's not contamination)"). That estimate has to be massively too young -- it's not plausible that a sweep could be that recent and fixed worldwide.

Meanwhile, last year, Ptak and colleagues (2009) followed up on my suggestion that there might really have been a recent sweep, but one near FOXP2, instead of involving one of the two human amino acid substitutions. They found statistical linkage between flanking sites immediately around the gene, which would be unlikely after a fixed sweep of FOXP2 itself. That linkage is quite likely if the human-specific substitutions were already fixed, and much later another nearby site underwent a partial sweep. It remains to be demonstrated, however, what nearby site is a plausible candidate for a recent partial sweep.

So, finding variations near FOXP2 is very relevant to the history of this gene region. If there is an ongoing sweep involving some site near the gene, we should expect that some human populations haven't undergone the sweep yet, or have the selected haplotype at a lower frequency than others. The existing datasets from Africa -- mainly HapMap and HGDP sets -- are insufficient to test the hypothesis because they include only common SNP variants at low density. But sequence data from South Africa can give us a direct estimate of the nucleotide diversity around FOXP2, thereby letting us test for the presence of a recent sweep.

The amino acid coding variant in one of these Bushman genomes came to me as a total surprise. Using the alignment with hg18, the location of the mutation is at position 114089380 on chromosome 7. The mutation changes a leucine in the wild-type sequence to a proline in the mutant, and the algorithm classifies it as "damaging" -- probably because the two residues are very different in their hydropathy. This position is not one of the two human-specific amino acid substitution sites. In fact it is in the forkhead box domain of the protein itself, which is the DNA-binding motif. Without going further into the biochemistry, I really can't guess what the effect of the mutation would be. I'm not really sure it's relevant -- after all, if it is a singleton in the population it might well be a recessive with no effect on the carrier phenotype.

Still, the mutation could be common in the Bushman population. Our point estimate of the mutation's frequency is one in eight. Maybe it's a new variant that confers some advantage; maybe it's a result of a founder effect tens of thousands of years ago. It could even be widespread within Africa. We won't know until we have more genomes.
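To put a number on that uncertainty, here is a minimal Python sketch of an exact (Clopper-Pearson) 95% binomial interval for one copy observed among eight chromosomes. It assumes the four Bushman genomes contribute eight independent chromosomes and that TK1 carries a single copy of the variant. The function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion."""
    # Lower bound: the p at which seeing k or more copies has probability alpha/2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which seeing k or fewer copies has probability alpha/2.
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


lower, upper = clopper_pearson(k=1, n=8)
print(f"point estimate {1 / 8:.3f}, 95% interval {lower:.4f} to {upper:.4f}")
```

The interval runs from roughly 0.3% to 53%, which is another way of saying that eight chromosomes cannot distinguish a rare variant from a common one.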

The mutation is not in any of the regions sequenced by Krause and colleagues (2007) in the Neandertals from El Sidrón. I wouldn't expect it to be there -- as a derived variant, it would be unlikely to evolve in parallel in Neandertals and southern African populations. But who knows what else we'll find?

References:

Coop G, Bullaughey K, Luca F, Przeworski M. 2008. The timing of selection at the human FOXP2 gene. Mol Biol Evol 25:1257. doi:10.1093/molbev/msn091

Ptak S, Enard W, Wiebe V, Hellmann I, Krause J, Lachmann M, Pääbo S. 2009. Linkage disequilibrium extends across putative selected sites in FOXP2. Mol Biol Evol 26:2181-2184. doi:10.1093/molbev/msp143

Krause J, Lalueza-Fox C, Orlando L, Enard W, Green RE, Burbano HA, Hublin J-J, Bertranpetit J, Hänni C, Fortea J, de la Rasilla M, Rosas A, Pääbo S. 2007. The derived FoxP2 variant of modern humans was shared with Neandertals. Curr Biol 17:1-5. doi:10.1016/j.cub.2007.10.008

Enard W, Przeworski M, Fisher SE, Lai CSL, Wiebe V, Kitano T, Monaco AP, Pääbo S. 2002. Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418:869-872. doi:10.1038/nature01025

Schuster SC and many others. 2010. Complete Khoisan and Bantu genomes from southern Africa. Nature 463:943-947. doi:10.1038/nature08795

Genes and archaeology

Current Biology has released a special issue titled "Global genetic history of Homo sapiens". There is much of interest in this issue, with seven papers, mostly regionally focused in different parts of the world, but one paper by Jonathan Pritchard and colleagues discussing recent adaptive evolution.

The geneticists to varying extents in this volume depend on archaeological observations, but in many cases read the archaeology very selectively. Speaking as someone who takes archaeology seriously, I find this very frustrating. With more genetic data, we need to demand

An editorial by archaeologist Colin Renfrew leads off the special issue ("Archaeogenetics -- towards a 'new synthesis'?").

Today, we have an abundance of data about the genetic variation of living people that we did not have ten years ago. In addition to our samples from living populations, we are beginning to find a trove of information about ancient people, from DNA extracted directly from skeletal material. But despite the attempts of geneticists and (rather pitifully few) archaeologists, I don't see a "new synthesis" emerging.

Reading the first paragraph of his editorial, it seems to me that Colin Renfrew agrees:

It seems a timely moment to review human population history of the five continents as it emerges from recent archaeogenetic studies, as summarised in the reviews of this special issue of Current Biology. Has the ‘new synthesis’ — between genetics, archaeology and linguistics — arrived which I, perhaps incautiously, heralded a few years ago [1]? These highly informative reviews document, it seems to me, both achievement and uncertainty: the achievement relates to the remarkably consistent picture which has now emerged about the out-of-Africa emergence of our own species Homo sapiens and the initial peopling of the Earth. The uncertainty involves the application of archaeogenetics to the more recent, Holocene period, when most of the planet was already peopled — except much of Oceania — and sedentary, farming-based communities emerged. Here, it appears that much of our current understanding still depends on archaeological or, sometimes, linguistic evidence. And, with a few exceptions, the archaeogenetic evidence has not yet been assimilated into a genuine synthesis; but, let us begin with the good news.

I find it a markedly bad sign that Renfrew thinks the best of "archaeogenetics" is the part with the least archaeological evidence. If the genetics doesn't seem to work where there is abundant archaeology, why should we believe the genetics in cases where the archaeology is poor?

I write that quite seriously, as someone engaged directly with the genetics. It's too easy to make stuff up. How can you test a hypothesis that seems consistent with genetic data? The obvious approach is to try to falsify the hypothesis with archaeological observations -- but sadly, archaeology is often pitifully silent on the subject of demography and gene flow, or there are many scenarios equally consistent with the same archaeological record.

In the Holocene, archaeology has a lot of power to rule out hypotheses about demography and population movement. So this is where I want to see serious attempts to falsify archaeological models using genetics. And that's what we're starting to get! The finding from ancient DNA that early European farmers were neither closely related to earlier hunter-gatherers nor to later agriculturalists has been very surprising. It seems to reject the hypothesis that today's gene distributions come from an initial dispersal of farmers with their Indo-European languages -- the European component of the so-called "language-farming hypothesis".

Why? Well, because a later massive genetic change suggests that the language transition may well have happened a lot later (as suggested by much of the linguistic evidence itself), and the mtDNA haplotypes carried by the early European farmers have no clear relationship to Near Eastern or central Asian populations.

It's no surprise that Colin Renfrew would find disagreements with this genetic work; he's the biggest supporter of the "language-farming hypothesis".

But I think that the current situation is very healthy. Geneticists are testing hypotheses and showing them to be false. At the same time, they're proposing models that archaeology can easily show to be false. For example, many recent evaluations of adaptive evolution have looked for genetic outliers against a "neutral" population model that involves very small Holocene population size. From the genetic perspective, this small population size assumption is conservative -- it means that some genuine cases of adaptive evolution will look less statistically significant. But archaeology can actually inform us about these cases. Any scenario in which the Holocene population was smaller than millions of individuals must be false. In many cases, a less conservative model is in order.
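Here is a minimal sketch of why the small-population assumption is conservative. It uses the standard result for neutral Wright-Fisher drift at constant size, Var(p_t) = p0(1 - p0)[1 - (1 - 1/(2N))^t], and a crude normal approximation to score how surprising a frequency change is. The starting frequency, ending frequency, number of generations and population sizes are all hypothetical numbers chosen for illustration, not values from any of the papers.

```python
def drift_variance(p0, n_diploid, generations):
    """Variance of allele frequency after neutral drift in a constant-size population."""
    return p0 * (1 - p0) * (1 - (1 - 1 / (2 * n_diploid)) ** generations)

def drift_z_score(p0, p_now, n_diploid, generations):
    """How many drift standard deviations separate p_now from p0 (normal approximation)."""
    return (p_now - p0) / drift_variance(p0, n_diploid, generations) ** 0.5

# Hypothetical case: an allele rises from 10% to 60% over 400 generations
# (about 10,000 years at an assumed 25 years per generation).
for n in (1_000, 10_000, 1_000_000):
    z = drift_z_score(0.1, 0.6, n, 400)
    print(f"N = {n:>9,}: z = {z:.1f}")
```

The same frequency change looks far more extreme against a null model with millions of people than against one with a thousand. A small assumed population lets drift explain more, so real cases of selection look less significant.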

I think there are tremendous opportunities for integrating adaptive evolution with our understanding of demography. I don't put a lot of faith in the current storyline about genetics and the earlier part of prehistory. That story will continue to develop as we deepen our understanding of the demographic and adaptive factors that have shaped human genetic variation within the last 50,000 years.

References:

Renfrew C. 2010. Archaeogenetics -- towards a 'New Synthesis'? Curr Biol 20:R162-R165. doi:10.1016/j.cub.2009.11.056

Razib lists a taxonomy of culture-gene historical scenarios. Real worked examples for several of these would be worthwhile.

It's now several years since I've noticed a lot of interest in the project of correlating gene trees and language trees. That may be because human geneticists have reflected on the importance of geography -- which in most cases seems stronger than any culture-historical factor in explaining allele frequencies. Or maybe it's because nobody ever really understood the "synthetic map" approach.

Most of the people interested in culture history accounts of migration have focused on Y and mtDNA haplotypes, but I think there's room for new work on SNP genotypes and population history. We need some better models of culture contact and demography, and we need to integrate selection with the models.

Amy Harmon reappears in the NY Times science page this week, with a series on the clinical trials of a targeted cancer drug ("A Roller Coaster Chase for a Cure").

Dr. Flaherty, who has a near-photographic memory, was not accustomed to rereading. But in his campus office that morning, he scrolled through the article on his computer again to be sure he had understood. The presence of the same B-RAF mutation in so many cancers, he thought, meant it was one of the biggest genetic smoking guns yet identified in cancer. A drug that blocked the protein made by the defective gene might have enormous consequences for patients — and he knew of one that just might work.

This is where the "rubber" of personalized medicine "hits the road", so to speak -- if we can find drugs that treat the specific mutations that cause a person's cancer, then there may be hope in other kinds of interventions targeted to a particular genotype.

Online toolkits -- the good and the frustrating

In pursuit of my DIY genomics posts, I've been playing around with the Galaxy bioinformatics web tools. The team responsible for the South African genomes published the data to Galaxy, and their uploads are easy to get -- either to download, or to work with the online Galaxy platform.

Working with a resource like this helps to illustrate both how tremendously useful bioinformatics tools can be and how frustrating it can be to figure them out. Some things are a breeze, although others are completely obscure. Documentation for the uploads is skimpy so far. One thing that drove me up the wall is that SNPs are listed by genome, but without indicating genotypes -- is the individual a homozygote or a heterozygote? The paper by Schuster and colleagues describes their genotype calling procedure, but the results turn out not to be posted along with their other data. I'm sure they'll become available as the data are updated, but I did waste some time figuring out how the releases correspond to descriptions in the online supplementary material from the paper.

Despite occasional frustrations, we seem to be heading in the direction of all-in-one online bioinformatics toolkits. Galaxy, for example, lists several advantages on a promo page. A couple of entries:

Now your results are reproducible! | When publishing results, replace “the data were analyzed using a collection of in-house scripts” with a URL pointing to Galaxy’s history. Your reviewers will have no further questions. That’s reproducible genomics!

...

No tools for new datatypes | Some datatypes generated by high throughput genomics are so new that there are no tools to analyze them. For example, how do you extract sequences of coding exons from the latest 28-way alignments of vertebrate genomes or analyze quality scores from 454/Solexa/SOLiD? With Galaxy.

I live at the mathematical end of this stuff. I work with models of populations and assume that sequences are known, you know, as if we looked at them and read off the ACGT's. But in reality, a lot of complexity lies between models and the biochemistry. Going from sequencing reads to genomes, and aligned genomes, involves a lot of analysis. Many of the details differ entirely between different sequencing platforms. As we continue to move toward whole-genome analyses of populations and other species, it's really important to have an abstraction that allows for different underlying sequencing models, while allowing replication of the population genetics modeling.

The disadvantage of a single widely used tool is that it can limit creativity and lock people into a certain way of processing data. Locked-in assumptions sometimes lead to wrong conclusions -- as we've seen in human genetics many times over the years. But the advantage is that it allows everybody access to the same methods and data, so that results can be replicated and augmented with new observations.

Mailbag: Pearls for the swine

Re: "Genetic lapidaries":

Hope all is well in Wisconsin. Regarding your post on the new Southern African genomes and the tendency to discuss phenotypic associations...

Isn't this just a product of the privileged status genetic technology gets in general and the way in which students are taught genetics in introductory biology courses? Students get taught that genotype > phenotype and get great Mendelian examples which make the process very comprehensible even if they actually underplay, to the point of misinforming the students, the actual complexity of that relationship. Likewise the reporting and general discussion of genetics has always highlighted its knowledge generating capacity - "decoding the blueprint of life"... So when stories like this come out there is a latent expectation to hear about those simple genotype-phenotype associations like the ones you outline. This then becomes self-reinforcing. Obviously these genomes represent a huge amount of information - it is harder to accept and understand that they don't represent the amount of knowledge (at least yet) that we would like them to be.

What you say is true enough; that's probably where these authors are coming from. I mean, it's hard to blame the press when the paper serves up these little nuggets of "wisdom".

Discovering Mendelian associations that differ in frequencies in different populations is a good start. I only object because the ones they've "discovered" are the ones that we already knew about!

Anyway, seems to me that the thing to do is figure out how many of the "damaging" amino acid substitutions actually are new things in the (European) reference sequence that have been selected recently. We may have a bit of a statistical problem there -- if you want to test hypotheses about 13,000 or so amino acid variants, the Bushmen are not a very large sample...
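A rough sketch of that statistical problem, under loudly stated assumptions: a Bonferroni correction across 13,000 tests, a sample of eight chromosomes, and a one-sided binomial test against a null allele frequency of 0.5. The sample size and the null frequency are hypothetical numbers for illustration only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def smallest_attainable_p(null_freq, n_chromosomes):
    """Most extreme one-sided binomial p-value: every sampled chromosome carries the allele."""
    return null_freq ** n_chromosomes

threshold = bonferroni_threshold(0.05, 13_000)
best_case = smallest_attainable_p(0.5, 8)   # assumption: 8 chromosomes, null frequency 0.5
print(f"threshold {threshold:.2e}, best attainable p {best_case:.2e}")

# How many chromosomes before even the most extreme outcome could pass the threshold?
needed = 1
while smallest_attainable_p(0.5, needed) > threshold:
    needed += 1
print(f"chromosomes needed under these assumptions: {needed}")
```

With these numbers, even a variant carried by every sampled chromosome could not reach significance after correction, which is the sense in which the sample is not very large.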
