john hawks weblog

paleoanthropology, genetics and evolution

sequencing

  • Everyday genomes

    Fri, 2013-03-29 20:46 -- John Hawks

    From the Guardian, a pause to consider how ordinary complete genome analysis has become: "Genome research: discovery as an everyday event".

    When the Human Genome Project was completed in April 2003, it was hailed as biology's equivalent of the moon landing. Ten years on, what began as costly, painstaking and uncertain science has become commonplace.

    Researchers now have the entire genomes of more than 4,000 species – pathogens such as salmonella, leprosy and tuberculosis, parasites such as the malaria plasmodium, insects such as the fruit fly and the malarial mosquito, crops such as maize, the grape and the golden delicious apple, mammals such as the dog, the African elephant, the laboratory mouse and the chimpanzee. One consortium is comparing the genetic texts of a thousand human beings; another is assembling all the variations that might explain differing susceptibilities to disease, and differing responses to the same drugs; a third is using inherited markers to build up a detailed picture of the great journey of homosapiens [sic] out of Africa 70,000 years ago to colonise almost the entire world.

    It is glorious, because it means that we can now do comparative science instead of "big science" on genomes. There's a whole lot of opportunity for those of us who concentrate on statistical and analytical methods of comparing populations, instead of single genomes. And it will become more and more possible to do good work with new data, as new data become cheaper and cheaper to obtain.

    Meanwhile, I'm tired of "the great journey out of Africa". Clue to writers: Most of our species stayed in Africa, where the majority of humans still lived some 20,000 years ago. When those genetic markers are doing a better job of telling us what happened to our African ancestors, I'll have more confidence about the story of how a minority of them left Africa.

  • A new high-coverage Neandertal genome

    Wed, 2013-03-20 00:32 -- John Hawks

    Today, Svante Pääbo's group at the Max Planck Institute for Evolutionary Anthropology released high-coverage sequence data from a toe bone from Denisova Cave. The release comes a year after the same group made available the high-coverage genome of the Denisova finger bone, which in turn came several months before they published the first high-coverage analysis of that ancient genome [1]. Today's announcement is here: "A high-quality Neandertal genome sequence". This is the second high-coverage genome from Denisova Cave. Unlike the finger bone genome, the toe has produced a genome very much like those of Neandertal specimens from much farther west, including the Vindija Neandertals.

    Something interesting in these data: the presence of a Y chromosome.

    There's not so terribly much we can say about a toe. This particular bone was first reported in 2011 by Mednikova [2], who described the specimen's anatomy. She found the toe similar in some respects to equivalent Neandertal toe bones, but also like recent humans in a couple of details. Still, the anatomy wouldn't be enough to conclude that the bone is a Neandertal, because we don't know much about the toes of other ancient human populations.

    The genetics are fairly clear about the level of similarity of this new genome to other Neandertals. From the announcement:

    Similarity of Neandertals and Denisova genomes

    The figure shows a tree relating this genome to the genomes of Neandertals from Croatia, from Germany and from the Caucasus as well as the Denisovan genome recovered from a finger bone excavated at Denisova Cave. It shows that this individual is closely related to these other Neandertals. Thus, both Neandertals and Denisovans have inhabited this cave in southern Siberia, presumably at different times.

    This is a cluster diagram based on genome-wide similarity, which doesn't tell us about possible mixture among the populations. But it does show the high degree of similarity among the known Neandertals. This new specimen from Denisova (labeled "Altai") is a bit further from them than they are from each other, but not much. It will be interesting to assess this degree of similarity in comparison with the within-population similarity of more living human populations.

    I'm reluctant to accept a dichotomy of "Denisovan" versus "Neandertal". Distinguishing the samples in that way invites a typological assumption about the ancient people, giving an impression of distinctness that I'm not yet convinced about. It remains to seriously investigate the hypothesis that one or both of these putative samples reflect some amount of gene flow from the other, or from yet more ancient populations. But I suppose we're stuck with the "Neandertal from Denisova" and the "Denisovan from Denisova".

    Unless we go for "manual genome" versus "pedal genome", which is admittedly unappealing.

    There's not much meat in this announcement; that will wait for the full published analysis that we can expect later this year. The most important aspect of this, like the Denisova data availability from last year, is that we can now start working with the high-quality data. As someone who works with sequences, I cannot overstate the importance of having the best high-coverage data available for our work.

    I have a paper in preparation where I make a relevant analogy, in this case noting last year's high-coverage Denisovan genome in comparison to the history of ancient DNA sequencing:

    To put this into context: the original 360bp sequence from Feldhofer 1 has been memorialized on a cross-shaped plaque at the site outside Mettmann, Germany. This plaque is approximately 1 square meter in size. A similar monument to contain the Denisova high-coverage data would need to be more than 14 kilometers across. Compared to the first sequencing effort in 1997, today’s state of the art involves the generation of more than 200 million times more data.
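    The scaling in that comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes a square monument printed at the same density as the plaque (360 bases per square meter), and the function name is illustrative rather than taken from the paper.

```python
import math

FELDHOFER_BP = 360      # bases on the Feldhofer 1 plaque, from the text
PLAQUE_AREA_M2 = 1.0    # approximate plaque area, from the text
DATA_RATIO = 200e6      # "more than 200 million times more data"

def monument_side_km(data_ratio, base_area_m2=PLAQUE_AREA_M2):
    """Side length (km) of a square monument holding data_ratio times
    as much sequence as the plaque, at the same printing density."""
    area_m2 = data_ratio * base_area_m2
    return math.sqrt(area_m2) / 1000.0

side_km = monument_side_km(DATA_RATIO)
total_bases = FELDHOFER_BP * DATA_RATIO

print(f"side: {side_km:.1f} km, bases: {total_bases:.2e}")
# side: 14.1 km, bases: 7.20e+10
```

    Under those assumptions, a 200-million-fold increase in data means a 200-million-fold increase in area, and the square root of that area gives a side of a little over 14 kilometers.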

    It's a pretty awesome time for those of us exploring human evolution!


    Synopsis: 
    Noting the announcement of new data availability from Denisova
  • Privacy of genetic research participants

    Thu, 2013-02-07 00:01 -- John Hawks

    Misha Angrist, writing in Nature News, comments ("Genetic privacy needs a more nuanced approach") on the recent study that demonstrated the possibility of finding the true identities of research participants who provided anonymized DNA samples [1]. Adding some context to the study, Angrist discusses the current federal privacy regime, and the way that genetic research relies upon the anonymizing techniques now shown to be insecure:

    Although genetic data are considered protected health information under the HIPAA, many of the protections disappear when the information is ‘de-identified’ — that is, the 18 identifiers specified in the act (including names, addresses, birthdates and the like) are removed. And because genetic information is not one of those 18 identifiers, it does not need to be removed from health records to follow the letter of HIPAA privacy. If researchers do not know who you are, and cannot easily find out, then their obligations to you diminish by orders of magnitude. Furthermore, their protocols are less likely to need full review by an institutional review board; their grant applications become less onerous; and their technology costs go down.

    ...What if the absence of the 18 identifiers isn’t enough to protect someone’s identity?

    If genotyping becomes sufficiently cheap, and personal information sufficiently interlinked within corporate or government databases, then personal identification of genetic samples will be ubiquitous. The constraint on ubiquitous identification is not the cost of genotyping, which is already cheap enough for anyone motivated to identify a sample. The remaining constraint is the interlinking of databases.


  • Finding sequencing methods in the library

    Tue, 2013-01-08 23:38 -- John Hawks

    Jay Shendure and Erez Lieberman Aiden have a review in Nature Biotechnology that provides recent data on the falling cost and increased use of genome sequencing [1]. They emphasize the massive reduction in the cost of sequencing over the last seven years -- from $1000 per megabase to only 10 cents per megabase.
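    To get a feel for that rate of change, the two endpoints can be turned into a fold reduction and a rough halving time. This is a sketch that assumes a steady exponential decline between the endpoints, which is a simplification and not a claim from the review; the function names are illustrative.

```python
import math

def fold_reduction(start_cost, end_cost):
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost

def halving_time_months(start_cost, end_cost, years):
    """Months per halving of cost, assuming a constant exponential decline."""
    halvings = math.log2(start_cost / end_cost)
    return years * 12 / halvings

# Endpoints from the text: $1000 to $0.10 per megabase over seven years.
fold = fold_reduction(1000.0, 0.10)
months = halving_time_months(1000.0, 0.10, 7)

print(f"{fold:,.0f}-fold cheaper, halving about every {months:.1f} months")
# 10,000-fold cheaper, halving about every 6.3 months
```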

    What is more interesting about the article is that the authors concentrate on the strengths of different sequencing platforms for different biological projects. They point out that the cheapest technology may not be the best for many purposes, and that each application has its own requirements.

    They illustrate this with a "subway map" view, which traces the routes that different molecular techniques have followed, from one application to another, until they have come to be used for sequencing (the function at the "terminal").

    Subway map view of sequencing technology and applications, from Shendure and Aiden 2012

    From their later text:

    The subway map analogy suggests that the development of new applications is likely to be best supported by a broad knowledge of existing and emerging sequencing protocols as well as a willingness to delve into the past 50 years of methods development in biochemistry and molecular biology. These sources effectively provide a toolbox that can be drawn on when evaluating potential routes to support new applications.

    Of course, the next advances in sequencing methodology are probably already being developed by labs looking through these methods.


    References

    1. Shendure J, Aiden EL. The expanding scope of DNA sequencing. Nature Biotechnology. 2012;30(11):1084–1094.
  • Do we need an offshore data haven for genomes?

    Tue, 2012-07-31 10:16 -- John Hawks

    Razib Khan comments on 23andMe's pursuit of FDA clearance for their genome service:

    I still believe that on a deep level regulatory agencies don’t “get it.” Our own genotype and genome is going to be a cheap commodity in the next few years. Services like Promethease will proliferate to provide people open source information. Is openSNP going to the FDA anytime soon? The main reason that firms like 23andMe will go through regulatory hurdles is that they are, or aim to be, legitimate public entities. In other words this is an artifact of our institutions. Mind you, 23andMe et al. will probably always have slicker user interfaces, and there’s some value in that. But that doesn’t entail FDA oversight, does it?

    I think of the ongoing case of the caveman blogger: "North Carolina Tells Blogger That Providing Dietary Advice Is Illegal, Blogger Tells NC To Read The 1st Amendment". States and the federal government have no end of ways to make life difficult for people who comment publicly on health. Do we need an offshore data haven for genomes?

    The interest of FDA officials in regulating commercial entities offering personal genome interpretation services has been a notable story over the last two years. More details about the story from the 23andMe blog: "23andMe Takes First Step Toward FDA Clearance".

  • Human population history makes a difference

    Thu, 2012-05-10 16:18 -- John Hawks

    Alon Keinan and Andrew Clark have a short report in the current Science examining the effects of recent human population growth on the expected spectrum of human genetic variation [1]. Population growth skews the variation in a population so that there are many more rare alleles than would be expected in a constant-sized population.

    Why is this? In a constant-sized population, individuals have an average of two offspring who survive to have offspring of their own. Many people have no children at all, or only one, while only a small proportion of people have more than four children. In the constant-sized population, a person born with a new mutation would have a 50% chance of passing it on to each child. In such a population, more than a third (36%) of mutations aren't passed on even once. The same fraction are inherited by only one child, and these face the same odds of extinction in the next generation. This isn't natural selection; it is random genetic drift -- and its net result is that most new mutations are lost.
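    Those fractions can be reproduced with a small calculation. The sketch below assumes that the number of surviving offspring follows a Poisson distribution, a common idealization that the argument above does not spell out. If each child inherits the mutation with probability 1/2, the number of copies passed on is then Poisson with mean equal to half the mean number of offspring.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def copies_passed_pmf(k, mean_offspring=2.0, transmit_prob=0.5):
    """Probability that a new mutation is inherited by exactly k children.
    Assumes Poisson offspring numbers; thinning a Poisson variable by
    transmit_prob gives a Poisson with mean mean_offspring * transmit_prob."""
    return poisson_pmf(k, mean_offspring * transmit_prob)

p_lost = copies_passed_pmf(0)                              # never passed on
p_one = copies_passed_pmf(1)                               # passed to one child
p_lost_growing = copies_passed_pmf(0, mean_offspring=3.0)  # growing population

print(f"{p_lost:.3f} {p_one:.3f} {p_lost_growing:.3f}")
# 0.368 0.368 0.223
```

    Under this assumption a bit more than a third of new mutations are lost immediately in a constant-sized population, and the same fraction go to exactly one child. Raising the mean number of offspring to three (an arbitrary illustrative value) lowers the immediate loss to under a quarter.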

    In a growing population, individuals average more than two offspring. Every additional offspring increases the chance that a new mutation will be passed on to the next generation. In other words, more people means less genetic drift. As a population grows, new mutations begin to stack up at low frequencies in the population.

    This is a very basic point in population genetic theory, and it interacts in a troubling way with the current generation of sequencing technology. Short-read shotgun sequencing yields a high number of false positive mutations, which must be aggressively filtered out of whole genome data. If we don't filter these out, we will arrive at incorrect conclusions about many aspects of human biology. The simplest means of filtering require some understanding of how many rare mutations you expect to find, in particular how many should be found in only one person in a sample of people. That expectation is different in a growing population, resulting in a potentially large bias.

    Despite an improvement in the accuracy of sequencing technologies, some errors remain unavoidable. For example, with a sequencing error rate of 1 in 10,000 bases, in a sample of 10,000 individuals, each base pair will exhibit two errors on average across the sample and the majority of monomorphic sites will appear polymorphic (most often as a singleton or a doubleton; i.e., with the rare allele present in one or two copies in the sample). On the other hand, strict filtering of the data will lead to missing many rare variants because they are not observed as reliably. Hence, any analysis of large sample sizes must account for the uncertainty inherent in sequencing by considering the variant calls probabilistically, and secondary validation of rare variants by an alternate sequencing procedure is essential.
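    The numbers in that passage follow from simple arithmetic. As a sketch, assume diploid individuals, one base call per chromosome copy, and independent errors (simplifications made here for illustration, not details given in the paper):

```python
def expected_errors_per_site(error_rate, n_individuals, ploidy=2):
    """Expected number of erroneous base calls at one site across the sample."""
    return error_rate * n_individuals * ploidy

def prob_site_appears_polymorphic(error_rate, n_individuals, ploidy=2):
    """Chance that a truly monomorphic site shows at least one erroneous call,
    assuming independent errors across chromosome copies."""
    return 1.0 - (1.0 - error_rate) ** (n_individuals * ploidy)

# Values from the quoted passage: 1 error in 10,000 bases, 10,000 individuals.
errors = expected_errors_per_site(1e-4, 10_000)
p_false_poly = prob_site_appears_polymorphic(1e-4, 10_000)

print(f"{errors:.1f} expected errors; P(appears polymorphic) = {p_false_poly:.3f}")
# 2.0 expected errors; P(appears polymorphic) = 0.865
```

    With 20,000 chromosome copies, the expected two errors per site mean that most monomorphic sites would show at least one spurious variant, which matches the quoted claim.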

    Keinan and Clark present some models that show how much it matters to consider a growing population compared to the usual null model of constant population size.

    It's so interesting to me to see human geneticists catching up to where anthropologists have been for a long time. Of course, we wrote about the effects of recent population expansions in 2007, noting the apparent acceleration of positive selection in post-agricultural populations ("Why human evolution accelerated") [2].

    Large-scale sequencing projects have moved beyond simply categorizing common genetic variation. They are now at a stage where thousands of individuals need to be examined, to find increasingly rare genetic variations and determine their collective effects on phenotypes. That means that the next version of the 1000 Genomes Project really needs to involve many of us who are directly concerned with human population history. The growth and dynamics of actual historic human populations are going to matter to how we understand their genetic variation and its effects on phenotypes. Fortunately, archaeology and written history can help -- if anthropologists are involved in this work from the start!


    References

    1. Keinan A, Clark AG. Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science. 2012;336(6082):740–743.
    2. Hawks J, Wang ET, Cochran G, Harpending HC, Moyzis RK. Recent acceleration of human adaptive evolution. Proceedings of the National Academy of Sciences, U.S.A. 2007;104:20753–20758. Available from: http://dx.doi.org/10.1073/pnas.0707650104
    Synopsis: 
    Human genetics has reached the point where population history is essential to further progress
  • Finding the scary genes

    Wed, 2012-03-07 21:39 -- John Hawks

    John Lauerman reports in BusinessWeek on his experience participating in the Personal Genome Project:

    “This is probably the most serious variant that we’ve actually seen to date in the study,” Thakuria said. About two out of 1,000 people have the JAK2 variant, which encourages blood cells to grow and divide. The variant is used to diagnose three rare blood disorders, including primary myelofibrosis, which is potentially lethal. “I don’t want you to fret about this,” Thakuria said, before giving me fresh cause for worry: a study, published in 2010, in which 10,507 people in Copenhagen gave blood samples and were followed for as long as 18 years. The Copenhagen researchers went back and analyzed the blood samples: 18 had the JAK2 variant; 14 of those 18 with the variant developed cancer in their lifetimes, and all 18 died within the study period. How, exactly, was this helping?

    Finding that you carry a harmful genetic variant, and that there's nothing you can do about it, is probably the most frightening outcome when obtaining your personal genetic information. Some say they would rather not know about such genes.

    Several others have commented on Lauerman's piece, including Matthew Herper at Forbes, and the 23andMe blog. Naturally, they have different takes.

  • Genotyping the intro class

    Fri, 2012-02-24 00:26 -- John Hawks

    Holly Dunsworth, at the University of Rhode Island, is undertaking a unique project with her undergraduate course this semester, providing 23andMe genotyping for every student. She describes some of her thoughts on the "cans of worms" that this may create for her: "First we were snapped, now we're SNP'd".

    Part of what students have to do this semester is form a 'plan of action.' That's what I've called their assignment where they predict what their SNPs will hold and where they explain what they will do if they find out they're at high risk for a disease or even, yes, that they might not be related to their father. (This discovery doesn't require paternal DNA. Since half of your genome is from your father, and since a few traits are pretty simple, the rare participant with the rare SNP can deduce that they did not get their DNA from their father who doesn't show the trait in question.)

    I've been discussing this issue a lot with people lately. In a few years, most of my students will have whole-genome genotyping or sequencing done for routine medical purposes, because that's how cheap it will be. Interpretation of the results will not (necessarily) be cheap, and it may not be appreciably better than it is now.

    Some readers will say, "Well, if the interpretation isn't a lot better than now, nobody will want the results anyway, so it won't happen."

    I disagree. Take Mendelian disorders. Today, every child in Wisconsin is tested for a few dozen genetic disorders at birth. It is already possible to screen parents for every Mendelian disorder with a frequency of more than one in a thousand. In a short time, that genotyping will be cheaper than the current postnatal testing. Prenatal care already includes a score of tests, and fetal cell genotyping may eventually replace postnatal testing for genetic disorders. Moreover, companies (for example, Counsyl) are already providing genotyping and interpretive services for the couples prenatal testing market. As genotyping becomes cheaper, it will pull in a broader and broader fraction of my students, future college graduates and professionals.

    So I've taken it as my attitude that my biological anthropology courses must educate them for this future. Our curricula can provide the students useful information about health and ancestry, including both the promise and limits of genetic information. The beauty of the new genetic approaches is that they provide better illustrations of most of the classic topics in human biology and variation. You can see some of that at play in my Principles of Biological Anthropology lectures this semester.

    Synopsis: 
    Holly Dunsworth shares some perspective on 23andMe testing for her students
  • The Mayflower criminal registry

    Fri, 2012-01-13 22:25 -- John Hawks

    Of some interest with respect to DNA databases and privacy concerns: "DNA links 1991 killing to Colonial-era family".

    The DNA sample was taken in the death of 16-year-old Sarah Yarborough, who was killed on her high school campus in Federal Way, Washington, in December 1991. The King County Sheriff's Office has circulated two composite sketches of a possible suspect -- a man in his 20s at the time with shoulder-length blonde or light brown hair -- but had been unable to put a name to the sketch.

    In December, though, the department sent the DNA profile to California-based forensic consultant Colleen Fitzpatrick. Fitzpatrick compared the profile to others in genealogy databases and found the closest match was to the family of Robert Fuller, who settled in Salem, Massachusetts, in 1630 and had relatives who came over before him on the Mayflower.

    This is a Y chromosome match based on the genealogical research of people who may be completely unknown to the "suspect". Fitzpatrick offers that a Y-chromosome match may be expected to share a surname, which is probative in the forensic situation. Obviously there are many possible scenarios in which such information will not lead to discovery of a suspect: the chance of non-acknowledged paternity events across 200 years is very high. I don't view the result as strongly actionable, but I do think it raises important questions about the future of genealogy databases.

    We are near the time when whole-genome sequencing will make this kind of identification much more likely because unique genetic matches to 3rd and 4th degree relatives will be plausible. Finding a handful of rare mutations shared between a crime scene sample and an individual in a whole-genome database would be a strong indication of a relationship. It's possible that the databases for whole genomes will grow faster than the technology will allow reliable whole-genome sequencing from a crime scene sample. So in this case, the issues with database use may be primary.

    It would be an interesting exercise to estimate the fraction of unknown samples from crime scene Y chromosome and mtDNA that could be matched to a 10th-degree relative in the Genographic (or any other large) dataset.

  • Sequencing is outpacing computing

    Wed, 2011-11-30 23:36 -- John Hawks

    The New York Times notices DNA sequencing's Malthusian trap: "DNA sequencing caught in deluge of data."

    That is a decline [in sequencing costs] by a factor of more than 800 over four years. By contrast, computing costs would have dropped by perhaps a factor of four in that time span.

    The lower cost, along with increasing speed, has led to a huge increase in how much sequencing data is being produced. World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high, according to Michael Schatz, assistant professor of quantitative biology at the Cold Spring Harbor Laboratory on Long Island.

    I have spoken with several scientists in other fields, like astronomy and particle physics, who deal with truly big datasets. Until now, biology data has actually been pretty small potatoes compared with the sheer amount pumped out by large projects in other fields. But that's changing. The Times article points out a unique aspect of the data problem in genetics: There are now thousands of labs that can generate large datasets, many of which have no special plan for data archiving or availability.

    “Google has enough capacity to do all of genomics in a day,” said Dr. Schatz of Cold Spring Harbor, who is trying to apply Google’s techniques to genomics data. Prodded by Senator Charles E. Schumer, Democrat of New York, Google is exploring cooperation with Cold Spring Harbor.

    Google’s venture capital arm recently invested in DNAnexus, a bioinformatics company. DNAnexus and Google plan to host their own copy of the federal sequence archive that had once looked as if it might be closed.

    I don't see Google as a deus ex machina for this one -- although I do observe that several other big data projects are sponsored by large Microsoft investors or founders.

