john hawks weblog

paleoanthropology, genetics and evolution

1000 Genomes Project

  • Human population history makes a difference

    Thu, 2012-05-10 16:18 -- John Hawks

    Alon Keinan and Andrew Clark have a short report in the current Science examining the effects of recent human population growth on the expected spectrum of human genetic variation [1]. Population growth skews the variation in a population so that there are many more rare alleles than would be expected in a constant-sized population.

    Why is this? In a constant-sized population, individuals have an average of two offspring who survive to have offspring of their own. Many people have no children at all, or only one, while only a small proportion of people have more than four children. In the constant-sized population, a person born with a new mutation would have a 50% chance of passing it on to each child. In such a population, more than a third (36%) of mutations aren't passed on even once. The same fraction are inherited by only one child, and these face the same odds of extinction in the next generation. This isn't natural selection, it is random genetic drift -- and its net result is that most new mutations are lost.
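
    The loss probabilities above can be sketched in a few lines. This is only an illustration: it assumes the number of surviving offspring is Poisson-distributed (an assumption of the sketch, not something stated above), and the function name and the growth value of three offspring are hypothetical. Under that assumption the chance of immediate loss comes out close to the figure quoted above.

```python
import math

def prob_copies_passed(k, mean_offspring=2.0, transmit=0.5):
    """P(a new mutation is inherited by exactly k children).

    Assumes a Poisson number of surviving offspring (assumption of this
    sketch). Thinning by the 50% transmission chance then gives a Poisson
    count of carrier children with mean mean_offspring * transmit.
    """
    lam = mean_offspring * transmit
    return math.exp(-lam) * lam ** k / math.factorial(k)

constant = prob_copies_passed(0)                      # constant-sized population
growing = prob_copies_passed(0, mean_offspring=3.0)   # hypothetical growing population
print(f"lost at once, constant size: {constant:.3f}")
print(f"inherited by one child:      {prob_copies_passed(1):.3f}")
print(f"lost at once, growing:       {growing:.3f}")
```

    With a mean of two offspring, the chance of no carrier children equals the chance of exactly one, and raising the mean number of offspring lowers the chance of immediate loss.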

    In a growing population, individuals average more than two offspring. Every additional offspring increases the chance that a new mutation will be passed on to the next generation. In other words, more people means less genetic drift. As a population grows, new mutations begin to stack up at low frequencies in the population.

    This is a very basic point in population genetic theory, and it interacts in a troubling way with the current generation of sequencing technology. Short-read shotgun sequencing yields a high number of false positive mutations, which must be aggressively filtered out of whole genome data. If we don't filter these out, we will arrive at incorrect conclusions about many aspects of human biology. The simplest means of filtering require some understanding of how many rare mutations you expect to find, in particular how many should be found in only one person in a sample of people. That expectation is different in a growing population, resulting in a potentially large bias.
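
    To make the "expected number of rare mutations" concrete, here is a minimal sketch of the constant-size baseline. It uses the standard neutral result that the expected number of sites with a derived allele in i copies is proportional to 1/i; it is not one of Keinan and Clark's growth models, and the function name is hypothetical.

```python
def expected_sfs_fractions(n_chromosomes):
    """Expected share of segregating sites at each derived-allele count
    (1 .. n-1) under the standard neutral, constant-size model, where the
    expected number of sites with count i is proportional to 1/i."""
    weights = [1.0 / i for i in range(1, n_chromosomes)]
    total = sum(weights)
    return [w / total for w in weights]

for n in (20, 200, 2000):
    singleton_share = expected_sfs_fractions(n)[0]
    print(f"n = {n:5d} chromosomes: expected singleton share = {singleton_share:.3f}")
```

    A filter tuned to this baseline would treat any excess of singletons as error, which is where the bias enters if the population has in fact been growing.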

    Despite an improvement in the accuracy of sequencing technologies, some errors remain unavoidable. For example, with a sequencing error rate of 1 in 10,000 bases, in a sample of 10,000 individuals, each base pair will exhibit two errors on average across the sample and the majority of monomorphic sites will appear polymorphic (most often as a singleton or a doubleton; i.e., with the rare allele present in one or two copies in the sample). On the other hand, strict filtering of the data will lead to missing many rare variants because they are not observed as reliably. Hence, any analysis of large sample sizes must account for the uncertainty inherent in sequencing by considering the variant calls probabilistically, and secondary validation of rare variants by an alternate sequencing procedure is essential.
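
    The arithmetic in that passage can be checked directly. The sketch below assumes diploid genomes with one base call per chromosome copy and independent errors; both are simplifying assumptions of the sketch, not claims from the paper.

```python
error_rate = 1e-4          # per-base error rate, from the quoted passage
individuals = 10_000       # sample size, from the quoted passage
copies = 2 * individuals   # assumption: diploid genomes, one base call per copy

expected_errors = error_rate * copies
# Chance that a truly monomorphic site shows at least one erroneous call,
# assuming errors are independent across copies.
p_appears_polymorphic = 1.0 - (1.0 - error_rate) ** copies

print(f"expected errors per site:  {expected_errors:.1f}")
print(f"P(site looks polymorphic): {p_appears_polymorphic:.3f}")
```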

    Keinan and Clark present some models that show how much it matters to consider a growing population compared to the usual null model of constant population size.

    It's so interesting to me to see human geneticists catching up to where anthropologists have been for a long time. Of course, we wrote about the effects of recent population expansions in 2007, noting the apparent acceleration of positive selection in post-agricultural populations ("Why human evolution accelerated") [2].

    Large-scale sequencing projects have moved beyond simply categorizing common genetic variation. They are now at a stage where thousands of individuals need to be examined, to find increasingly rare genetic variations and determine their collective effects on phenotypes. That means that the next version of the 1000 Genomes Project really needs to involve many of us who are directly concerned with human population history. The growth and dynamics of actual historic human populations are going to matter to how we understand their genetic variation and its effects on phenotypes. Fortunately, archaeology and written history can help -- if anthropologists are involved in this work from the start!


    References

    1. Keinan A, Clark AG. Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science. 2012;336(6082):740–743.
    2. Hawks J, Wang ET, Cochran G, Harpending HC, Moyzis RK. Recent acceleration of human adaptive evolution. Proceedings of the National Academy of Sciences, U. S. A. [Internet]. 2007;104:20753–20758. Available from: http://dx.doi.org/10.1073/pnas.0707650104
    Synopsis: 
    Human genetics has reached the point where population history is essential to further progress
  • When genes break: validating loss-of-function variants

    Fri, 2012-02-17 12:20 -- John Hawks

    Daniel MacArthur and colleagues have an important paper in Science, titled "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes" [1]. They took 1000 Genomes Project pilot data and systematically looked at every allelic variant in the sample that appeared to cause the loss of function of a protein-coding gene. Mutations that de-activate genes in this way are not rare, but they are often eliminated from the population rapidly by purifying natural selection, because the normal function of a protein is necessary to survival or reproduction. However, not every protein is so important, and MacArthur and colleagues confirmed that more than 1200 alleles in this sample genuinely occur in one or more of the 1000 Genomes Project individuals.

    Some of these are common but most occur in fewer than 2% of individuals in the sample, as expected if purifying selection were affecting many of them.

    MacArthur is one of the authors of the Genomes Unzipped group blog, and has written a great summary and introduction to his research paper: "All genomes are dysfunctional: broken genes in healthy individuals". It's free and well-written, so it will probably work better for many readers than the original paper.

    Science is running a commentary to accompany the research article, by Lluis Quintana-Murci [2]. The following paragraph from it covers many of the numerical facts about these loss-of-function variations, and discusses the idea that some of them were positively selected -- that is, advantageous in recent human populations.

    MacArthur et al. estimated that, depending on ethnic background, each individual's genome carries 26 to 37 variants that introduce a stop codon (which signals the termination of translation of nucleic acids into protein), with up to 6 present in the homozygous state. When considering other types of LoF variants, including those that disrupt splice-sites, large deletions, or insertions or deletions of nucleotides that change the DNA reading frame, the total number per individual is extended to 103 to 121, with ∼20 present in homozygosity. A large proportion of LoF variants were enriched in low-frequency alleles, suggesting that the removal of deleterious alleles has prevented them from increasing to high frequencies. Furthermore, some have already been associated with severe human diseases, supporting the less-is-less hypothesis. Other LoF variants, which can reach higher population frequencies, fall into poorly evolutionarily conserved genes or belong to multigene families displaying high paralogous sequence identity. This suggests that the functions of the corresponding genes are highly redundant, explaining their greater tolerance for LoF variants and supporting a less-is-nothing scenario. Also, although no substantial enrichment in positive selection signals was observed among LoF variants at the genomewide level, 20 of them fell into regions displaying signatures of positive selection, as predicted by the less-is-more hypothesis, suggesting that they may have conferred a selective advantage in human evolution.

    Common loss-of-function variants that are evolutionarily recent are very interesting to us as we work to understand the changes that accompanied modern human origins and the later invention and spread of agriculture. I am really excited that these analyses were carried out using the 1000 Genomes samples because that means we can use the sequence data to estimate the ages of these functional losses. We can do quite a lot better than to say that they "fall into regions displaying signatures of positive selection": In fact, we can determine whether these variants themselves were selected, or hitchhiked to high frequency along with some other variant that was selected.

    Many of the loss-of-function variants are in genes that may not matter much to selection. Olfactory receptor genes, for example, comprise a very large family with recurrent duplications and pseudogenizations during primate evolution. We have scores of olfactory receptor pseudogenes, many of which are polymorphic in living human populations. Some may continue to make a noticeable difference to the phenotype, such as the asparagus-urine-smelling polymorphism. But many are probably invisible to us. Still, a few of these do look like they've been positively selected in recent human populations.

    Sometimes less really is more.


    References

    Synopsis: 
    A "punishing" resequencing project validates mutations in the 1000 Genomes Project individuals that deactivate protein-coding genes.
  • Mailbag: Neandertal genes across the Strait of Gibraltar

    Sun, 2012-02-12 22:03 -- John Hawks

    Re: Neandertal gene variants in Yoruba:

    If you think in terms of ice-age climate, with sea-level about 150 ft lower than at present and the Mediterranean regularly covered by thick arctic-like ice in winter, it is easy to imagine early humans making their way back and forth over an ice-covered strait of Gibraltar or along an ice-free coastal strip connecting western Europe with West Africa. I think the discovery of relatively large number of neanderthal genes in West African tribes like the Yoruba is one of those unexpected and unpredicted facts that on further reflection makes a lot of sense, justifying the statistical analysis used. After all, if a statistical algorithm only shows what's expected, you have to wonder whether all it's done is to give a statistical excuse for what's already believed to be true.

    Indeed, I think this is a possible explanation. On the other hand, there is just as much danger of post hoc generalization the other way!

    Testing that hypothesis will require some more sophisticated estimates of the ages of particular gene regions that have been inherited from Neandertals in West African populations.

  • Which population in the 1000 Genomes Project samples has the most Neandertal similarity?

    Wed, 2012-02-08 01:14 -- John Hawks

    Last December I began writing about an analysis of introgression in the 1000 Genomes Project samples ("Neandertal introgression, 1000 Genomes style"). I left everybody in a bit of suspense, partly because my writing computer was unexpectedly replaced before winter vacation, and partly because of my extensive travel in January.

    I'm catching up this week before I go to Ann Arbor, Michigan next week for a talk and visit with many friends. It's a good time to give readers some status updates on the analyses because the release of the high-coverage Denisova genome today will allow us to do some very deep checks on some of the comparisons we've carried out.

    Picking up where I left off, in the last post I emphasized that the individual genomes represented in the 1000 Genomes Project samples in Europe and East Asia have a surplus of derived SNP alleles that they share with the Vindija Vi33.16 genome. That surplus compared to genomes in the African population samples represents the evidence for Neandertal ancestry in those populations.

    Comparison of shared Neandertal derived variants in African, Chinese and European samples

    Admixed populations, including African-Americans and Puerto Ricans, shared Neandertal derived SNP alleles in the fraction expected for their African and non-African fractions of ancestry.

    Comparison of shared Neandertal derived variants in ASW, YRI and CEU samples

    As I also pointed out, the population samples in Europe and East Asia are not identical in the number of these shared derived variants. The difference between individuals can be caused by differences in the fraction of their genealogy that traces to Neandertals. The difference may also be caused by other aspects of the individuals' genealogy, if for example some aspect of population history has led to discrepancies in the fraction of ancient variations these people share with a Neandertal genome by incomplete lineage sorting.

    Here is the comparison of East Asian samples (Japanese, Han Chinese in Beijing, and Han Chinese originating in South China) and European samples (Tuscan, British, Finnish and CEU samples, along with a handful of Spanish):

    Comparison of shared Neandertal derived variants in East Asian and European 1000 Genomes Project samples

    The Europeans average a bit more Neandertal than Asians. The within-population differences between individuals are large, and constitute noise as far as our comparisons between populations are concerned. At present, we can take as a hypothesis that Europeans have more Neandertal ancestry than Asians. If this is true, we can further guess that Europeans may have mixed with Neandertals as they moved into Europe, constituting a second process of population mixture beyond that shared by European and Asian ancestors.

    As we look more closely at the particular gene regions shared between each individual and the Neandertal, we will be able to consider the approximate time that they shared an ancestor for each gene region. That will allow us to distinguish incomplete lineage sorting (ILS) from introgression, although the two will overlap to some extent. We will rely on that test to examine hypotheses about the time and place of population mixture.

    The difference between Europeans and Asians when we lump all the samples together is not as interesting as the differences we can see among the samples within each of those regions. For example, here are British people compared to Tuscans:

    Comparison of shared Neandertal derived variants in British and Tuscan samples

    The Tuscans have the highest level of Neandertal similarity of any of the 1000 Genomes Project samples. They have around a half-percent more Neandertal similarity than Brits or Finns in these samples. The CEU sample is slightly elevated compared to Brits and Finns as well.

    It is tempting to interpret these differences as a north-south cline in Neandertal ancestry. I wouldn't jump too quickly on this idea, because Holocene population movements in Europe are now known to have covered up or erased a substantial fraction of the Upper Paleolithic gene pool. If we have a bonus of extra Neandertal ancestry in southern Europe, we need to explain how that cline persisted across subsequent history. Still, the difference is statistically very strong and deserves some explanation.

    Likewise, the populations within East Asia have some differences in Neandertal similarity. Here is the comparison of Han Chinese, with the Beijing versus South China origins separated out:

    Comparison of shared Neandertal derived variants in CHB and CHS samples

    North China has a bit more Neandertal, on average, than South China according to these samples. These are all identified as ethnic Han Chinese, so I expect that the comparison would be much more interesting if some minority populations had been examined. The "cline" here seems opposite in direction compared to the European case. I can add that the Japanese sample is largely intermediate between the CHB and CHS, with an average closer to the Beijing sample.

    If there was one thing that surprised me in the comparisons, it was this:

    Comparison of shared Neandertal derived variants in Luhya and Yoruba samples

    Yoruba have substantially more Neandertal similarity than Luhya. This may seem counter-intuitive, because the geographic location of Luhya in East Africa might seem better placed for Neandertal similarity to appear, whether through ancient population structure and ILS or through recent gene flow or backmigration into Africa of Neandertal descendants.

    Instead, it looks like the Yoruba are the recipients of Neandertal genes, whether by means of ancient population structure or introgression and recent trans-Saharan gene flow. I personally think both factors are involved, but again their relative importance will be determined by comparing individual gene regions.

    In this vein, it is useful to outline the hypothesis of differential ILS within African samples. We now know from examination of genetic variation within Africa today that some of today's diversity can be traced to ancient population structure in Middle Pleistocene African populations. For example, Neandertals could be more closely related to some African populations than others today because Neandertals actually exchanged genes with some ancient African populations. Or Neandertals might have sprung from one African population among many who lived 250,000 years ago. If some of these ancient populations persisted and contributed genes to different present-day African populations, those populations would share different fractions of genes with a Neandertal genome.

    I expect we will learn a substantial amount about African population structure in the MSA by using these Neandertal-similar regions of the genome. It's like having a probe that can trace the movement of people across Africa more than 100,000 years ago. As we combine the archaic genome data with our growing picture of diverse lineages in Africa today, we may discover ancient populations that are not apparent archaeologically. Again, genetics is giving us a totally new picture of the diversity and population dynamics of ancient people.

    Next: Which Neandertal-derived variants are shared between regions, and which are unique to one region? I touched on this question last spring by using genotype data. Now, we have sequences capable of telling us much more.

    Synopsis: 
    Europe has a touch more Neandertal than East Asia; Tuscans have more than any other European sample
  • Watch who you call "extinct"!

    Wed, 2011-10-26 00:29 -- John Hawks

    Sometimes people wonder why human genetics projects should bother to involve anthropologists.

    From now on, this seems like a good example: "Rebuilding the genome of a hidden ethnicity".

    CORRECTED: This article originally stated that the Taíno were extinct, which is incorrect. Nature apologizes for the offence caused, and has corrected the text to better explain the research project described.

    The news article reports on a conference talk by Carlos Bustamante, who is working on the population genetics of the 1000 Genomes Project samples. The project includes whole-genome sequencing data from 70 research subjects from Puerto Rico, many of whom have a substantial fraction of ancestry from the native peoples of the Caribbean, chiefly Taíno. There are more than 4 million Puerto Ricans today, both on the island and throughout the United States, and their ancestry averages around 15% Native American. Genetically, that works out to 1.2 million copies of a typical gene derived from indigenous peoples, of course scattered in different ways across the genomes of Puerto Rican people today. That's a lot of information, and Bustamante and colleagues are using the information to test hypotheses about the ancestry and pattern of native ancestry in these people.
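
    The 1.2 million figure follows from simple multiplication. As a quick check, assuming an autosomal gene with two copies per person and treating "more than 4 million" as a floor:

```python
population = 4_000_000   # "more than 4 million" in the text, so a lower bound
copies_per_person = 2    # assumption: an autosomal gene, two copies each
native_fraction = 0.15   # average Native American ancestry quoted in the text

native_copies = population * copies_per_person * native_fraction
print(f"{native_copies:,.0f} indigenous-derived copies of a typical gene")
```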

    The news coverage of the talk ran into trouble by describing the Taíno as an "extinct ethnicity". What happened next won't be a surprise to any anthropologist who works in the Caribbean. Over the course of a weekend, the comment section of the Nature news article was filled by people outraged at the description of their ancestors as "extinct". Many identified themselves as Taíno people, protesting an injustice.

    The communication failure here is obvious. A presentation that refers to descendants of an ancient population ought to use terms that are anthropologically valid. Here we have two words that provoked confusion and anger: "extinct" and "Taíno".

    "Extinct" just is not a term that should apply to the ancestors of living people. Whatever the dictionary may say, to an ordinary reader or listener, the closest association of "extinct" is probably "dinosaurs". Extinction without issue. Even when we refer to cultural practices, the term "extinct" invites confusion. Extinction implies a model of disappearance that is sudden and complete, which in many cultural contexts didn't happen.

    "Taíno" is a contested cultural category. A growing group of people living on many Caribbean islands today claim Taíno identity, not merely Taíno ancestry. Some cultural practices derived from pre-Columbian Taíno people are today still widespread, among people who may have no strong belief about their ancestors 500 years ago. The movement toward greater self-identification as Taíno has emerged within an international population. Any discussion of Taíno ancestry ought to be framed in terms of the living people today who have that ancestry. Some of them may have a small fraction of Taíno ancestry but still self-identify in that category; others have never self-identified in that way, a few of whom might even be horrified at the prospect.

    Genetic observations themselves have contributed greatly to the revival of the concept of Taíno identity. By demonstrating the high fraction of indigenous ancestry in Caribbean people, genetics has provided something more "real" to people than their cultural ties may seem. Past studies of admixture in the Caribbean were hailed by activists as "scientific proof" that the Taíno still exist. That is one of the anthropological problems: the geneticists are not neutral players in this social milieu, even if they have no commitment to any possible result.

    In my opinion, the 1000 Genomes Project participants are the good guys. The scientists directing the project have given a lot of thought to their selection of samples, funded workshops to discuss ethical issues that arise from sampling and analysis, and even came up with boilerplate language so that their hundreds of postdocs have a standard way to refer to the different sample groups. The project has created tremendous value for those of us who study the range of human diversity and human origins.

    Some of the project scientists have worked to explain why it is important to encompass human diversity within large-scale sequencing projects (for example, a recent paper by Bustamante and colleagues [1]). Genetic studies of human populations have been strongly biased toward European populations, and secondarily toward populations from other parts of the world that are well-represented by immigrant communities within the United States and Western Europe. The bias means that we don't understand as much as we should about the relationship between genetics and health in other populations of the world. Rare variations, some of which contribute to disease risk or protection, are missed by our current samples -- even though in some cases more samples could be added at minimal cost.

    My point is that there are really good intentions behind the project, and from an NIH-centric perspective, the project attempted to be inclusive. But competing ideas of identity make human genetics a difficult area where miscommunication is inevitable. Categories that a human geneticist may think are perfectly clear, an anthropologist will tend to be more wary about.

    I saw the story on Gene Expression, where Razib Khan provides good commentary along the lines of my reactions. I would add that cases like this one add a deeper dimension to the usual kind of science miscommunication. People are sometimes very selective about the science they choose to believe. Probably in no cases are people so selective as when the outcome concerns their own identity.

    A great power of today's genetic technology is the opportunity it presents to allow people to discover their ancestry. But that power is easily twisted into a license to impose identity. When different groups have motives to construct genetic identity, then genetics becomes a powerful tool for each group to proselytize its particular version of cultural identity.

    Anthropologists are already engaged in this problem, in different parts of the world. Yet they are minor players. As we see in this article, the geneticists have large voices. Those voices are heard rapidly by activists of various kinds, who have extremely high levels of engagement with broader communities. Taíno and Nature are both obscure to most Americans, but within 72 hours one of those groups mobilized and forced a response from the other, in a way that will have a large impact on future scientific and news reporting.


    References

    Synopsis: 
    A news article on the genomics of Puerto Rican descendants of Taíno peoples runs into hot water.
  • More on the mutation rate

    Mon, 2011-07-18 04:43 -- John Hawks

    I've received several questions over the last few weeks about human genome-wide mutation rates. Some people are noticing heterogeneity in mutation rate estimates among family trios (spurred by a recent paper from the 1000 Genomes Project) while others are asking about apparent contradictions between estimates from pedigree-based methods and those based on phylogenetic comparisons with other primates (see, for instance, Dienekes' discussion of the recent paper by Li and Durbin [1]).

    I wrote a very extensive and referenced post last fall about this issue, and I just want to bring it to people's attention: "What is the human mutation rate?"

    The 1000 Genomes Project has adopted the low per-generation mutation rate that has been coming out of the family trio comparisons. This low rate is around 1.2e-8 per site per generation as opposed to the estimate of around 2.4e-8 per site per generation that was often used prior to last year. Several new or upcoming papers will use the lower rate as applied to comparisons in humans or other hominoids.

    I'll just point out two conclusions I arrived at last fall:

    1. The 1000 Genomes comparisons are not very strong evidence in favor of a low rate. There is too much error in the sequences, and the means of filtering errors may affect the rate estimation. Much stronger evidence comes from pedigree-based comparisons of de novo Mendelian diseases, which encompass tens of thousands of mutational events instead of a few dozen. These also suggest a low rate -- in particular Michael Lynch's work from early 2010 [2]. This work also demonstrates that different sequence contexts give rise to different effective rates of mutation.

    2. The higher rate based on phylogenetic comparisons was always based on circular reasoning. People applied a rate that would fit the observed sequence differences to some paleontological event. Logically, the fossil appearance of an extant lineage puts a minimum time on the divergence of that lineage from others; but geneticists typically assumed that this was the expected time of sequence divergence, not the minimum possible time of species divergence. These two dates may easily differ by a factor of two, given the quality of the hominoid fossil record. Sequence divergence must always precede speciation, and speciation must always precede the earliest fossil occurrence of a lineage. The paleontological dates were then often bootstrapped from estimated mutation rates. The famous "6 million year human-chimpanzee divergence" was always based on these faulty assumptions -- that we knew with exactitude the human-orang or human-macaque sequence divergence time, and that the sequence divergence time between humans and chimpanzees was identical to the speciation time of the two lineages.

    I've had several conversations with people about this issue during the past year. Some of them take it very seriously, others don't. Myself, I see that the lower rate simplifies many problems with the fossil record and comparisons of archaic genomes, but creates some others. For this reason, I'm cautious about it.


    References

    1. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature [Internet]. 2011;475:493–496. Available from: http://dx.doi.org/10.1038/nature10231
    2. Lynch M. Rate, molecular spectrum, and consequences of human mutation. Proceedings of the National Academy of Sciences [Internet]. 2010;107:961–968. Available from: http://dx.doi.org/10.1073/pnas.0912629107
    Synopsis: 
    The 1000 Genomes Project is on the verge of demonstrating a lower mutation rate in humans. Should we believe it?
  • What is the human mutation rate?

    Thu, 2010-11-04 01:33 -- John Hawks

    Last spring I wrote about a study that used whole-genome comparisons between parents and offspring to estimate the rate of per-genome mutation in humans ("A low human mutation rate may throw everything out of whack").

    The study was by Jared Roach and colleagues [1], and as you might guess from my post title, the result was surprising. Previous work had suggested a human mutation rate around 2.5 x 10^-8 per site per generation. The new study found less than half the expected number of mutations between these parents and offspring, an estimated rate of only 1.1 x 10^-8 per site.

    If this lower rate of mutation were to hold up, it would affect much of our understanding of the chronology of human evolution. Fossils and archaeological sites would not change in date, but some hypotheses about their relationships would be challenged. For example, the higher rate of 2.5 x 10^-8 per site suggests a chimpanzee-human population divergence around 4 million years ago. A new rate of 1.1 x 10^-8 would not have a linear effect on this divergence time -- the genes don't have genealogical roots at the same instant as the population divergence. But the human-chimpanzee divergence time would be radically older than in many recent estimates.
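
    For the sequence (gene genealogy) divergence, as opposed to the population divergence, the scaling is simple to sketch. The example below assumes a constant per-generation rate along both lineages; the function name and the per-site difference of 0.012 are hypothetical placeholders used only to show the scaling, not measured values.

```python
def sequence_divergence_generations(divergence_per_site, mu_per_generation):
    """Generations back to the common ancestor of two sequences, assuming
    a constant rate: differences accumulate along both lineages, so
    d = 2 * mu * t and t = d / (2 * mu)."""
    return divergence_per_site / (2.0 * mu_per_generation)

d = 0.012  # hypothetical per-site difference between two sequences, illustration only
old = sequence_divergence_generations(d, 2.5e-8)
new = sequence_divergence_generations(d, 1.1e-8)
print(f"generations at the high rate: {old:,.0f}")
print(f"generations at the low rate:  {new:,.0f}")
print(f"ratio: {new / old:.2f}")
```

    Whatever value of d is used, the ratio depends only on the two rates. The population divergence is more recent than the sequence divergence by an amount that depends on ancestral population size, which is why it does not scale by the same factor.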

    The same might be true for other primate divergences, and for genealogical relations within human populations today. Basically any times that are estimated from genetic differences may be affected by our knowledge of the per-generation rate of mutations.

    What does this mean? Open below the fold to read more.

    What mutations are we counting?

    Human genomes differ from each other in many ways. There are single base-pair changes in sequences, insertions and deletions, repeat polymorphisms, and larger-scale rearrangements such as inversions and gene duplications. Recent work suggests that some of these larger-scale effects may be very important to phenotypic variation among people. So why should we be talking about only the first of these kinds of variation?

    Single nucleotide mutations have been the focus of most attention about mutation rates because they are relatively easy to find and quantify. In high-quality sequence data, a single nucleotide change is relatively unambiguous. Reversals are fairly unlikely, although at a small fraction of "hotspot" sites, recurrent mutations can make a big difference.

    It is somewhat misleading to refer to "a" rate of single nucleotide mutations, because some kinds of sites (e.g., CpG nucleotides) have had a much higher probability of mutations than others. This affects the apparent rate of mutations in noncoding versus synonymous sites [2]. Also, the germline in males has been estimated to be as much as 6 times more likely to suffer mutations than the germline in females (discussed by Crow [3]). The idea of a genome-wide rate assumes that when we bin all the single nucleotide mutations together, across large amounts of sequence, we do arrive at a relatively stable rate that can be applied to similarly broad extents of sequence data. Or at least that we can identify sequence regions with compatible rates (e.g., noncoding DNA or synonymous sites).

    At the moment, technical issues make it hard to find and quantify many other kinds of variation. The current generation of sequencing devices tends to generate short reads, which make it difficult to assess the presence of insertions or deletions of more than a few base pairs. Duplications and other rearrangements require special treatment such as higher coverage or longer sequence reads. By contrast, a single nucleotide mutation will typically align in the proper location and be quite evident in a read. In principle, we can just run down the genome and count them.

    Still, finding novel mutations is not without its problems. Recent sequencing projects have yielded a very high rate of false positives. The rate of false negatives is really not yet known. We have a good reason to suspect that the false negative rate will be high. In a low-coverage genome, many short segments of the genome will have very low read numbers, making it likely that the sequence reads represent only one of the two copies of the genome present at that location. Any novel mutations in that area have a 50-50 chance of being missed by our sequencing efforts. This false negative risk can be reduced by adding higher sequence coverage, but we're not yet at the point where we have a lot of genomes sequenced at the 10x or higher coverage that we would really want.
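
    To put a rough number on that false negative risk, here is a minimal sketch. It assumes read depth at a site is Poisson-distributed around the mean coverage, that each read samples one of the two copies with equal probability, and that a single read is enough to see an allele. Those are simplifying assumptions of mine, not a description of any real variant-calling pipeline, and real callers need several supporting reads, so the true miss rate would be higher.

```python
import math

def prob_miss_het(depth: int) -> float:
    """Chance that none of `depth` reads comes from the copy carrying the new mutation."""
    return 0.5 ** depth

def expected_miss_rate(mean_coverage: float) -> float:
    """Average of prob_miss_het over Poisson-distributed read depth.

    sum_n Poisson(n; c) * 0.5**n simplifies to exp(-c / 2).
    """
    return math.exp(-mean_coverage / 2.0)

for coverage in (2, 4, 10, 40):
    rate = expected_miss_rate(coverage)
    print(f"{coverage:>2}x coverage: {rate:.4f} of new heterozygous mutations never sampled")
```

    Under these assumptions the unsampled fraction falls off exponentially with coverage, which is why the difference between low-coverage genomes and 10x or higher matters so much here.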

    So while sequencing a parent and offspring genome is the most direct way to estimate the per-generation mutation rate, it is not yet ideal.

    Where did the high rate come from?

    That means we need to look very closely at other sources of data, to see if they may provide some independent confirmation of a lower per-generation mutation rate. In the process, we should ask, why did the higher rate, around 2.5 x 10^-8 per generation, become so widely accepted?

    The source cited by Roach and colleagues for the higher rate, 2.5 x 10^-8 per site, is a paper by Michael Nachman and Susan Crowell [4]. Nachman and Crowell examined processed pseudogenes in humans and chimpanzees, under the assumption that mutations in these pseudogenes would be neutral to selection in the human and chimpanzee lineages.

    The average mutation rate was calculated from the average autosomal rate of evolution assuming a generation time of 20 years (Table 3). Recent estimates of the time since humans and chimpanzees diverged (T) include 4.5 mya (TAKAHATA and SATTA 1997), 5.5 mya (KUMAR and HEDGES 1998), and 6.0 mya (GOODMAN et al. 1998). ARNASON et al. 1998 estimated the Homo-Pan divergence at 10–13 mya; however, their estimate is based on a calibration using distant, nonprimate species and is at odds with most other recent estimates. Mutation rates were calculated for a range of different human-chimpanzee divergence times and for two different ancestral population sizes. Mutation rate estimates vary from 1.3 x 10^-8 (assuming T = 6 mya and Ne = 10^5) to 2.7 x 10^-8 (assuming T = 4.5 mya and Ne = 10^4). If the average generation time is assumed to be 25 years (e.g., EYRE-WALKER and KEIGHTLEY 1999), then mutation rates are estimated to be between 1.6 x 10^-8 and 3.4 x 10^-8.

    Wait a minute. There's no independent estimate of mutation rate here at all!

    What they did was to assume values for the human-chimpanzee divergence and ancestral (chuman) effective size, and then provide an estimate of mutation rate consistent with those assumptions. That's perfectly reasonable as a way of quantifying the genetic divergence that they observed. If our goal is to predict the per-generation mutation rate from interspecific divergence, that's more or less the kind of estimate that we want.

    But many, many other studies have instead used a citation to the Nachman and Crowell rate as a justification for their own estimates of the human-chimpanzee divergence time! That's not perfectly reasonable; in fact, it's perfectly circular. It's turtles all the way down!

    Worse, those citations tend to cite the midpoint of Nachman and Crowell's range of estimates (2.5 x 10^-8) as if it were a true value measured with little error. Reading the original reference, you can plainly see that Nachman and Crowell reported estimates that varied over a factor of three, corresponding to a wide range of chuman population histories. From their discussion:

    Mutation rates estimated for a range of divergence times and ancestral population sizes fall between 1.3 x 10^-8 and 2.7 x 10^-8 assuming a generation time of 20 years (Table 3) or between 1.6 x 10^-8 and 3.4 x 10^-8 assuming a generation time of 25 years. We suggest that 2.5 x 10^-8 is a reasonable estimate of the average mutation rate per nucleotide site (but caution that the actual rate may be between 1.3 x 10^-8 and 3.4 x 10^-8).

    That 2.5 x 10^-8 is simply the midpoint of their range of estimates with the 25-year generation time.

    What would be more reasonable? For hominins and chimpanzees, we probably want to apply a shorter generation length, a larger ancestral effective size, and a higher time of divergence. All of these would have yielded a lower rate for the Nachman and Crowell data. But we don't want to just assume these values, we should try to test whether they are valid based on other data.
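
    To see how much the estimated rate moves with those inputs, here is a minimal sketch. It assumes the standard neutral expectation that the per-site divergence k between two species equals 2*mu*t + 4*Ne*mu, with t the number of generations since the split. It also assumes an observed divergence of about 1.33% per site. Neither the formula nor that divergence figure is quoted above, so treat both as assumptions made for illustration.

```python
K_ASSUMED = 0.0133  # assumed per-site human-chimpanzee divergence (illustrative value)

def mutation_rate(k: float, t_years: float, gen_years: float, ne: float) -> float:
    """Per-site, per-generation rate implied by k = 2*mu*t + 4*Ne*mu."""
    generations = t_years / gen_years
    return k / (2.0 * generations + 4.0 * ne)

# (divergence time in years, ancestral effective size)
scenarios = [(4.5e6, 1e4), (6.0e6, 1e5), (7.0e6, 1e5)]

for t_years, ne in scenarios:
    for gen_years in (15, 20, 25):
        mu = mutation_rate(K_ASSUMED, t_years, gen_years, ne)
        print(f"T = {t_years / 1e6:.1f} My, Ne = {ne:.0e}, generation = {gen_years} y: mu = {mu:.2e}")
```

    With 20-year generations and this assumed divergence, the 4.5 My / 10^4 and 6 My / 10^5 scenarios come out near 2.7 x 10^-8 and 1.3 x 10^-8, in line with the 20-year range quoted above. An older divergence, a larger ancestral population, or a shorter generation each push the rate down.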

    Other mutation rates from phylogenetic comparisons

    Nachman and Crowell have not been alone in their ultimate reliance on fossil evidence as an assumption underlying the per-generation mutation rate. But several other studies came to a slower mutation rate. Mostly, these studies have assumed that the human-chimpanzee divergence happened significantly earlier than 5 million years ago. Necessarily, then, the human per-generation mutation rate would have to be lower, as long as the sequence divergence remained the same.

    These estimates are ultimately rooted in the date of one or more fossils, among which the generation time certainly varied. The resulting per-site mutation rates are often reported as per-year instead of per-generation. For example, Yi and colleagues [5] obtained a rate of 0.99 x 10^-9 per year for the human-chimpanzee comparison, which would multiply to 1.98 x 10^-8 per 20-year generation. They propose this as a maximal rate, assuming that Sahelanthropus at a minimum date of 6 million years ago is a hominin. With an older divergence date, they propose a correspondingly lower rate (e.g., 0.79 x 10^-9 per year, not accounting for ancestral population polymorphism).

    Similarly, Steiper and Young [6] considered a long (1.9 Mb) alignment of sequence from 19 primate species. In their model to estimate relative rates on different branches of the primate phylogeny, they incorporated the assumption that Sahelanthropus is on the hominin clade. A divergence date of 6 million years gave rise to a human per-site mutation rate of 0.65 x 10^-9 per year (1.3 x 10^-8 per 20-year generation). A divergence date of 7 million years lowered the mutation rate to 0.57 x 10^-9 per year.
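
    The per-year to per-generation conversion is simple multiplication, but it helps to line the values up. A minimal sketch, using the four per-year rates quoted above and assuming the same 20-year generation (the labels are mine):

```python
def per_generation(rate_per_year: float, gen_years: float = 20.0) -> float:
    """Convert a per-site, per-year rate to a per-site, per-generation rate."""
    return rate_per_year * gen_years

per_year_rates = {
    "Yi et al., 6 My divergence": 0.99e-9,
    "Yi et al., older divergence": 0.79e-9,
    "Steiper and Young, 6 My divergence": 0.65e-9,
    "Steiper and Young, 7 My divergence": 0.57e-9,
}

for label, rate in per_year_rates.items():
    print(f"{label}: {per_generation(rate):.2e} per site per generation")
```

    A longer assumed generation time scales every one of these up in proportion, so the choice of 20 years matters as much as the fossil calibration.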

    Low mutation rates do not always result from these studies. Several have arrived at either a high human mutation rate or a recent human-chimpanzee divergence time. Sometimes a recent human-chimpanzee divergence emerges simply by assuming the rate given by Nachman and Crowell. Yang [7] provides an example of this -- a paper that very thoroughly explores the relationship of divergence time and ancestral effective population size, but ultimately roots the estimates on a single value for mutation rate. This rate we have already seen was itself based on an assumption about divergence time.

    Kumar and colleagues [8] came to a much lower estimate for the human-chimpanzee divergence time, based on an Old World monkey-hominoid divergence at 23.8 million years ago. This estimate did not consider the effect of ancestral polymorphism on the mean genetic divergence time, and so should -- in the language of computer software -- be deprecated.

    I should reiterate that none of these estimates are suitable for testing the times of phylogenetic divergences, because they all assume that the date of some particular fossil (or set of fossils, by fitting a model) is the minimum divergence time for a clade.

    So much of the literature in this area is ultimately circular that I'm pulling out my sparse hair reading through it. By the time we get back to the mid-1990s, the sequence data are even sparser than my hair by today's standards -- only a few hundred base pairs, or a sampling of restriction sites. But the divergence time estimates have propagated forward from that time to today, recycled through the assumptions of papers in the intervening time. It's like the genetic equivalent of money laundering!

    Evidence from parent-offspring sequence differences

    There is another way besides phylogenetic comparison: Simply look at living people and see how many new mutations they have.

    But this is tricky because we are rarely in a position to know which mutations are new. Most variations that we see between two people have persisted in the population for hundreds of generations or more. It takes a special kind of mutation to make its newness evident.

    Up until the advent of large-scale sequencing, the most important source of information about the mutation rate came from the rates of spontaneous Mendelian diseases. When a person has a dominant genetic disorder not carried by either of his parents, you know that the mutation must be new. Disease rates have long been tracked as standard public health data.

    However, the per-genome or per-locus rate of Mendelian disorders can estimate the per-site rate of mutations only by adding well-resolved information about the target size of functional genes. For example, if we know the average gene length and the proportion of different amino acids in functional proteins we can make some estimate of the ratio of synonymous to nonsynonymous sites. But we would still lack information about the fraction of nonsynonymous mutations that cause deleterious effects on protein function. For this reason, it was possible for very early workers (e.g., Haldane) to come within the ballpark of per-locus mutation rates even before the genetic code was available. Yet such estimates are not strictly useful for understanding per-site rates of mutation.

    By 2000, widespread sequencing had begun to identify disease-causing mutations at the sequence level. When exons are known, it is possible to determine the "target size" -- the number of sites at which loss-of-function mutations may occur. These two values provide the numerator and denominator for an estimate of the per-site mutation rate.
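
    The arithmetic behind that numerator and denominator can be sketched as follows. This is a deliberately simplified version for a fully penetrant dominant disorder, assuming every change at a target site causes disease. It is not the procedure of any particular study, and the incidence and target size below are hypothetical numbers chosen only to show the calculation.

```python
def per_site_rate(de_novo_incidence: float, target_sites: int) -> float:
    """Per-site, per-generation mutation rate from the incidence of new dominant cases.

    Each newborn carries two copies of the locus, so the per-copy (per-locus)
    rate is half the incidence; dividing by the target size gives a per-site rate.
    """
    per_locus = de_novo_incidence / 2.0
    return per_locus / target_sites

# Hypothetical inputs: 1 new case per 50,000 births, 500 sites where a change causes disease
incidence = 1.0 / 50_000
target = 500
print(f"per-site rate: {per_site_rate(incidence, target):.2e}")
```

    In practice only some of the possible changes at a site disrupt the protein, which is why the mutation categories have to be handled separately and why the target size is the hard part to pin down.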

    Kondrashov [9] applied this method to estimate the per-site mutation rate across 20 human genes. He surveyed the literature for genes where more than 100 patients had been sequenced completely for the causative locus, finding the causal mutations. Using this value and the disease incidence allowed an estimate of the per-site rate of mutation for different categories of transitions and transversions. There was some variation among loci, with an average rate of per-site mutation equal to 1.8 x 10^-8 per generation.

    Kondrashov observed a few hotspots in these genes, with substitution or deletion rates as much as a hundred times the average site. He also observed that the per-gene rate of mutation varies according to the number of CpG sites. The rate of short deletions was on the order of 5 x 10^-10; insertions were even less frequent.

    The rate estimate by Kondrashov is within the range considered by Nachman and Crowell, but only about 3/4 of the value 2.5 x 10^-8 widely cited as the long-term estimate. If this rate were applied to Nachman and Crowell's pseudogene data, it would predict a human-chimpanzee divergence time around 6 million years.

    This year, Lynch [10] performed a more extensive comparison using methods similar to Kondrashov's. Including more genes, and considering a broader range of mutational effects (including missense as well as nonsense coding mutations), Lynch found an even lower estimate of mutation rate per generation -- only 1.28 x 10^-8 per site.

    These estimates are not precisely the same as comparing parent-offspring pairs, but they are exceedingly powerful because the data on disease rates encompass very large populations of people.

    We should keep in mind the result of Subramanian and Kumar [2], which showed that exons have a higher effective rate of substitution than do noncoding regions. That result implies that the genome-wide rate of change should be lower than estimated by Lynch, because his estimate encompasses only coding mutations. Also, any effect of purifying selection on these mutations will tend to decrease the long-term rate of substitutions per site to a lower value than the rate of mutations. The rate estimated by Lynch should then be an overestimate of the substitution rate that would be applicable to hominoid phylogenetic relationships.

    A slower rate

    These estimates of the per-generation mutation rate are all low compared to the commonly-cited 2.5 x 10^-8. They are not quite as low as the rate estimated by Roach and colleagues [1], but the Lynch estimate is very close: 1.28 x 10^-8 compared to 1.1 x 10^-8 per site.

    The lower estimate from Roach and colleagues is a direct comparison of parent and offspring. In my earlier discussion of that rate, I suggested that false negatives in the sequence comparisons might have lowered the apparent rate of mutations. I still think we can't rule out that possibility. But the rate is not alone, and so it is less surprising than it may have seemed.

    My post last week on the 1000 Genomes Project results ("Now for anthropological genomics") mentioned that the 1000 Genomes comparisons have arrived at essentially the same rate as Roach and colleagues. Comparison of one family trio led to a rate of 1.0 x 10^-8 per site per generation; the other family trio gave rise to an estimate of 1.2 x 10^-8 per site per generation. These bracket the estimate given by Roach and colleagues.

    My basic observation about the human-chimpanzee divergence time is still sound:

    If this mutation rate is accurate, then the average human-chimpanzee gene divergence has to be up around 11 million years ago. That can be accommodated with a 7-million-year-old species divergence only if we assume a very large ancestral population -- on the order of 50,000 or higher. Or, the ancestral effective size could be lower -- but that would make the species divergence substantially older -- 9 million years or more.

    As we go further back in time, this lower human mutation rate may be less and less relevant, because different primate lineages may have higher (or lower) rates. When some of the kinks have been worked out of whole-genome sequencing, it would be tremendously useful to sequence parent-offspring pairs in other primate species. With those data, rate heterogeneity could be tested directly.

    For events within the hominins, the parent-offspring rate of mutations ought to be better than a rate estimated from phylogenetic distance. Phylogenetic distances are estimated with even more error than mutations, increasingly so as our methods for comparing genomes improve. But some fraction of new mutations will ultimately be lost to purifying selection. That implies, again, that the longer term rate of substitutions will be lower than the rate of mutations measured from parent-offspring comparisons.

    A rate of 1.1 x 10^-8 would have no effect on the number of genetic differences observed between people, because these differences are just counted, not estimated by genealogical relationships that are known. It is the unknown genealogical relationships, which are estimated from genetic differences, that may change substantially.

    Let's consider an example. Harris and Hey [11] sequenced 4200 bp of the gene PDHA1, an X-linked gene whose product is part of a mitochondrial enzyme complex. At the time of their study (1999), their result was one of the oldest coalescence times estimated for non-African populations based on sequence data; they estimated the root of the PDHA1 genealogy was 1.8 million years old. This estimate was based on the assumption that human and chimpanzee copies, which differed by an average of 40.42 substitutions, had diverged at 5 million years ago. That would imply that the average genetic difference between humans across the deepest root of the genealogy, 15.05 mutational differences, corresponds to 1.86 million years of time. If we instead assert a per-generation rate of 1.1 x 10^-8 per site, these data would generate an estimate of 163,000 generations for the root of the genealogy, roughly 3.3 million years.

    In other words, a coalescence that appeared to have happened in early Homo now looks rooted at the age of A. afarensis. The chimpanzee-human genetic root would be around 8.7 million years for these data.
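
    The PDHA1 arithmetic can be checked in a few lines. This sketch uses only the figures given above (4200 bp, 15.05 and 40.42 differences, a 5-million-year calibration, 1.1 x 10^-8 per site per generation). The 20-year generation time is an assumption, chosen because it is consistent with 163,000 generations coming out at roughly 3.3 million years.

```python
SITES = 4200        # base pairs of PDHA1 sequenced
RATE = 1.1e-8       # per site, per generation
GEN_YEARS = 20      # assumed generation time in years

def root_age(differences: float) -> tuple[float, float]:
    """Return (generations, years) to the common ancestor of two sequences.

    Differences accumulate along both lineages, so each lineage carries half.
    """
    generations = (differences / 2.0) / (SITES * RATE)
    return generations, generations * GEN_YEARS

# Original approach: scale by the assumed 5-million-year human-chimpanzee divergence
old_root_years = 15.05 / 40.42 * 5e6

human_gens, human_years = root_age(15.05)
chimp_gens, chimp_years = root_age(40.42)

print(f"fossil-calibrated human root: {old_root_years / 1e6:.2f} My")
print(f"rate-based human root: {human_gens:,.0f} generations, {human_years / 1e6:.2f} My")
print(f"rate-based human-chimpanzee root: {chimp_years / 1e6:.2f} My")
```

    Both rate-based dates scale by the same factor relative to the fossil-calibrated ones, because the only thing that changed is the clock.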

    These estimates would likely be biased too low, because the X chromosome has a somewhat lower rate of mutation than the autosomes. Lynch [10] addressed this issue: X chromosomes are in males (with their higher rate of mutations) only 1/3 of the time, compared to 1/2 the time for autosomes. Any purifying selection would also bias the estimate too low. If these 4200 bp have a higher-than-average CpG content, that is one factor that might require a higher per-generation rate.
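
    The size of that X chromosome effect depends on how much higher the male mutation rate is. A minimal sketch, assuming the X spends 1/3 of its time in males and autosomes 1/2, and writing alpha for the male-to-female ratio of mutation rates (alpha is my notation; the value 6 is the upper figure mentioned earlier for the male germline):

```python
def x_to_autosome_ratio(alpha: float) -> float:
    """Ratio of X-linked to autosomal mutation rate for male/female rate ratio alpha.

    X:         (2/3) * female + (1/3) * male
    autosomes: (1/2) * female + (1/2) * male
    """
    x_rate = (2.0 + alpha) / 3.0
    autosomal_rate = (1.0 + alpha) / 2.0
    return x_rate / autosomal_rate

for alpha in (1, 2, 4, 6):
    print(f"alpha = {alpha}: X rate is {x_to_autosome_ratio(alpha):.2f} of the autosomal rate")
```

    With alpha at 6 the X-linked rate would be about three quarters of the autosomal rate, so dates computed from an autosomal rate applied to X-linked sequence would be too young by a corresponding amount.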

    Is any of this a problem? I don't think we know yet. A lower rate must readjust the apparent correspondence of some molecular time estimates with the archaeological record. But to be honest, most of the apparent correspondences of such dates have been illusory, because genealogical relationships among genes have such large expected variance under any realistic human population model. It is really the availability of whole-genome comparisons that has a chance of improving these population models.


    References

    1. Roach JC, Glusman G, Smit AFA, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, et al. Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Science [Internet]. 2010;328:636–639. Available from: http://dx.doi.org/10.1126/science.1186802
    2. Subramanian S, Kumar S. Neutral Substitutions Occur at a Faster Rate in Exons Than in Noncoding DNA in Primate Genomes. Genome Research [Internet]. 2003;13:838–844. Available from: http://dx.doi.org/10.1101/gr.1152803
    3. Crow JF. The origins, patterns and implications of human spontaneous mutation. Nature Reviews Genetics [Internet]. 2000;1:40–47. Available from: http://dx.doi.org/10.1038/35049558
    4. Nachman MW, Crowell SL. Estimate of the Mutation Rate per Nucleotide in Humans. Genetics [Internet]. 2000;156:297–304. Available from: http://www.genetics.org/cgi/content/abstract/156/1/297
    5. Yi S, Ellsworth DL, wen-Hsiung Li. Slow Molecular Clocks in {Old World} Monkeys, Apes, and Humans. Molecular Biology and Evolution. 2002;19:2191–2198.
    6. Steiper ME, Young NM. Primate molecular divergence dates. Molecular Phylogenetics and Evolution [Internet]. 2006;41:384–394. Available from: http://dx.doi.org/10.1016/j.ympev.2006.05.021
    7. Yang Z. Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci. Genetics [Internet]. 2002;162:1811–1823. Available from: http://www.genetics.org/cgi/content/abstract/162/4/1811
    8. Kumar S, Filipski A, Swarna V, Walker A, Hedges BS. Placing Confidence Limits on the Molecular Age of the Human-Chimpanzee Divergence. Proceedings of the National Academy of Sciences, U. S. A. [Internet]. 2005;102:18842–18847. Available from: http://dx.doi.org/10.1073/pnas.0509585102
    9. Kondrashov AS. Direct estimates of human per nucleotide mutation rates at 20 loci causing mendelian diseases. Hum. Mutat. [Internet]. 2003;21:12–27. Available from: http://dx.doi.org/10.1002/humu.10147
    10. Lynch M. Rate, molecular spectrum, and consequences of human mutation. Proceedings of the National Academy of Sciences [Internet]. 2010;107:961–968. Available from: http://dx.doi.org/10.1073/pnas.0912629107
    11. Harris EE, Hey J. X chromosome evidence for ancient human histories. Proceedings of the National Academy of Sciences, U. S. A. 1999;96:3320–3324.
    Synopsis: 
    The 1000 Genomes Project is finding that the mutation rate is half the value usually assumed.
  • Copy number variation in 1000 Genomes

    Sat, 2010-10-30 13:01 -- John Hawks

    When I wrote earlier in the week about the 1000 Genomes Project results, I mentioned that a second paper was being published in Science. That paper, by Peter Sudmant and colleagues [1], works to quantify the amount of copy number variation of genes in the genomes of the study participants.

    It can be challenging to study copy number variation using shotgun sequencing methods, because each duplicated part of the genome creates multiple alignment targets for short reads. One way to deal with this problem is to use the drawbacks of shotgun sequencing as an advantage: Look for template regions of the genome that have much higher read depth than others. These places include many where a gene has been duplicated in the target genome, giving one-and-a-half or twice the number of reads for each duplication. Looking at read depth genome-wide is a quick way to assess copy number variation at sites where it was previously unknown. Once these are ascertained in a sample of genomes, they can be targeted for further study, including characterizing the boundaries of the duplicate region.
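
    The core of the read-depth idea fits in a few lines. This is a toy sketch with hypothetical per-window read depths, assuming most windows are at the normal diploid copy number so the median depth can serve as the baseline. It is not the method used in the paper, and real pipelines also have to correct for things like GC content and mappability.

```python
from statistics import median

def copy_number(window_depths: list[float], ploidy: int = 2) -> list[float]:
    """Estimate copy number per window by scaling read depth to the median window."""
    baseline = median(window_depths)
    return [ploidy * depth / baseline for depth in window_depths]

# Hypothetical read depths for nine windows along a chromosome
depths = [30, 29, 31, 45, 30, 60, 30, 29, 30]

for depth, estimate in zip(depths, copy_number(depths)):
    flag = "" if round(estimate) == 2 else "  <- candidate copy number variant"
    print(f"depth {depth:>3}: ~{estimate:.1f} copies{flag}")
```

    A window with one-and-a-half times the baseline depth shows up as about three copies, and one with twice the depth as about four.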

    The paper describes this methodology in some detail, with various embellishments to get more precise answers to certain kinds of structural questions. They developed a large set of SNPs that differentiate paralogous gene copies, among other things allowing them to examine which members of various gene families had been duplicated, and whether events were shared between populations.

    Through our analysis, we identified that duplicated regions are more likely to be stratified between human populations when compared with copy number variation within unique regions of the genome. For example, 59 (92%) of the top 64 stratified gene families overlap segmental duplications (P < 10^-16). Remarkably, many of these highly polymorphic genes map to duplications that promote recurrent rearrangements associated with intellectual disability, autism, schizophrenia and epilepsy. We hypothesize that the extreme polymorphism may contribute to genomic instability associated with disease and may predispose certain populations to different chromosomal rearrangements (30).

    Segmental duplications can be relatively effective ways to change the amount of gene product without changing the gene product. In other words, a duplication can increase the dosage of a particular gene product. That can sometimes be very useful. For example, salivary amylase production varies among people due to the number of duplicate copies of the gene [2]. The copy number variation is related to population history of agricultural subsistence -- old agricultural populations have more amylase copies. It's a simple case where the dietary ecology favors a dosage increase for an enzyme.

    Gene duplications and other structural changes to the genome are rare events -- any particular kind of change is substantially less likely than a single nucleotide mutation at a given point in the genome. So it is of some interest to consider which regions are actually invariant in copy number -- duplications that occurred on the human lineage but have been conserved in more recent populations -- because these may reflect old adaptations essential to the evolution of hominins. Here's what the paper concludes:

    We have also defined the ~49% of gene duplicates that are largely invariant in copy among humans. Although this is based only on an assessment of 159 genomes from select populations, the fact that this fraction of genes remains copy number invariant in a milieu of recurrent unequal crossover suggests functional importance. Among these, we find a number of genes involved in neurological development and disease. We note that many of these duplicated genes are themselves incomplete and may represent nonprocessed pseudogenes, which may modulate the expression of the ancestral gene. The characterization of the most recently duplicated genes should facilitate identification of those that acquired new functions (neofunctionalization) versus those that have become pseudogenes or have partitioned their function among duplicate copies (31).

    I was going to write that there's not much analysis in the paper and let it go at that. But the paper has a 108-page supplement.

    I know I write this like once a week, but what the heck is the point of a 4-page paper with a 108-page supplement? Granted, 7 of the supplement pages are the author list (!!), but I view the whole thing mainly as a rip-off for the people who did the analyses in the supplement. Why don't they get their own first-authored publications? Are other journals satisfied to accept first-authored versions of analyses that have already been in a supplement in Science?

    The supplement lists 64 gene families including segmental duplications that differ substantially in average copy number among the CEU, YRI and CHB/JPT samples to which the low-coverage whole-genome sequencing has been applied thus far. The table (S8) lists the mean copy number in the three populations and the total variance in copy number; the key statistic is a value called Vst, which is analogous to FST for length variations.
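
    For readers who haven't met Vst, here is a minimal sketch of how such a statistic can be computed. I'm assuming the usual definition, (V_T - V_S) / V_T, where V_T is the variance in copy number across all individuals and V_S is the average within-population variance weighted by sample size. The supplement may compute it differently, and the copy numbers below are hypothetical.

```python
from statistics import pvariance

def vst(copy_numbers_by_pop: dict[str, list[float]]) -> float:
    """Vst = (V_T - V_S) / V_T for copy numbers grouped by population."""
    pooled = [x for values in copy_numbers_by_pop.values() for x in values]
    v_total = pvariance(pooled)
    if v_total == 0:
        return 0.0  # no variation at all, so no stratification
    v_within = sum(len(v) * pvariance(v) for v in copy_numbers_by_pop.values()) / len(pooled)
    return (v_total - v_within) / v_total

# Hypothetical copy numbers for one gene family in three samples
example = {
    "CEU": [2, 2, 3, 2],
    "YRI": [4, 5, 4, 4],
    "CHB+JPT": [2, 3, 2, 2],
}
print(f"Vst = {vst(example):.2f}")
```

    Like FST, the value runs from 0 when populations have the same copy number distribution up to 1 when all of the variation lies between populations.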

    These are not generally duplications of whole genes, and their boundaries don't generally correspond to the boundaries of coding regions or exons. Without further analysis, it is not clear which of these duplicated regions may have functional import. Many of the additional copies may be inactive, either because of pseudogenization or because the duplication may not include the promoter/enhancer elements needed for gene expression. Some of the duplications occur in regions with known pseudogenes. The "involvement" of some genes in these regions with neurological development and disease is interesting, but the paper attempts no statistical assessment of this. It's a list of candidates, with some interesting ones that are obviously worth further examination, but without a clear story for any of them.

    It is maybe interesting that salivary amylase didn't make the list. It's not clear from the supplement whether that is an omission or whether its population differentiation, great as it is, is not as high as the lower cutoff. The greatest differentiation for amylase copy number is between populations that are not yet represented in the 1000 Genomes whole-genome sequencing.

    That raises an interesting question: What if we applied the same methods to the read data from some of the other public genomes? The Bushman genomes from earlier this year are an especially interesting sample because they are notably not drawn from a long-time agricultural population. In which areas would they score atypical copy number variation compared to the 1000 Genomes samples?



  • Now for anthropological genomics

    Wed, 2010-10-27 15:30 -- John Hawks

    The first of the papers describing results from the 1000 Genomes project has been released today in Nature [1].

    This is "big project" genomics news. Like many announcements of this kind, it represents more of a public relations milestone than actual scientific advance. Some of the project data have been publicly available for a while -- the 1000 Genomes and HapMap projects have to their great benefit been based on the strategy of immediate data release. The new paper and its supplements include many summary statistics and report on new genetic variants that have been found -- there's a lot of information here. But most of the interesting science is just getting started. A paper like this really represents the opening of a race to use the new data for innovative research.

    Here in my lab, we are exploring the ways that whole genome sequencing can change our study of human population history. A large part of this is our work on recent selection, of course ("Why human evolution accelerated"). Whole-genome sequencing is not essential to finding many recently selected regions of the genome, but it will help enormously in narrowing down the actual functional changes that affected fitness in past populations.

    Whole-genome sequencing will rapidly improve our ability to resolve the population history of Pleistocene humans. For older events -- going back to the origins of Homo -- whole-genome sequencing will give us samples of genealogies from across the genome. We will be able to resolve some very ancient episodes of population mixture, and we have a chance of testing what kinds of events accompanied the rise of our genus. Even for events of the Late Pleistocene and Holocene, for which haplotypes of SNP markers can be useful without resequencing, whole-genome sequencing can be tremendously valuable. Reconstructing haplotypes from diploid genotypes requires us to make some assumptions about the demography of the population, which may be exactly what we are trying to discover. A sample of genomes sequenced at high read coverage will free us from some of those assumptions. It's really exciting stuff for an anthropologist.

    All those are reasons why the data will be useful for us in the long term. But at the moment, the data are not nearly so rich. The current paper reports:

    1. Whole-genome sequencing at 42x coverage of six individuals, one three-person family trio from Utah, and one family trio from Nigeria.

    2. Low-coverage (2x-6x) sequencing of 59 Yoruba, 60 Utah residents, 30 Chinese and 30 Japanese individuals. These are a subsample of the original HapMap samples.

    3. Sequencing at 50x coverage of 8140 exons in 697 individuals. These are a subset of the HapMap v.3 population samples, including Yoruba, Luhya, Utah, Tuscan, Japanese and Chinese samples. These exons come from 906 genes targeted "randomly".

    It's pretty far from a thousand genomes, and even farther from the stated goal of 2400 genomes. The low-coverage genomes are not sufficient to call genotypes across most of the genome. This is a persistent problem with "whole-genome" sequencing projects so far. A person's whole genome is mostly diploid -- two copies of most everything. Recently, we've seen several "whole-genome" sequences where each base is given a consensus value. SNP variants may be called against other people's genomes, but rarely is there sufficient coverage to call SNPs within the individual. There are exceptions -- a handful of public whole genomes are at high coverage. The exon sequencing here should be enough to call SNPs in these functional regions with great confidence. The family trios also should have enough to call SNPs. So some of these will be our first chance to do actual population genetics on diploid genome-wide sequence data.

    One important piece of analysis in the paper is the confirmation of a low rate of de novo mutations in the children of the family trios. I discussed a result last spring that came to a very low rate of per-site mutation ("A low human mutation rate may throw everything out of whack"). The rate in that paper was 1.1 x 10^-8 per site per generation. The current paper comes to a rate between 1.0 and 1.2 x 10^-8. I have some more written on this issue and I'll integrate the new finding and post it later in the week. This aspect of the study is pretty important to our understanding of human evolution.

    The paper makes an interesting distinction between "accessible" and "inaccessible" portions of the genome -- accessibility meaning ease of mapping and aligning sequence reads:

    Accurate identification of genetic variation depends on alignment of the sequence data to the correct genomic location. We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads (Supplementary Information). This approach balances the need to reduce incorrect alignments and false-positive detection of variants against maximizing the proportion of the genome that can be interrogated.

    For the low-coverage analysis, the accessible genome contains approximately 85% of the reference sequence and 93% of the coding sequences. Over 99% of sites genotyped in the second generation haplotype map (HapMap II, ref. 4) are included. Of inaccessible sites, over 97% are annotated as high-copy repeats or segmental duplications. However, only one-quarter of previously discovered repeats and segmental duplications were inaccessible.

    It's an interesting decision -- just focus and report on the majority of the genome where alignment is easier.

    The paper discusses selection briefly. There's not much new here other than the identification of candidate causal variants for some selected haplotypes.

    First, it provides a more comprehensive catalogue of fixed differences between populations, of which there are very few: two between CEU and CHB+JPT (including the A111T missense variant in SLC24A5 (ref. 38) contributing to light skin colour), four between CEU and YRI (including the −46 GATA box null mutation upstream of DARC (ref. 39), the Duffy O allele leading to Plasmodium vivax malaria resistance) and 72 between CHB+JPT and YRI (including 24 around the exocyst complex component gene EXOC6B); see Supplementary Table 7 for a complete list. Second, it provides new candidates for selected variants, genes and pathways. For example, we identified 139 non-synonymous variants showing large allele frequency differences (at least 0.8) between populations (Supplementary Table 8), including at least two genes involved in meiotic recombination -- FANCA (ninth most extreme non-synonymous SNP in CEU versus CHB+JPT) and TEX15 (thirteenth most extreme non-synonymous SNP in CEU versus YRI, and twenty-sixth most extreme non-synonymous SNP in CHB+JPT versus YRI). Because we are finding almost all common variants in each population, these lists should contain the vast majority of the near fixed differences among these populations. Finally, it improves the fine mapping of selective sweeps (Supplementary Fig. 14) and analysis of the dynamics of location adaptation. For example, we find that the signal of population differentiation around high Fst genic SNPs drops by half within, on average, less than 0.05 cM (typically 30–50 kb; Fig. 5d). Furthermore, 51% of such variants are polymorphic in both populations. These observations indicate that much local adaptation has occurred by selection acting on existing variation rather than new mutation.

    This last point is not especially demonstrated by the new sequencing data. What we are looking at is few complete sweeps, but that's expected even if all the selected variants were novel mutations -- there just hasn't been time to fix many variants. It remains to be shown to what extent this selection involves standing variants, partial sweeps of new mutations, or parallel adaptations ("Spatial dispersal, parallel adaptation, and the 'Stooge effect'"). We'll probably see a lot more interesting work on recent selection coming out of the new data.

    Science has a companion paper to the Nature data summary, focusing on copy number variation and gene duplications. I will review that one separately.

    UPDATE (2010-10-27): Dienekes pulls out an interesting passage about the Y chromosome sequences, which in at least one case recover many markers separating haplogroups once thought to be much closer to each other. Not sure what to make of that yet.


  • A low human mutation rate may throw everything out of whack

    Thu, 2010-03-18 16:30 -- John Hawks

    Last week, a paper looking for the genetic causes of Miller syndrome reported the whole genomes of four members of a single family: two siblings with the disorder and their two parents without. The idea was that they would simply compare the affected and unaffected genomes. They would then find candidate loci that might account for Miller syndrome in the affected siblings. By exploiting some other sources of information, they found what they were looking for. Daniel MacArthur covered the story in his post, "Disease hunting with whole genome sequences: the good news, and the bad news".

    I got interested in another aspect of the story. With whole-genome sequences of parents and offspring, it becomes possible to directly determine the rate of mutations in each generation. The paper by Roach and colleagues did just that -- they counted 28 mutations in the 2.3 billion bases of sequence they included in their comparison. That makes a per-site mutation rate of 1.1 x 10^-8 per generation.

    Which is a pretty interesting number. You see, it's less than half what it ought to be:

    [O]ur estimated human mutation rate is lower than previous estimates, the most widely cited of which is 2.5 x 10^-8 per generation (10) based on three parameters: a human-chimpanzee nucleotide divergence per site (Kt) of 0.013, a species divergence time of five million years ago, and an ancestral effective population size of 10,000. More recent estimates indicate a nucleotide divergence of 0.012 (9), species divergence time between six and seven million years ago (11–15), and ancestral effective population size between 40,000 and 148,000 (16–19). With these parameter ranges and a generation length of 15 to 25 years, the mutation rate estimate is between 7.6 x 10^-9 and 2.2 x 10^-8 per generation, which is consistent with our intergenerational estimate of 1.1 x 10^-8. Our estimate is within one standard deviation (SD) of an earlier estimate of 1.7 x 10^-8 (SD: 9 x 10^-9) based on 20 disease-causing loci (20). The rate we report is for autosomes, and should be several-fold lower than that of the Y chromosome, as in the male germline more cell divisions occur per generation. Though our rate differs approximately as expected from the recently reported estimate of 3.0 x 10^-8 (95% CI: 8.9 x 10^-9 – 7.0 x 10^-8) for the Y chromosome, the error rates make this difference not significant (21).

    You can see the obvious implication: If this mutation rate is accurate, then the average human-chimpanzee gene divergence has to be up around 11 million years ago. That can be accommodated with a 7-million-year-old species divergence only if we assume a very large ancestral population -- on the order of 50,000 or higher. Or, the ancestral effective size could be lower -- but that would make the species divergence substantially older -- 9 million years or more.
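
    The arithmetic behind that implication can be sketched in a few lines. This is a back-of-the-envelope model, not the calculation from the paper: it assumes divergence accumulates on both lineages (K = 2 x mu x generations), that gene copies in the ancestral population coalesce on average 2Ne generations before the species split, and a 20-year generation, which is my own pick from the 15 to 25 year range quoted above. The function names are hypothetical.

```python
def gene_divergence_years(k: float, mu: float, gen_years: float) -> float:
    """Average age of the common ancestor of a human and a chimpanzee gene copy.

    Mutations accumulate on both lineages, so k = 2 * mu * generations.
    """
    return k / (2.0 * mu) * gen_years


def implied_ancestral_ne(k: float, mu: float, gen_years: float,
                         split_years: float) -> float:
    """Ancestral effective size needed to reconcile gene and species divergence.

    Assumes the expected coalescence time in the ancestor is 2 * Ne generations.
    """
    excess_years = gene_divergence_years(k, mu, gen_years) - split_years
    return excess_years / (2.0 * gen_years)


def implied_split_years(k: float, mu: float, gen_years: float, ne: float) -> float:
    """Species divergence time implied by a given ancestral effective size."""
    return gene_divergence_years(k, mu, gen_years) - 2.0 * ne * gen_years


if __name__ == "__main__":
    K = 0.012         # nucleotide divergence per site, from the quoted passage
    MU = 1.1e-8       # per-site, per-generation rate from the family quartet
    GEN_YEARS = 20.0  # assumed generation length (illustrative)

    print(f"gene divergence: {gene_divergence_years(K, MU, GEN_YEARS) / 1e6:.1f} My")
    print(f"Ne for a 7 My split: {implied_ancestral_ne(K, MU, GEN_YEARS, 7e6):,.0f}")
    print(f"split if Ne = 50,000: {implied_split_years(K, MU, GEN_YEARS, 5e4) / 1e6:.1f} My")
```

    With these inputs the gene divergence comes out near 10.9 million years, and an ancestral size of 50,000 puts the species split near 8.9 million years. Both figures move considerably across the 15 to 25 year range of generation lengths, so treat them as a consistency check rather than estimates.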

    There is a second implication. Most studies of human genetic variation have assumed that 5-million-year-old human-chimpanzee divergence and the high associated rate of mutations. If the true rate is less than half that, then the coalescence times of human genes are more than double most estimates. That would include our estimates of human-Neandertal genetic differences.

    Well, that's a fine pickle.

    I'm not quite ready to believe the very low rate estimate. The analysis in this paper uncovered tens of thousands of false positives, and had to filter through those to arrive at 28 true mutations. The filtering involved resequencing all the positives to determine which were true and which were false, but maybe there's room in there for a substantial number of false negatives, too.

    If this low estimate were true of the human-chimpanzee divergence, it would imply vastly higher ages for other primate divergences, or a much lower rate on the human lineage specifically. So that allows another check on the process.

    But generally, I'll be looking at whole-genome family comparisons with great interest, because they will give us a much more precise understanding of the rate of mutations and recombinations across the genome.

    References:

    Roach JC and 14 others. 2010. Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Science (early online) doi:10.1126/science.1186802

    Synopsis: 
    Whole genome sequencing of a family finds a very low number of mutations, suggesting evolution doesn't have the timescale we thought.
