john hawks weblog

paleoanthropology, genetics and evolution

genomics

  • Gene Wiki

    Thu, 2008-07-10 16:32 -- John Hawks

    Larry Moran comments on the Gene Wiki. (If you haven't read about it, check out this AP article, or the PLoS Biology paper.) Larry has written before about the errors in sequence databases and how hard it is to fix them; he's one of the people with the most practical experience trying to find ways to remove errors.

    His posts are a good way to learn about the limits of these resources. I've seen several cases where incorrect data made it into a database and proliferated through the literature. These cases are extremely hard to root out once they get in. Errors are inevitable -- sometimes things just aren't the way they look. The wiki concept does provide a chance to fix things, or at least a place for annotations of existing errors, as long as credible people are doing the annotations.

    I think that the baseline may have potential as a foundation for a wiki about recent selection on human genes.

  • Sanger Institute sequences a trillion bases in six months

    Fri, 2008-07-04 11:03 -- John Hawks

    Genetic Future comments on news from the Wellcome Trust Sanger Institute:

    At the current rate (which is rapidly increasing) the Sanger is churning out more DNA sequence every two minutes than was generated by the entire research community from 1982-1987. This obscene rate of data generation has been enabled by the development of next-generation DNA sequencing platforms, which can each churn out one human genome equivalent in less than a week.

  • Ajit Varki profile

    Thu, 2008-07-03 09:35 -- John Hawks

    Reporter Bruce Lieberman profiles geneticist Ajit Varki in this week's Nature. It's a good summary of Varki's work in sialic acid evolution, focusing on one particular change in the N-glycolyl neuraminic acid (Neu5Gc), work that I touched on here around 3 years ago.

    On a molecular level, the difference between Neu5Gc and Neu5Ac is tiny -- a single added oxygen atom perched on one arm distinguishes one from the other (see graphic). But on a biological level, the difference could be enormous. "We thought if monkeys and all of our closest relatives have Neu5Gc and humans don't, then there must be a molecular basis for that," Varki says. He subsequently found it in an enzyme that converts Neu5Ac to Neu5Gc, but which is disabled by mutation in humans.

    The article also covers the founding of the Center for Academic Research and Training in Anthropogeny, a research effort of the University of California, San Diego and the Salk Institute. Led by Varki, Margaret Schoeninger, and Pascal Gagneux, the center aims to become an important focus of interdisciplinary work in human origins. I was lucky enough to be invited to one of their research seminars two years ago, and I can say it's a wonderful environment for collaboration, if the project can continue and build on these small meetings:

    Between 1998 and 2007, the Project for Explaining the Origin of Humans drew in anthropologists, primate biologists, geneticists, immunologists, neuroscientists, linguists and many others. They discussed topics ranging from the evolution of language to the differences between humans, Neanderthals and Homo erectus, the first hominid to leave Africa. Goodman says the interdisciplinary nature of the series made it extremely important to the field. "You really had the chance to explore an issue as it relates to the evolutionary origins of our species," he says.

    ...

    Varki estimates that he has listened to more than 300 talks on various aspects of this discipline. "The idea is the linguist needs to talk to the molecular biologist who needs to talk to the neuroscientist who needs to talk to the psychologist and philosopher about these issues," he says. "Most areas of human knowledge are somewhere relevant."

    I think that's exactly the right attitude -- we need more interdisciplinary efforts. I run up against the blind spots of various specialties all the time, and I'm just one person. On the other hand, it is very challenging to get people to invest the time to learn facts outside their narrow field. If this institute helps those efforts, it will be all to the good.

    References:

    Lieberman B. 2008. Human evolution: details of being human. Nature 454:21-23. doi:10.1038/454021a

  • How much data in your genome

    Sat, 2008-06-28 10:46 -- John Hawks

    Daniel MacArthur, of Genetic Future, reviews how much data it takes to store a genome. Naturally, you'd probably think it was around 12 billion bits (2 bits per base pair), but sequencing technologies and the availability of references from other people make things a little more complicated.

    This interesting quote about the raw image files generated by the Illumina platform presents some of the range of complications:

    Almost as soon as these images are generated they are fed into an algorithm that processes them, creating a set of text files containing the sequence of each of the fragments. The image files are then almost always discarded. Why are they discarded? Because, as you will see in a minute, storing the raw image data from each run in even a moderate-scale sequencing facility quickly becomes prohibitively expensive - in fact, several people have suggested to me that it would be cheaper to just repeat the sequencing than to store these data long-term.

    An accurate read requires lots of redundant bits, which adds up to lots and lots of data storage. If these are winnowed down to a real "best" sequence, then you're back to 12 billion bits (=1.5 gigabytes), more or less. Of course, most of that sequence is redundant and may be significantly compressed. And if you compare with a reference sequence, really a small amount of information is sufficient to distinguish your genome from the reference. Anyway, all this is explained at the link.
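    The arithmetic behind those figures can be sketched in a few lines. This assumes a diploid genome of roughly 6 billion base pairs, which is what the 12-billion-bit figure implies at 2 bits per base. The variant count used for the reference comparison is a round, illustrative number, and the function names are my own.

```python
import math

BITS_PER_BASE = 2                    # A, C, G, T fit in 2 bits
DIPLOID_BASE_PAIRS = 6_000_000_000   # assumed round figure (two ~3 Gb copies)


def genome_storage_bytes(base_pairs, bits_per_base=BITS_PER_BASE):
    """Uncompressed storage, in bytes, for a sequence of `base_pairs` bases."""
    return base_pairs * bits_per_base / 8


def variant_storage_bytes(n_variants, genome_length):
    """Rough size of a diff against a reference: one position plus one base per variant."""
    bits_for_position = math.ceil(math.log2(genome_length))
    return n_variants * (bits_for_position + BITS_PER_BASE) / 8


full_gb = genome_storage_bytes(DIPLOID_BASE_PAIRS) / 1e9
# 3 million single-base differences is an illustrative assumption, not a measured value.
diff_mb = variant_storage_bytes(3_000_000, DIPLOID_BASE_PAIRS) / 1e6
print(f"full sequence: {full_gb} GB; diff against reference: about {diff_mb:.0f} MB")
```

    Even this naive encoding of differences is about a hundredfold smaller than the full sequence, which is the point about references.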

  • Substitution rates and ancestral population sizes

    Thu, 2008-05-15 14:20 -- John Hawks

    The rate of neutral mutations varies across the genome. When studying a single gene, this variation in rates is not especially important -- it is generally possible to obtain an estimate of the neutral rate for a single locus by comparing just that locus among closely related species.

    But some comparisons involve looking at the pattern of variation among different loci. For instance, testing hypotheses about the ancestral populations leading to living species (like the common ancestor of humans and chimpanzees) involves comparing the amount of divergence among many independent loci. The variance in divergence times among loci gives an estimate of inbreeding in the ancestral population.

    I discussed this particular example two years ago this week, after the paper that proposed extended hybridization between ancestral hominids and chimpanzees. The conclusion of the paper was that the X chromosome displays much less divergence between humans and chimpanzees than the autosomes, and this might reflect a late introgression of the X chromosome into hominids from another population that (mostly) was ancestral to chimpanzees. The autosomes, by contrast, averaged very old genetic divergences, although there was substantial variance. As I concluded then, the data look consistent with a large population size in the human-chimpanzee ancestor species, coupled with greater selection on the X chromosome. The interpretation of large population size (or alternatively, the interpretation of long-term population structure) comes from the low inferred inbreeding in that ancestral population -- which caused the variance in divergence dates among loci.

    But there is another reason for a large variance in divergence dates: variance in mutation rates. Whenever mutation rates vary among loci, this variance adds to the variance among loci in their between-species genetic differences -- that is, the substitution rate. And as long as we are excluding selected sites (as we always try to do for these kinds of comparisons) we will overestimate the genetic diversity in ancestral species whenever the mutation rate varies among loci.
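    The inflation follows from the variance of a product. Here is a minimal sketch, assuming each locus's divergence is a locus-specific rate multiplied by a locus-specific coalescence time, with the two drawn independently. The function name and all of the numbers are made up for illustration.

```python
def divergence_variance(mean_rate, var_rate, mean_time, var_time):
    """Variance of d = rate * time when rate and time are independent.

    Uses Var(XY) = E[X^2] E[Y^2] - (E[X] E[Y])^2.
    """
    second_moment_rate = var_rate + mean_rate ** 2
    second_moment_time = var_time + mean_time ** 2
    return second_moment_rate * second_moment_time - (mean_rate * mean_time) ** 2


# Illustrative values only: rate per site per year, time in years.
mean_rate, mean_time, var_time = 1e-9, 6e6, (1e6) ** 2

constant_rate = divergence_variance(mean_rate, 0.0, mean_time, var_time)
variable_rate = divergence_variance(mean_rate, (2e-10) ** 2, mean_time, var_time)

# If all of the among-locus variance is attributed to coalescence time,
# the ancestral diversity is overstated by roughly this factor.
inflation = variable_rate / constant_rate
print(f"variance inflated {inflation:.2f}x by rate heterogeneity")
```

    With a constant rate, the variance in divergence is just the squared rate times the variance in time. Any rate variance adds to it, so reading the total as time variance makes the ancestral population look larger than it was.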

    A new paper by Svitlana Tyekucheva and colleagues looks at human and macaque genomes to find patterns underlying the variance in mutation rates among regions of the genome. They find that a number of factors may cause such variations, including chemical factors like the CG content of the genome, functional causes such as male versus female rates of recombination, and large-scale structural causes such as telomeric proximity:

    While a complete understanding of all biological mechanisms leading to variation in neutral substitution rates across the genome remains elusive, it is plausible that at least some of these mechanisms are conserved over relatively long evolutionary distances. For instance, both mouse-specific and rat-specific substitution rates are positively correlated with rodent-primate substitution rates [14], suggesting shared mechanisms persisting over ca. 90 million years [15]. Additionally, a positive correlation exists in substitution rates of homologous X- and Y-chromosomal introns that diverged from each other ca. 100 million years ago [16] (Tyekucheva et al. 2008: R76).

    Their finding that male recombination is an important contributor to mutation rate heterogeneity puts the focus on the X chromosome -- which has little recombination in males -- as unusual. X versus autosomal position did not explain a large fraction of the variance in this study (only around 2 percent, controlling for other factors) but the deviation was in the right direction to help account for the low X chromosome divergence between humans and chimpanzees.

    Altogether in this study, a large fraction of the variability in human-macaque substitution rates could be explained by phenomena that affect the rate of mutations, including the structural and functional factors listed above as well as the corresponding variability at homologous regions between mice and rats, and dogs and cattle. If these variations were explained by inbreeding in the human-macaque ancestral species, they would be random with respect to the dog-cow or mouse-rat divergences, and with respect to structural causes. So current estimates of the effective sizes of human-chimpanzee and other ancestral populations are almost certainly inflated. The amount of inflation is not clear, but a good estimate will require correcting for a large number of factors -- a complicated analysis.

    Since the date of the human-chimpanzee divergence depends on our assessment of the diversity within the human-chimpanzee ancestral population, it may be a while before we can settle the issue of human-chimpanzee divergence time. That may or may not provide hope for Sahelanthropus, Orrorin, and Ardipithecus kadabba -- all supposed hominids that would predate 5 million years ago, the current best genetic estimate of the human-chimpanzee divergence time. To be sure, if the date is simply in error, that error might encompass older dates consistent with a 7-million-year divergence. But I'm not sure we should believe that the error is biased toward an older divergence -- "error" might lean in either direction, and a younger species divergence remains possible.

    References:

    Tyekucheva S, Makova KD, Karro JE, Hardison RC, Miller W, Chiaromonte F. 2008. Human-macaque comparisons illuminate variation in neutral substitution rates. Genome Biol 9:R76. doi:10.1186/gb-2008-9-4-r76

  • Evolution of the monkeyflowers

    Fri, 2008-04-18 23:34 -- John Hawks

    Spring has finally come to us here in the North, and it's time to start thinking about planting. So, when I went to a seminar yesterday by John Willis, it was with dual motives.

    Naturally, I was interested in hearing about his work relating the evolutionary ecology of Mimulus species to their genomics. As Willis and his many former and current lab members made clear in a recent review article in Heredity, monkeyflowers have become a really interesting model system for studying the dynamics of natural selection on genomes -- particularly, with relation to local ecological adaptation, and also with relation to speciation.

    But I was also thinking about whether I could find a nice flower variety for my garden. I'm not particularly excited about peas, and I tolerate Arabidopsis when it comes up, but let's face it, it's not exactly a show flower. I'd love to get one of the prettier hawkweeds going (these have eponymical appeal as well as botanical interest) but the common ones are pretty boring.

    Well, Willis's lab has been a center of development for Mimulus genetics. They have developed a store of SNPs and other markers (available at the Mimulus evolution website) for QTL mapping, and are using them to find genes responsible for ecological adaptations in different wild Mimulus populations. In the talk, Willis featured some of his collaborators' work finding genes involved in wet versus dry habitat adaptations and in early versus late flowering. These traits are connected to each other, as well as to other life history traits, plant size and flower size.

    I left having my prior belief abundantly confirmed: botany is awesome. I mean, think about it. You can go outside, in your own neighborhood, and study biology. You can uproot your subjects and transplant them somewhere else, to watch how well they do. If they die, well, that's a data point, not an ethical emergency! Worried about gene-environment interactions? No problem, just put samples of all your subjects in the same greenhouse and wait. Need to isolate a QTL against a uniform genetic background? Cool, just repeatedly backcross it into an inbred line for a few generations, selecting for the trait each time. Want to study genetic correlations? Well, you can breed a thousand plants and select for any trait you want!

    Oh, and if you want to, you can clone them.

    Let's look at an example, from the Heredity review:

    Recent work on floral evolution demonstrates that fundamental evolutionary questions can be addressed in Mimulus through the combination of field experiments and modern genomic approaches. Bradshaw et al. (1995, 1998) pioneered the application of genome mapping to study of ecologically important traits in Mimulus using RAPD and allozyme markers to map floral QTLs underlying the divergence between red-flowered, hummingbird-pollinated M. cardinalis and pink-flowered, bee-pollinated M. lewisii. The initial mapping experiments, with hybrid phenotypes measured in controlled greenhouse environments, revealed QTLs with major effects on virtually every floral character studied, from coloration and morphology to nectar production. To determine the effect of these QTLs on pollinator visitation and discrimination, Schemske and Bradshaw (1999) moved the genotyped hybrids to a field site near one of the few regions where the species coexist, and observed bee and hummingbird visitation behavior. Amazingly, the M. cardinalis allele at a single QTL, YELLOW UPPER (YUP), was responsible for an 80% loss of visitation by bee pollinators, and the M. cardinalis allele at a QTL responsible for variation in nectar production doubled hummingbird visitation (Schemske and Bradshaw, 1999). Bradshaw and Schemske (2003) subsequently created near-isogenic lines (NILs), where heterospecific alleles at YUP were reciprocally introgressed into the parental genetic backgrounds, and evaluated the response of pollinators to the NILs in the field. They observed an even clearer pattern of pollinator discrimination due to this locus, with a 74-fold increase in bee visitation in M. cardinalis NILs that carried the M. lewisii YUP allele, and a 68-fold increase in hummingbird visitation in M. lewisii NILs with the M. cardinalis YUP allele. Although the ecological context, in this case the community of potential pollinators, is certainly important to the evolution of new pollinator associations, these results also demonstrate that single genomic regions can have a large effect on major evolutionary transitions (Wu et al. 2008: 224-225).

    The talk was mostly focused on the Mimulus guttatus complex, where some of the most pressing issues are life history, drought tolerance, and tolerance of high mineral concentrations, such as salt or copper. They were able to trace many QTLs of small effect underlying the major differences in life history and moisture requirements among ecogeographic races of M. guttatus, to show that the within-population variation for these traits is caused by high-frequency (likely balanced) alleles rather than mutation-selection balance or rare alleles, and to find the correlated responses to selection of different plant traits based on different QTLs.

    With respect to the genetics of speciation and ecogeographic race formation, they are helped by a long history of research on Mimulus. For example:

    Macnair and Christie (1983) performed the first direct genetic analysis of hybrid incompatibilities in Mimulus. While studying the genetic basis of copper tolerance in California populations of M. guttatus, they noticed that some crosses between plants from the copper mines and certain other populations resulted in F1s that died as young seedlings. Further crossing studies revealed that the F1 lethality was caused by a deleterious epistatic interaction between the copper tolerance allele from the mine populations (or a gene tightly linked to it) and alleles at an unknown number of different loci from the other populations. Such deleterious interlocus interactions, usually referred to as Dobzhansky–Muller (D-M) incompatibilities, are thought to be the major cause of low hybrid fitness in plants and animals (reviewed in Coyne and Orr, 2004). Remarkably, it appeared that natural selection for copper tolerance had indirectly resulted in the evolutionary origin of the hybrid incompatibility (Wu et al. 2008:226).

    So yes, say what you want, botany is awesome. Plus, there's one more thing: I sat through an entire lecture about natural selection and ecological differentiation of species and races, and never once heard the word, "bottleneck." It was like traveling to some kind of bizarro world where biologists still read Darwin!

    So we come down to the really difficult question: which variety am I going to plant? Mimulus glabratus is native here in Wisconsin, including Dane County, but it is not very showy, and prefers wet habitat. That makes it a poor fit for my native plant patch, which is dry/mesic, and which I never water unless the black-eyed Susans and bee balms start to wilt. Mimulus ringens is prettier, with bigger, lavender flowers, but also likes it wet.

    I guess I'll have to keep looking. M. lewisii is a pretty variant, if I can find a good source for it, and I can keep it in one of the wetter corners of the yard. I would try for M. cardinalis, since we have hummingbirds sometimes, but I'd like to get Lobelia cardinalis going also, and it's a lot easier to find. Besides, it hardly looks like a monkey!

    References:

    Wu CA, Lowry DB, Cooley AM, Wright KM, Lee YW, Willis JH. 2008. Mimulus is an emerging model system for the integration of ecological and genomic studies. Heredity 100:220-230. doi:10.1038/sj.hdy.6801018

  • Probing for the alien within

    Sat, 2008-03-29 14:33 -- John Hawks

    Laura MacConaill and Matthew Meyerson present a cool short review in Nature Genetics of metagenomics applications in pathogen discovery.

    The basic principle is to extract DNA from a tumor or sore, do intensive sequencing of all the DNA in it, and use the computers to subtract out everything human. What's left after you subtract out the human DNA is any pathogen that might be in the sample:

    The two recent studies combined computational subtraction with microreactor-based pyrosequencing to identify viral signatures associated with human disease. Feng et al. used high-throughput pyrosequencing [15] and comparison to the human transcriptome to identify a viral sequence in a library of cDNAs generated from individuals with Merkel cell carcinoma, a rare but aggressive human skin cancer. The authors sequenced over 395,000 reads of 150-200 bp in length. After digital transcriptome subtraction, 2,395 sequences remained. Among these, conceptual translation of one sequence showed similarity to a polyomavirus. By cloning the complete viral genome and carrying out further analyses, the authors found that the Merkel cell polyomavirus sequence was present in eight of ten Merkel cell carcinomas.

    A second group used the same high-throughput DNA sequencing technology to identify a previously undiscovered arenavirus that likely caused the deaths of three transplant recipients who all received organs from a single donor.
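    The subtraction idea itself is simple enough to sketch as a toy. This is not what either group ran: real pipelines align reads against the full human genome or transcriptome with dedicated tools. Here I assume that a read counts as "human" if it shares any exact k-mer with a host sequence. The function names, the value of k, and the sequences are all invented.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def subtract_host_reads(reads, host_sequences, k=8):
    """Keep only the reads that share no k-mer with any host sequence."""
    host_index = set()
    for host in host_sequences:
        host_index |= kmers(host, k)
    return [read for read in reads if not (kmers(read, k) & host_index)]


# Toy data, invented for illustration.
host = ["ACGTACGTTAGCCGATAGGCTTAACCGGATCG"]
reads = [
    "TACGTTAGCCGATAGG",    # a substring of the host sequence, so it is subtracted
    "GGGTTTCCCAAATTTGGG",  # shares no 8-mer with the host, so it survives
]
leftover = subtract_host_reads(reads, host)
print(leftover)
```

    Whatever is left over is the candidate list; it still has to be compared against known pathogen sequences, as in the conceptual translation step described in the quote.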

    I don't know if sequencing will ever get so cheap that this will become a practical diagnostic method, but it really doesn't need to. As soon as you suspect a pathogen, you can probe directly for that pathogen's DNA in a sample -- and there's no barrier to testing for hundreds of pathogens at once. Heck, there ought to be a SNP chip for it.

    But this is a potentially important way of identifying new pathogens in unknown samples from scratch. The article mentions that the current cost of this kind of sequencing is around $10,000 per sample, and that is rapidly falling. For that cost, you get the sequence on your computer, even if you can't identify it yet, and who knows -- it might pop up two years later when somebody else finds it in some unexpected place.

    References:

    MacConaill L, Meyerson M. 2008. Adding pathogens by genomic subtraction. Nat Genet 40:380-382. doi:10.1038/ng0408-380

  • Heritability review

    Tue, 2008-03-18 16:52 -- John Hawks

    Peter Visscher and colleagues present a long review paper on the concept and use of heritability in the current Nature Reviews Genetics.

    Heritability allows a comparison of the relative importance of genes and environment to the variation of traits within and across populations. The concept of heritability and its definition as an estimable, dimensionless population parameter was introduced by Sewall Wright and Ronald Fisher nearly a century ago. Despite continuous misunderstandings and controversies over its use and application, heritability remains key to the response to selection in evolutionary biology and agriculture, and to the prediction of disease risk in medicine. Recent reports of substantial heritability for gene expression and new estimation methods using marker data highlight the relevance of heritability in the genomics era.

    There's nothing particularly new here -- the "genomics" in the title doesn't amount to much beyond a discussion of how to estimate heritability from SNP-inferred relationships instead of pedigrees. But much that is old is worthwhile.

    It reads like twelve pages out of Falconer -- if Falconer were in a new edition -- and if you don't have Falconer, well, you might do well to read these twelve pages. They include a box about the "heritability of IQ controversy" as well as a discussion of the basic mystery about heritability in natural populations -- why should additive genetic variance be as high as it is?
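    For anyone without Falconer at hand, the link between heritability and the response to selection mentioned in the abstract is the breeder's equation, R = h²S. A minimal sketch, with made-up variances and selection differential:

```python
def narrow_sense_heritability(additive_variance, phenotypic_variance):
    """h^2: the fraction of phenotypic variance that is additive genetic variance."""
    return additive_variance / phenotypic_variance


def response_to_selection(h2, selection_differential):
    """Breeder's equation: expected change in the trait mean, R = h^2 * S."""
    return h2 * selection_differential


# Hypothetical numbers, in arbitrary trait units.
h2 = narrow_sense_heritability(additive_variance=2.0, phenotypic_variance=8.0)
response = response_to_selection(h2, selection_differential=4.0)
print(f"h2 = {h2}, expected response = {response}")
```

    Because h² is a ratio of variances in one population and environment, the same trait can give a different value, and so a different predicted response, somewhere else.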

    References:

    Visscher PM, Hill WG, Wray NR. 2008. Heritability in the genomics era -- concepts and misconceptions. Nature Rev Genet 9:255-266. doi:10.1038/nrg2322

  • Why have variants influencing recombination rate been selected in non-Africans?

    Sat, 2008-03-08 10:47 -- John Hawks

    A complicated story is tangled through this paper by Augustine Kong and colleagues, and I don't see where it may end. But here's the abstract:

    The genome-wide recombination rate varies between individuals, but the mechanism controlling this variation in humans has remained elusive. A genome-wide search identified sequence variants in the 4p16.3 region correlated with recombination rate in both males and females. These variants are located in the RNF212 gene, a putative ortholog of the ZHP-3 gene that is essential for recombinations and chiasma formation in Caenorhabditis elegans. It is noteworthy that the haplotype formed by two single-nucleotide polymorphisms (SNPs) associated with the highest recombination rate in males is associated with a low recombination rate in females. Consequently, if the frequency of the haplotype changes, the average recombination rate will increase for one sex and decrease for the other, but the sex-averaged recombination rate of the population can stay relatively constant.

    Perhaps it's not so curious that alleles of this gene have opposite effects on recombination in males and females. The mechanisms of gamete production are obviously different in the two sexes, and we might expect some kind of frequency-dependent mechanism to regulate recombination. At least, it's a hypothesis.

    What I find mysterious is this:

    A phylogenetic analysis of a 55-kb region containing rs3796619 and rs1670533 in the HapMap data (24) revealed three well-differentiated clusters of haplotypes showing notable differences in frequency between the Yoruban Nigerians (YRI) and CEU and East Asians (CHB and JPT) (fig. S6). The [C,T] and [T,C] haplotypes that associate most strongly with recombination rate have a combined frequency of only 17% in the YRI sample, but reach a frequency of 91% and 98% in the CEU and East Asian samples, respectively. Several SNPs in this region show an unusual degree of divergence among the HapMap groups, on the basis of the rank percentile of their FST values (Wright's coefficient, a measure of variance in allele frequencies among populations) among all autosomal SNPs with the same overall frequency in the HapMap. Specifically, we identified eight SNPs whose FST values are in the top 0.5% for differences between the YRI and East Asian HapMap samples and also in the top 5% of differences between the YRI and CEU samples. Each of these SNPs differentiated a subset of [T,T] haplotypes from the rest, perhaps indicating an episode of positive selection (or a severe founder effect) that increased the frequency of [C,T] and [T,C] haplotypes in the ancestors of European and East Asian populations.

    The [C,T] and [T,C] haplotypes are the ones associated with increased recombination rate in males and females, respectively. The markers are in strong disequilibrium (no [C,C] haplotypes were observed), and seem to have been selected outside of Africa.

    I have no idea why.

    The recombination rates were all inferred from a large Icelandic sample, so maybe the rates don't really characterize the haplotypes in other populations. Maybe recombination rate is incidental to the real reason for the selection. Or maybe in populations roaring with positive selection on many genes at once, it is a good thing to break them apart more often.
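    To give a feel for the FST values in the quoted passage, here is a sketch of Wright's FST for a single biallelic locus in two equally weighted populations. It is not the estimator Kong and colleagues used, which ranked per-SNP FST values across the HapMap. Plugging in the quoted 17% and 98% frequencies treats the combined [C,T] and [T,C] haplotype class as if it were one allele, which is my simplification.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_populations(p1, p2):
    """Wright's FST = (H_T - H_S) / H_T for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = heterozygosity(p_bar)
    if h_total == 0:
        return 0.0  # the locus is fixed for the same allele everywhere
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2
    return (h_total - h_within) / h_total


# Frequencies from the quoted passage: YRI versus East Asian samples.
fst = fst_two_populations(0.17, 0.98)
print(f"FST = {fst:.2f}")
```

    Identical frequencies give zero and fixed differences give one, so a contrast like this one sits far out in the tail for human populations.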

    References:

    Kong A and 16 others. 2008. Sequence variants in the RNF212 gene associate with genome-wide recombination rate. Science 319:1398-1401. doi:10.1126/science.1152422

  • The future of genetics is corny

    Sat, 2008-03-08 10:05 -- John Hawks

    Elizabeth Pennisi's story about maize genomics is a good reminder of why biology will continue to grow in importance for our understanding of human history:

    With $9.1 million from the Mexican government, Jean-Philippe Vielle-Calzada of the National Laboratory of Genomics for Biodiversity in Irapuato and his colleagues have decoded a native "popcorn" strain grown at elevations above 2000 meters. Although still in more than 100,000 pieces, the sequence has revealed many new genes, he reported. This variety's genome "will be of tremendous value in terms of understanding the evolution of [maize] domestication," he says.

    Oh, and if you're interested in biology, consider the potential experiments from this:

    Another resource introduced at the meeting will help ... sort out how genes interact. The agribusiness giant Syngenta announced it was making available 7500 lines of corn, each representing a B73 genome with a single piece of DNA bred into it from one of the 25 strains of the Maize Diversity Project. Taken together, the lines incorporate all the genetic diversity of those strains but make it easier to understand the activity of particular genes. The community has long awaited these tools, says Brutnell: "They are really going to revolutionize the way we do genetics."

    I'd say. Imagine 7500 twins, all identical except for a unique piece of DNA spliced in from some other person. Except with corn, it's not 7500 twins, it's 7500 experimental plots full of twins. Now, see what they all do!

    References:

    Pennisi E. 2008. Corn genomics pop wide open. Science 319:1333. doi:10.1126/science.319.5868.1333
