Hazards of associating genetic and morphological changes

Last year, Stedman et al. (2004) presented an analysis of the evolution of the MYH16 gene in humans, which concluded that a mutation deactivating the gene was fixed in the human lineage around 2.4 million years ago. An accompanying editorial by Pete Currie summarizes the story spun from this mutation:

The particular gene in question, MYH16, is specifically expressed in the jaw muscles of humans and monkeys. But, surprisingly, a mutation in the human gene prevents the accumulation of MYH16 protein. Stedman et al. found that, by contrast, all non-human primates for which genome sequence could be obtained have an intact copy of the gene, and have a high level of MYH16 protein in their jaw muscles. An analysis of the time at which the mutation arose during hominid evolution places it at about 2.4 million years ago, the period just before the evolution of the modern hominid cranial form. These findings suggest a seductive hypothesis: that a decrease in jaw-muscle size, produced by inactivation of MYH16, removed a barrier to the remodelling of the hominid cranium which consequently allowed an increase in the size of the brain (Currie 2004:373).

Two essential facts suggest the hypothesis that MYH16 evolved in association with the appearance of Homo: the estimated date for the frameshift mutation is 2.4 million years ago, and the gene is transcribed only in the muscles of the head, "specifically those derived from the embryonic first pharyngeal arch, including temporalis and tensor veli palatini."

Since then, some question has arisen about the date.

Paleoanthropology has a strange relationship with dates. On the one hand, new dates for fossils or archaeological sites are often put to rigorous criticism. We are rightly critical of dates, because they are so important to putting our evidence into sequence. Dates have power.

On the other hand, two different things having the same date are "seductive". In paleoanthropology, mere coincidence may always be a possibility, but it doesn't drive any headlines. Same date, same cause.

Of course, no dates from the past are really the same. They just have overlapping confidence intervals. And few things are worse for the estimation of confidence intervals than genes. Stedman et al. (2004) gave the MYH16 deactivating mutation a confidence interval of +/- 300,000 years. Anywhere in that range of dates is potentially associated with the appearance of Homo, since we really don't know when Homo originated any more precisely than sometime between about 3 million and 2 million years ago. The "seductive" part of the hypothesis is that the mean date estimate for the gene, 2.4 million years, is the same as the estimate for AL 666-1, which is a plausible candidate for the first fossil evidence of Homo. But the date doesn't have to be the same for the gene and genus to be associated -- there is much uncertainty about both.

To some extent, this raises a question about dates in paleoanthropology. Two things that are plausibly causally associated might still differ in date by as much as several hundred thousand years. So how are we to resist the hypothesis that two events with the same date are causally associated? We may never confirm the hypothesis that two events actually do have the same date -- there is no statistical test for "significantly the same", just "not significantly different".

There are two ways to test the hypothesis that two events are causally related. One is to show that the causal link is impossible. In the case of MYH16, the proposed causal link makes a lot of sense, at least with respect to jaw muscle function.

The other test is to show that the dates really are significantly different. For MYH16, that is precisely the approach taken by Perry et al. (2005):

We describe the pattern of molecular evolution at a sarcomeric myosin gene, MYH16, using more than 30,000 bp of exon and intron sequence data from the chimpanzee and human genome sequencing projects to evaluate the timing and consequences of a human lineage-specific frameshift deletion. We estimate the age of the deletion at approximately 5.3 MYA. This estimate is consistent with the time of human and chimpanzee divergence and is significantly older than the first appearance of the genus Homo in the fossil record. We also find conflicting estimates of nonsynonymous fixation rates (dN) across different regions of this gene, revealing a complex pattern inconsistent with a simple model of pseudogene evolution for human MYH16.

The date estimate here comes from an assumption about what happens when the gene is deactivated. The downstream part of the gene was no longer functional after the frameshift deletion mutation happened. It should have evolved neutrally after the mutation, but not before. And Perry et al. found that there was only one nonsynonymous substitution on the chimpanzee lineage, but 16 on the human lineage. The reason the human lineage has more is assumed to be the absence of purifying selection against these substitutions in the period of time after the deletion.

The supplementary information for Stedman et al. (2004) puts the logic thus:

Briefly, the assumption is made that non-synonymous mutations are selected against until the gene is inactivated, thereafter mutations at both synonymous and non-synonymous sites accumulate at the neutral mutation rate. Quantification of lineage-specific mutation rates at synonymous and non-synonymous sites remote from the inactivating deletion provides the information necessary for the calculation.

So to find the date of the deactivation, you assume that the substitution rate at synonymous and nonsynonymous sites after the deactivation was the same, and solve for the date that yields the observed ratio of nonsynonymous to synonymous substitutions on the human lineage. The technique is simple, and was taken from Chao et al. (2002). That study used two different sources of evidence for dating the deactivation of the CMAH gene -- the gene sequence of the Alu insertion that deactivated it, and the number of nonsynonymous mutations in the newly-minted pseudogene. Both those approaches led to the same date. The latter approach -- the one also used for the MYH16 gene -- has no confidence interval. I may write more about that paper later, because it is interesting for several reasons, but at the moment it helps me very little.
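One way to write that logic down, as a minimal sketch and not necessarily the exact calculation in either paper: suppose nonsynonymous sites evolve at a fraction f of the neutral rate while the gene is functional, and at the full neutral rate afterward. If T is the human-chimpanzee divergence time and t is the age of the deactivation, the ratio of nonsynonymous to synonymous rates on the human lineage is omega = f(1 - t/T) + t/T, so t = T(omega - f)/(1 - f). The function name and the input numbers below are hypothetical, chosen only to show the arithmetic.

```python
def inactivation_age(omega_lineage, omega_constrained, divergence_time):
    """Solve omega_lineage = f*(1 - t/T) + t/T for t.

    omega_lineage     -- nonsynonymous/synonymous rate ratio on the lineage
                         that carries the deactivated gene
    omega_constrained -- the same ratio under purifying selection (f), for
                         example estimated from a lineage with an intact gene
    divergence_time   -- assumed divergence time T, in millions of years
    """
    f = omega_constrained
    if not 0 <= f < 1:
        raise ValueError("omega_constrained must be in [0, 1)")
    t = divergence_time * (omega_lineage - f) / (1 - f)
    # The model only makes sense for ages between 0 and T, so clamp to that range.
    return min(max(t, 0.0), divergence_time)


# Hypothetical inputs, NOT values taken from Stedman et al. or Perry et al.
print(inactivation_age(omega_lineage=0.5, omega_constrained=0.1, divergence_time=6.0))
```

Notice that the answer depends directly on the assumed divergence time and on both rate ratios, each of which carries its own error.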

You see, I have two questions that remain unanswered: (1) where does the confidence interval in Stedman et al. (2004) come from, and (2) what factors of uncertainty does that confidence interval leave out?

As far as where the confidence interval comes from, I'm afraid I am left with no clue. Neither the paper nor the supplementary information of Stedman et al. (2004) tells how a confidence interval on this estimate is derived, other than citing Chao et al. (2002), who don't report a confidence interval at all for this method.

Now as far as the second question, I'm wondering where the additional uncertainty may be in the estimate, because the estimate given by Perry et al. (2005) is so different. Consider:

Based on a 6-Myr divergence of the human-chimpanzee lineages (Haile-Selassie 2001; Brunet et al. 2002) and 15 nonsynonymous human lineage substitutions, we estimate the age of the exon 18 deletion at 5.3 ± 1.0 MYA. Similar to Stedman et al. (2004), our confidence interval incorporates standard errors involving a 5 to 7 MYA range for human-chimpanzee lineage divergence as well as the genome-wide estimate of human-chimpanzee silent site nucleotide divergence (Yi, Ellsworth, and Li 2002). This age estimate is not only outside the confidence interval of the 2.4 ± 0.3 MYA estimate obtained by Stedman et al. (2004) and significantly older than the first appearance of Homo in the fossil record but also consistent with an origin around the time that human and chimpanzee lineages diverged (Perry et al. 2005, emphasis added).

Reading this another way -- the "confidence interval" in the first analysis did not actually include all the uncertainty in the estimate. If it had included all the uncertainty, then the confidence interval should have been so wide as to include the later estimate, based on "better" data. So the true range of error was actually much larger than reported.
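The arithmetic is simple enough to check. The sketch below treats each reported estimate as a symmetric range (the mean plus or minus the stated error), which is a simplification, since neither paper says exactly what its interval represents. The helper names are mine.

```python
def interval(mean, half_width):
    """Return (low, high) for a symmetric range around a point estimate."""
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


stedman = interval(2.4, 0.3)  # Stedman et al. (2004), millions of years
perry = interval(5.3, 1.0)    # Perry et al. (2005), millions of years

print(overlaps(stedman, perry))

# Half-width the first interval would need in order to reach the later point estimate.
needed = abs(5.3 - 2.4)
print(round(needed, 1))
```

The two ranges do not overlap, and the first interval would need a half-width of roughly 2.9 million years, almost ten times the reported 0.3, to reach the later point estimate.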

This is a major underestimated problem with associating any event with genetic changes. Nobody ever reports estimates of confidence intervals that account for these kinds of sampling errors. It's rare enough that we get any kind of confidence interval at all. For the most part, geneticists just don't know what the error from sampling could possibly be for any given dataset. There are just too many factors that might affect it, from population structure to the recombination rate to the timing of selection.

What about the estimate given by Perry et al. (2005)? Is it right? Is its confidence interval accurate? At least as far as the confidence interval is concerned, the paper spells out what is included:

Similar to Stedman et al. (2004), our confidence interval incorporates standard errors involving a 5 to 7 MYA range for human-chimpanzee lineage divergence as well as the genome-wide estimate of human-chimpanzee silent site nucleotide divergence (Yi, Ellsworth, and Li 2002).

In other words, the 1 million years on either side is a safety margin based on the fact that the estimate of human-chimpanzee divergence date has a million years of uncertainty either way. Other sources of uncertainty, such as the stochastic nature of drift, or possible error in the assumption that distant silent sites and downstream nonsynonymous sites have the same mutation rate, etc., are not included.
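To see how much of that margin could come from the divergence date alone, here is a rough sketch. It assumes the age estimate is a fixed fraction of the assumed divergence date, so that moving the divergence date from 6 to 5 or 7 million years simply rescales the estimate. That is my simplifying assumption, not a calculation reported by Perry et al. (2005), and the function name is hypothetical.

```python
def rescale_age(age, assumed_divergence, new_divergence):
    """Rescale an age estimate, assuming it is proportional to the divergence date."""
    return age * new_divergence / assumed_divergence


# 5.3 million years was estimated with a 6-million-year divergence date.
low = rescale_age(5.3, 6.0, 5.0)
high = rescale_age(5.3, 6.0, 7.0)
print(f"{low:.2f} to {high:.2f} million years")
```

Under that assumption the range runs from about 4.4 to 6.2 million years, close to the reported 5.3 ± 1.0, which fits the reading that the interval mostly reflects the divergence-date uncertainty and little else.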

I'm bothered by the functional part of Stedman et al. (2004), though. If this gene was deactivated in hominids 5.3 million years ago, and the gene is only expressed in muscles of the skull, including temporalis, then why did hominids get massively larger jaw muscles starting after 5.3 million years ago?

I'm waiting to be seduced by some explanation here. Anyone?

References:

Chao H-H. et al. 2002. Inactivation of CMP-N-acetylneuraminic acid hydroxylase occurred prior to brain expansion during human evolution. Proc Nat Acad Sci USA 99:11736-11741. Full text (free)

Currie P. 2004. Muscling in on hominid evolution. Nature 428:373-374. Full text (subscription)

Perry GH, Verrelli BC, Stone AC. 2005. Comparative analyses reveal a complex history of molecular evolution for human MYH16. Mol Biol Evol 22:379-382. Full text (free)

Stedman HH, Kozyak BW, Nelson A, Thesier DM, Su LT, Low DW, Bridges CR, Shrager JB, Minugh-Purvis N, Mitchell MA. 2004. Myosin gene mutation correlates with anatomical changes in the human lineage. Nature 428:415-418. Full text (subscription)