Mutual information between strings of loci

Fourth in a series on mutual information and genetic linkage. If you’re happening upon it for the first time, you can find the entire series, starting with the first post, “Information theory: a short introduction”.

After the last post, you might wonder what the big deal is about these information theoretic measures of linkage. After all, we’ve got lots of other measures of linkage to choose from in population genetics, with many years of theory behind them. The basic conclusion about genetic drift was that it adds mutual information to samples over short regions, but that recombination washes it out over longer distances. If the net effect is no linkage, why would we bother to come up with some non-standard linkage measure?

One answer: If the existing linkage measures were so great for testing neutrality, then we might expect some of the recent genome-wide selection scans to have used them. But they didn’t – instead we have several partially incompatible methods, all of which eschew the usual measures of linkage.

I’m not going to summarize all the reasons for this, at least not yet. But there is one thing that the current positive selection tests have in common: they all attempt to quantify the decay of linkage across long strings of loci. Each of them—LRH, iHS, LDD—distills into a single number the pattern of reduction of linkage across physical distance.

After the last post, that may seem odd. If we know the physical distance (or more to the point, the map distance), we can predict the amount of linkage under drift. That really ought to be enough to test neutrality. Yet the existing tests insist on a second dimension: the rate of decline of linkage. Partly, that’s because these tests try to separate positive selection from other things that violate neutrality, like inversions. But it also reflects a real limitation of the methods: they ignore some of the information available in the data, and thereby limit their power to test the hypothesis of genetic drift based on linkage. So they must turn to the rate of linkage decay as a test of drift.

The trouble with homozygosity

Massive genotyping projects are not sequencing projects. They provide long lists of genotypes, not sequences. If you’ve got a long list of genotypes, one of the measures of variation immediately available to you is homozygosity. Homozygosity among multiple sites, or joint homozygosity, is not mathematically the same as the mutual information between sites, but it also is increased by linkage. To the extent that genetic drift predicts the distribution of long-range linkage, you can test neutrality by looking at the joint homozygosity of two or more sites. For example, the extended haplotype homozygosity (EHH) is the probability that a pair of gametes that share a single core haplotype will also share identical haplotypes at longer distances. Homozygosity itself is not a test of linkage; for that we have to combine the probabilities of identity of all alleles in some way.

Homozygosity-based tests of neutrality set up this combination as a ratio: the ratio of homozygosity for one allele or core haplotype versus the homozygosity of other alleles or haplotypes. That procedure throws away a lot of information carried by the fraction of haplotypes that are nonidentical at a distance. Since the mutual information carried by those non-identical haplotypes, unlike homozygosity, actually increases with distance, it may in some circumstances provide a more powerful test of neutrality. The use of a ratio of homozygosity also reduces the chance of detecting certain classes of non-neutral loci—most obviously, the ones that have two or more selected alleles.

To put some numbers to these ideas, let’s suppose we have two loci, A and B. The alleles of A are a1 and a2, and those of B are b1 and b2. Each of these alleles has an associated frequency, for example p(a1), and the frequency of the haplotype a1b1 will be written p(a1b1). If A and B are independent (unlinked), then the expected value of p(a1b1) is simply p(a1)p(b1).

The usual measure of linkage, D, is calculated as

D = p(a1b1)p(a2b2) - p(a1b2)p(a2b1)     (1)

It thus involves the frequencies of all four possible haplotypes. At linkage equilibrium, the two terms are expected to be the same, so that D = 0.
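To make that concrete, here is the calculation as a few lines of Python. The haplotype frequencies are made-up numbers for illustration, not data:

```python
def linkage_D(p_a1b1, p_a1b2, p_a2b1, p_a2b2):
    """D = p(a1b1)p(a2b2) - p(a1b2)p(a2b1)."""
    return p_a1b1 * p_a2b2 - p_a1b2 * p_a2b1

# At linkage equilibrium, every haplotype frequency is the product of its
# allele frequencies, so the two terms cancel:
p_a1, p_b1 = 0.8, 0.8
D_eq = linkage_D(p_a1 * p_b1, p_a1 * (1 - p_b1),
                 (1 - p_a1) * p_b1, (1 - p_a1) * (1 - p_b1))
print(D_eq)  # zero, up to floating-point noise
```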

Several other measures of linkage are in common use (and sometimes lead to confusion). There are several reasons why you might want a different measure, and one of the biggest is that D depends on the frequencies of the alleles a1 and b1—low-frequency polymorphisms will have systematically lower linkage values for the same evolutionary scenario. A very good thing about D is that it does not require us to know the frequencies of the individual alleles a1 or b1, only the frequencies of their joint haplotypes. But the expectation that the products in the equation are equal depends on a 2 × 2 contingency table. Three-, four- or more-locus haplotypes demand some other measure.

Using the same terminology, the homozygosity of the haplotype a1b1 is given as p(a1b1)². By itself, this tells us nothing at all about linkage. If we combine it with the homozygosity of one or the other allele (for example, with the difference p(a1)² - p(a1b1)²), we will have a measure of the reduction in homozygosity with distance. That reduction in homozygosity still doesn’t tell us about linkage, without considering p(b1). But if we scale the reduction by taking it as a ratio with the reduction in homozygosity of one or more other haplotypes, those other haplotypes give us indirect information about p(b1). So we have an indirect measure of linkage, based on the relative rate of linkage decay.
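Here is that bookkeeping spelled out in Python, with made-up frequencies standing in for real data:

```python
# Assumed, illustrative frequencies (not from any real locus):
p_a1, p_a1b1 = 0.8, 0.7    # core allele a1 and its extended haplotype a1b1
p_a2, p_a2b2 = 0.2, 0.1    # alternative allele a2 and its haplotype a2b2

drop_focal = p_a1 ** 2 - p_a1b1 ** 2   # reduction in homozygosity around a1
drop_other = p_a2 ** 2 - p_a2b2 ** 2   # reduction around the other allele
print(drop_focal / drop_other)  # roughly 5: the ratio-based, indirect measure
```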

Relying on the rate of decay of linkage isn’t such a bad idea if we don’t know the rate of recombination between markers. In that case, we can use the rate of decay of homozygosity of other haplotypes as a scaling factor. But there is a problem lurking: The decay of homozygosity around a marker or core haplotype depends on its frequency. Under drift, higher-frequency haplotypes decay over shorter distances. So our test will be biased in a way that depends on the frequency spectrum of haplotypes.

It seems much more direct to estimate the mutual information between A and B. We don’t need to use an indirect linkage measure when we have all the frequencies needed to calculate a direct measure. Mutual information is a useful measure because it extends easily to many loci. But it has a disadvantage: it requires us to obtain an independent estimate of the recombination fraction between our markers.
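For a pair of loci, that direct measure takes only a few lines of Python. The haplotype frequencies below are illustrative, not data:

```python
from math import log2

def mutual_information(joint):
    """Mutual information (in bits) between two loci, given a dict mapping
    (allele at A, allele at B) -> haplotype frequency."""
    pA, pB = {}, {}
    for (a, b), p in joint.items():       # accumulate marginal frequencies
        pA[a] = pA.get(a, 0.0) + p
        pB[b] = pB.get(b, 0.0) + p
    return sum(p * log2(p / (pA[a] * pB[b]))
               for (a, b), p in joint.items() if p > 0)

# Independent loci carry no mutual information:
indep = {('a1', 'b1'): 0.64, ('a1', 'b2'): 0.16,
         ('a2', 'b1'): 0.16, ('a2', 'b2'): 0.04}
print(mutual_information(indep))  # effectively zero
```

The same function extends to many loci: just let the keys pair up whole multilocus haplotypes rather than single alleles.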

Maybe that’s not such a problem. High-resolution recombination maps are available for the HapMap markers. If we use those maps, then we can test neutrality with markers at fixed map distances instead of physical distances. That ought to let us easily quantify the genome-wide fraction of mutual information that originated in a given time interval in the past. But we’ll have to remain cautious, since the map distances are worked out under the assumption of neutrality.

Why multilocus haplotypes are useful

Why do we care about many loci, when we can test neutrality using only one? Suppose we’re interested in the mutual information across a 500 kb stretch of the genome. We could pick a single marker on each end of the 500 kb, and measure the mutual information between them. That’s basically what I did in the case of drift in the last post, “When genetic drift reduces entropy”.

That isn’t slicing the sample very finely. Suppose that our two markers both have major alleles at 80 percent. That makes the expected frequency of the most common joint haplotype 64 percent, and the least common 4 percent. In our sample of 120 gametes, we’ll conclude that the two are significantly associated if those values go to around 70 and 10 percent, respectively. That’s like salting the sample with seven or eight copies of the rare haplotype.
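To see where numbers like those come from, here’s the scenario worked through with a Pearson χ2 statistic standing in for the mutual-information test. The 70/10/10/10 haplotype split is my own filler for the two intermediate haplotypes, which the text above doesn’t pin down:

```python
obs = [84, 12, 12, 12]        # a1b1, a1b2, a2b1, a2b2 counts in 120 gametes
n = sum(obs)
pA = (obs[0] + obs[1]) / n    # marginal frequency of a1 (0.8)
pB = (obs[0] + obs[2]) / n    # marginal frequency of b1 (0.8)
exp = [n * pA * pB, n * pA * (1 - pB),
       n * (1 - pA) * pB, n * (1 - pA) * (1 - pB)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(chi2)  # about 16.9, well past the 3.84 critical value for one df
```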

On the other hand, we could salt our sample with more than a dozen copies, evenly split between the two intermediate-frequency haplotypes, and never get a significant result. Even if the expected rare haplotype were completely absent, it wouldn’t be enough to reject the hypothesis of random association, much less the hypothesis of genetic drift. Natural selection can cover its tracks, if two or more linked haplotypes have both increased in frequency.

Is this likely to be a problem very often? Past some distance, positive selection loses its impact on linked haplotypes, because there’s too much recombination. Near a selected site, linkage is strong enough that there will usually be a single major haplotype hitchhiking upward. In between, there may be minor haplotypes hitchhiking more weakly, but we will generally be able to look closer to find a major haplotype. But if there are multiple haplotypes under selection, perhaps because more than one adaptive mutation has occurred, the locus may well look completely neutral.

An example in human populations may be MC1R. Harding et al. (2000) sequenced the gene in 106 Africans and 356 Europeans. They found a diverse array of amino acid mutations in Europeans that were absent in their African sample, concluding that purifying selection on skin pigmentation had prevented such mutations from becoming common in Africa. In contrast, purifying selection was relaxed in Europe, allowing several loss-of-function variants to increase in frequency. They found no statistical evidence for positive selection on these loss-of-function variants in Europe. Likewise, no other recent tests for positive selection have shown MC1R to stand out as selected.

But wait a minute. Where did all those European redheads come from?

Harding and colleagues found five different coding polymorphisms with frequencies of more than 7 percent in Europe, but completely absent in their sample of Africans. That adds up to roughly half the copies of MC1R in Europe, all derived alleles. That kind of emergence of new alleles doesn’t happen by chance, at least not in a population history like Europe’s. But none of the usual tests will reject neutrality when five different alleles have increased under selection. And individually, each of these alleles is too rare to register a rejection of neutrality, at least if we apply the usual tests.

What makes this gene so troublesome? There’s no problem estimating the frequency of any single derived allele in Europe, and it’s easy to figure out its extended homozygosity. But if we then compare that to other alleles at the same locus, well, some of them have long haplotypes, too. So none of them have an unusual pattern of LD decay compared to the others.

What to do? We could try to get a much larger genetic sample, or a large sample of individuals from a part of Europe where one of these alleles is especially common.

But since I’m lazy, I figure I’ll just consider the mutual information carried by haplotypes of multiple loci. Individually, no single allele may have a frequency high enough to reliably show linkage in a two-locus comparison. Individually, no single allele will reject neutrality by a homozygosity-based test, because the other selected haplotypes decay over very long areas, making the ratio for any single haplotype insignificant. Collectively, the selected haplotypes may carry linkage over long distances. Together they will reject neutrality, even if individually they don’t.

But first, we need to know the distribution of multilocus mutual information under neutrality.

Degrees of freedom

If you can calculate a χ2 statistic and its degrees of freedom, you can skip the next four paragraphs.

In the second post I showed that the mutual information is distributed as a χ2. That’s a statistical distribution with one parameter—the degrees of freedom. The concept of degrees of freedom is one of the trickiest definitions in statistics. Roughly, it means the number of independent factors that contribute to a single observation.

Let’s take the case of a contingency table, where we have a number of rows, r, and a number of columns, c, so that the total number of cells is r × c. If the rows and columns were independent, then the number expected in each cell would be the fraction of observations in its row, times the fraction in its column, times the total number of observations. If you take the number you observed in a cell, subtract the number you expected, square that difference, divide it by the expected value, and add up the result for every cell in the table, that’s the χ2 statistic. This statistic is high when the observed numbers are far from the expected ones. It’s low when they are close to expected.
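That recipe translates directly into code. A short sketch for any r × c table of observed counts:

```python
def chi_square(table):
    """Pearson's chi-square statistic and degrees of freedom
    for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total  # expected count
            chi2 += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

print(chi_square([[30, 10], [10, 30]]))  # (20.0, 1)
```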

For a 2 × 2 table, we can ask, how many ways are there for the χ2 statistic to be high? Any one of the cells may have an unusually large number of observations, so we might be tempted to say there are four different ways to get a high value. But it’s obvious that if one value is too high, the value in the same row and the other column must be too low. Likewise, the observed number in the other row and the same column must be too low. And if both those diagonal cells are too low, well then the observed value in the cell diagonal to our high cell must also be high. In other words, all four cells depend on each other. Push one, and you push them all. So there’s really only one axis along which the cells can vary—one degree of freedom.
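You can see the dependency directly: fix the margins of a 2 × 2 table, and a single cell determines the other three. A quick illustration, with arbitrary marginal totals:

```python
# With the margins fixed, one cell of a 2x2 table determines the other three:
row1, row2, col1 = 60, 60, 60     # fixed marginal totals (col2 = 60 too)

def fill_table(a):
    """Given the top-left cell count a, the margins force the rest."""
    return [[a, row1 - a], [col1 - a, row2 - (col1 - a)]]

print(fill_table(40))  # [[40, 20], [20, 40]] -- push one cell, push them all
```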

With a 3 × 2 table, things are a little different. We can hold one row entirely constant, and just change the proportions in the other two. If one cell has a high observed value, that might be because both of the other rows were shortchanged. Or it might be just a single row that’s low, the other being high as well. That makes two distinct ways that we could get a high χ2 statistic: two degrees of freedom. In general, there are (r - 1)(c - 1) degrees of freedom in a contingency table. Each factor is one less than the number of rows or columns because at least one row and one column must be shortchanged when a cell is higher than expected.

Degrees of freedom with comparisons of multiple haplotypes

From the previous post, you might have anticipated how we could do simulations of drift in samples of multiple markers. For the following, I’m going to examine the mutual information between two sets of three SNP markers. That is, I have a stretch of chromosome roughly 50 kb long from end to end (or more properly, 0.05 cM), with three markers tightly linked at either end. Using the same methods as the previous post, I’m applying the same three-stage population history with genetic drift only to these six markers.

Here are the results for 10,000 trials:


That doesn’t look very much like the chart that I got in the previous post, using the same population history. Here is that one:


What’s going on? That red line in the second chart is the χ2 distribution with one degree of freedom. The first chart doesn’t match that at all.

Why not? The second chart shows the mutual information between two markers, with two alleles each. That’s one degree of freedom. The first chart shows the mutual information between two sets of three markers. Three markers give the possibility of eight distinct haplotypes (2³). So we have an 8 × 8 contingency table, with 49 degrees of freedom. Here is that distribution, in red, along with the result of our 10,000 simulated samples:


Except, oops, it doesn’t look like 49 degrees of freedom either! What the heck is going on?

If you’ll remember from the second post in the series, our empirical distribution isn’t going to fit a chi-square unless we have enough observations. Sure, in theory, three markers could give us 8 haplotypes. But in practice, the three markers on each end of our 50 kb region are so tightly linked that we get many fewer haplotypes. Sometimes we get only three or four haplotypes on each end. If each set of three markers were independent of the other, they might fit a chi-square with four degrees of freedom, or nine, or six or twelve. In fact, the mutual information observed in these 10,000 simulations accords with a mixture of distributions. Here are the number of degrees of freedom in the 10,000 simulations, as a histogram:


Most of the trials have six or nine degrees of freedom, with 12 being a substantial proportion as well. No primes there, except for 2 and 3. Only a tiny fraction of trials have as many as 20 degrees of freedom between the 3-marker sets. None get anywhere close to the theoretical 49, because the three-marker sets are too tightly linked to express all possible haplotypes.

So if we want a theoretical distribution to match to our simulated one, we are going to need a mixture of different χ2 in proportion to the number represented in the simulated set. Here’s that comparison:


The red line is a mixed distribution, in which different χ2 distributions are blended in the proportions represented by the histogram above. That’s the distribution of mutual information between the two marker sets, conditioned on the haplotypes actually present in each set, under the hypothesis of independence. The simulations lie to the right of this distribution, meaning that small-sample bias and genetic drift have added mutual information to the simulations, compared to the expectation in the absence of linkage.
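One way to build that mixed distribution is by simulation: choose a degrees-of-freedom value in proportion to its share of the histogram, then draw a χ2 deviate with that many degrees of freedom. Here’s a minimal Python sketch, where the weights are stand-ins I made up, not the actual histogram proportions:

```python
import random

# Stand-in weights for the df histogram (assumed for illustration only):
df_weights = {6: 0.35, 9: 0.35, 12: 0.20, 4: 0.10}

def sample_mixture_chi2(rng):
    """Draw from the mixture: pick a df by weight, then build a chi-square
    deviate as the sum of df squared standard normal deviates."""
    df = rng.choices(list(df_weights), weights=list(df_weights.values()))[0]
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

rng = random.Random(42)
draws = [sample_mixture_chi2(rng) for _ in range(20_000)]
# A chi-square's mean equals its df, so the mixture mean should sit near
# the weighted average df (8.05 with these weights):
print(sum(draws) / len(draws))
```

Quantiles of draws like these give the conditioned null distribution to compare the simulated mutual-information values against.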

The only difference between these simulations and the ones in the previous post is the number of markers. Here, we’re looking at mutual information between two three-marker sets; there, we were looking at mutual information between two one-marker loci. Both examples used the same distance and the same population history. Genetic drift has had a similar effect—in both sets of trials, there is a slight excess of cases with high mutual information. With both, there are around twice as many at the tail as expected if the markers were independent.

This leads to a couple of conclusions:

  1. Under this population history, genetic drift is not sufficient to cause very large amounts of mutual information to be shared by distant sites.
  2. A slight alteration to the χ2 test—for example, adding some proportion to the critical values—might allow us to test the hypothesis of neutrality for any given case without the necessity of all those simulations.

If we were to add more and more markers, we would eventually find that the bias in mutual information due to small sample size would get bigger. So there’s some maximum number of markers giving useful information in any given sample. But I just picked 3 markers out of a hat. Adding more markers, up to some point, should increase our ability to see rare haplotypes that might diverge from the expectations of genetic drift.

Next: Conditional mutual information: finding the haplotypes that explain linkage disequilibrium

References

Harding RM, et al. 2000. Evidence for variable selective pressures at MC1R. Am J Hum Genet 66:1351–1361.