Information theory and mutual information between genetic loci

This is the second in a series on information theory and tests for recent selection. The first entry, "Information theory: a short introduction," reviewed the basic concepts of information measures and their background.

The International HapMap Project is a massive effort to determine the genotypes for up to 3 million single nucleotide polymorphisms (SNPs) in people from 11 population samples around the world. The current data release (Phase 3) includes genotypes for a subset of over 1.5 million SNPs in 1,115 people. The 11 population samples include people of African ancestry from the US Southwest, Utah residents of Northern and Western European ancestry, Han Chinese from Beijing, people of Chinese ancestry from Denver, people in the Houston Gujarati Indian community, Japanese people from Tokyo, Luhya and Maasai people from Kenya, people of Mexican ancestry from Los Angeles, Italians in Tuscany, and Yoruba from Ibadan, Nigeria.

As impressive as this effort is, we may wonder why exactly SNP genotyping of so many people is a valuable enterprise in itself. The project’s homepage includes this short statement:

The goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. By making this information freely available, the Project will help biomedical researchers find genes involved in disease and responses to therapeutic drugs.

There are theoretical and practical objections to this simple explanation (as I discussed here last month). However, what no one involved with the project seems to have expected is the extent to which the data would demonstrate the importance of recent adaptive evolution in human populations.

Here, I am describing some of the ways that we can test hypotheses about natural selection by using the SNP genotypes from the HapMap. This is a theory-centric description, with some digression into practical aspects of handling the genotype data. First, I consider how we might use information theoretic concepts to test the hypothesis of independence between two genetic loci.

Entropy and genotypes

The data from the HapMap consist of an array of biallelic genotypes for each population. The size of this array is different for each population; we may consider it to have m rows, each row corresponding to a single SNP locus, and n columns, each corresponding to a person. The entries are genotypes: AA, AT, CG, and so on. Each SNP is biallelic, so it hardly matters what we call the two alleles—we can arbitrarily label them a and b. Thus, the three possible genotypes may be labeled aa, ab and bb.

Of course, from a sample of genotypes, we can readily estimate the frequencies of the two alleles in the population. The sample allele frequency is p̂(a) = p̂(aa) + ½p̂(ab), where the estimate p̂(aa) is the number of aa individuals in the sample divided by the sample size n. It is worth expressing some statistical caution about estimates. Although the HapMap SNPs are biased toward common alleles, some of them are nevertheless quite rare. And as we stretch down the chromosome to consider multilocus haplotypes, many may be vanishingly rare or even unique in the population, even though they happen to occur in the sample.
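As a quick illustrative sketch (the 0/1/2 genotype coding and the sample below are my own invention, not HapMap data), the allele-frequency estimate can be computed like this:

```python
# A hypothetical sketch: genotypes coded 0 (aa), 1 (ab), 2 (bb);
# the coding and the sample below are invented for illustration.
from collections import Counter

def allele_freq(genotypes):
    """Estimate (p(a), p(b)) from a list of 0/1/2 genotype codes,
    using p(a) = p(aa) + (1/2) p(ab)."""
    n = len(genotypes)
    counts = Counter(genotypes)
    p_a = counts[0] / n + 0.5 * counts[1] / n
    return p_a, 1.0 - p_a

# A sample of 25 people with genotype frequencies 0.16 aa, 0.48 ab, 0.36 bb:
sample = [0] * 4 + [1] * 12 + [2] * 9
print(allele_freq(sample))  # approximately (0.4, 0.6)
```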

Now, how shall we estimate the entropy of a single locus? It seems there are several ways we might look at the question.

  1. Allelic entropy: We might use the entire sample of genotypes at the locus to estimate the allele frequencies. In that case, the entropy of the SNP locus would be estimated by

    Ĥ(SNP) = −[p̂(a) log p̂(a) + p̂(b) log p̂(b)]

    For a SNP locus with empirical genotype frequencies p(aa) = 0.16, p(ab) = 0.48 and p(bb) = 0.36, this estimate of allelic entropy would be 0.97 bits.

  2. Genotypic entropy: Or, we might consider the genotype frequencies as the essential elements of the system. In this case, the entropy would be estimated by

    Ĥ(SNP) = −[p̂(aa) log p̂(aa) + p̂(ab) log p̂(ab) + p̂(bb) log p̂(bb)]

    For the same genotype frequencies listed above, this genotypic entropy would be estimated as 1.46 bits.

  3. Sample entropy: Or, we might consider the sample of genotypes as a series of n repeated trials. That would make the sample entropy in the example 1.46n bits. We might also calculate a sample entropy considering the gametes instead of the genotypes, but this would work out to be the same, as long as we don't know which gametes came from which parents of heterozygotes.

To understand this last point, imagine that the sample is a deck of shuffled cards. Each card may be red with probability p(a) or black with probability p(b). If we drew cards one at a time, we could record a sequence (red, red, black,…) of 2n cards. Each card drawn thereby has an exact position in the sequence. If p(a) = 0.4 and p(b) = 0.6 as above, then this entropy will be 1.94n bits. Now, imagine instead that we draw the cards two at a time, and record only these pairs (homozygote red, homozygote black, heterozygote,…). We will have a sequence half as long, and this sequence does not include the exact position of the red and black cards, only their position as part of a pair. The entropy of this sequence of n genotypes is only 1.46n bits. This gives some insight into the nature of the sample entropy as defined above: It is the entropy of a sequence of genotypes.

By contrast, the allelic entropy is the uncertainty associated with drawing a single copy of the SNP at random from the sample. The genotypic entropy is the uncertainty associated with drawing a genotype (two SNP copies) from the sample. Again, these are empirical estimates derived from the sample; they represent the underlying population just to the extent the sample does.

Which of these estimates will be useful to us? The sample entropy tells us how many bits of hard drive space we need to store the HapMap data under an efficient coding scheme. That’s interesting, but not necessarily what we have in mind. The other two are sample estimates of an underlying population characteristic—either allele or genotype frequencies. In what follows, we will be interested in quantifying how much our uncertainty about one individual’s SNP genotype is reduced by knowing that same individual’s genotype for some other SNP. It would seem that the appropriate measure for this problem will be the genotypic entropy.
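To make the distinction concrete, here is a short sketch (using the genotype frequencies from the example above; the sample size n is an arbitrary choice of mine) computing all three estimates:

```python
# A sketch of the three estimates, using the worked example's genotype
# frequencies p(aa) = 0.16, p(ab) = 0.48, p(bb) = 0.36.
from math import log2

def H(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

p_aa, p_ab, p_bb = 0.16, 0.48, 0.36
p_a = p_aa + 0.5 * p_ab              # sample allele frequency, 0.4

allelic = H([p_a, 1 - p_a])          # entropy of a single allele draw
genotypic = H([p_aa, p_ab, p_bb])    # entropy of a genotype draw

n = 20                               # an arbitrary sample size for illustration
sample_entropy = n * genotypic       # entropy of a sequence of n genotypes

print(round(allelic, 2))    # 0.97
print(round(genotypic, 2))  # 1.46
```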

Mutual information between SNPs

The joint entropy represented by two SNP loci may be less than the sum of their individual entropies. That is to say, there may be mutual information between the two loci. Mutual information can arise between SNPs for several reasons. Most obviously, if the sample of individuals actually consists of members of two distinct populations with different allele frequencies, then two distinct SNPs may share mutual information because of this hidden population structure. This source of mutual information is exploited by admixture mapping, which takes people with recent ancestry from two or more populations and finds genomic regions where recombination has not yet had time to reshuffle haplotypes that may be locally common but globally rare. Or two loci may share mutual information because they are physically linked.

As an example of mutual information estimation, consider two sets of genotypes (the rows of the following table):

    A = 0 1 0 2 0 1 1 0 1 0 1 0 2 1 0 0 1 0 1 0
    B = 2 2 2 0 1 1 2 2 0 2 1 1 0 1 2 1 2 2 1 2

Each SNP locus has three genotypes; twenty people are represented in the sample, and each column represents the genotypes of a single individual. The empirical estimate of genotypic entropy for the first SNP locus (Ĥ(A)), based on 10 zeros, 8 ones and 2 twos, is 1.36 bits per individual. The same estimate for the second SNP locus (Ĥ(B)), based on 3 zeros, 7 ones, and 10 twos, is 1.44 bits per individual. The sum of the individual entropies for the two SNP loci is 2.80 bits. But the joint entropy of the two loci (Ĥ(A,B)), based on three (0,1), seven (0,2), one (1,0), four (1,1), three (1,2), and two (2,0) joint genotypes, is only 2.36 bits. Hence, the two SNP loci share an estimated 0.44 bits of mutual information (i.e., Î(A;B) = 0.44 bits).
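These figures are easy to verify. A minimal sketch, using the two genotype vectors above:

```python
# Recomputing the worked example directly from the genotype vectors.
from collections import Counter
from math import log2

A = [0,1,0,2,0,1,1,0,1,0,1,0,2,1,0,0,1,0,1,0]
B = [2,2,2,0,1,1,2,2,0,2,1,1,0,1,2,1,2,2,1,2]
n = len(A)

def entropy(symbols):
    """Empirical entropy in bits of a sequence of n hashable symbols."""
    counts = Counter(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())

H_A = entropy(A)                 # ~1.36 bits
H_B = entropy(B)                 # ~1.44 bits
H_AB = entropy(zip(A, B))        # joint entropy, ~2.36 bits
I_AB = H_A + H_B - H_AB          # mutual information, ~0.44 bits

print(round(I_AB, 2))  # 0.44
```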

What does this mean? Consider the following contingency table, based on the genotypes above:

            B = 0   B = 1   B = 2
    A = 0     0       3       7
    A = 1     1       4       3
    A = 2     2       0       0

None of the sampled individuals have the joint genotype (0,0), and none have (2,1) or (2,2). Even more important, a full seven have (0,2), even though only 5 would be expected if the two SNPs were independent.

The contingency table is a clue (for the statistically minded). The mutual information is telling us something like Pearson’s χ2 test of association: if we know the genotype for SNP A, we can do better than chance at predicting the genotype for SNP B.

Significance testing for mutual information

The comparison with Pearson’s χ2 test raises another obvious question: How do we test whether a given amount of mutual information is statistically significant? In the example above, we have estimated 0.44 bits of mutual information between SNP A and SNP B. Is this a lot? How much mutual information should we expect between two random samples of data?

Let’s take an even simpler contingency table: the relation between two coin flips. Each flip has a 50-50 chance of producing “heads” or “tails”. On average, we expect one fourth of our results to be (H, H), one fourth (T, T), one fourth (H, T) and one fourth (T, H). But if we perform this experiment (two coin flips) a finite number of times, we will almost always see some slight divergence from these proportions. Ten sets of coin flips can’t possibly give us one fourth of each result, because ten is not evenly divisible by four. On top of that problem, we might get six or seven (H, H) by chance, even though we expect only 2.5. Any of these cases will give us a positive, nonzero estimate of mutual information, even if there is no causal connection between the coin flips.

Another way of stating this observation is that the estimate of mutual information, Î(A;B), is biased. We should expect the estimate from a small sample to be larger than the true mutual information in the population at large.
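A small simulation makes the bias visible. This sketch (my own construction: ten paired flips of two independent fair coins per trial, with an arbitrary seed) shows the average estimated mutual information sitting well above the true value of zero:

```python
# Estimating mutual information between two independent fair coins,
# ten paired flips per trial. The true mutual information is zero,
# but the sample estimate averages well above it.
import random
from collections import Counter
from math import log2

random.seed(1)  # arbitrary seed for reproducibility

def mi(xs, ys):
    n = len(xs)
    def H(symbols):
        return -sum(c / n * log2(c / n) for c in Counter(symbols).values())
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

trials = 2000
mean_mi = sum(
    mi([random.randint(0, 1) for _ in range(10)],
       [random.randint(0, 1) for _ in range(10)])
    for _ in range(trials)
) / trials

print(mean_mi)  # positive, even though the coins are independent
```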

The analogy between mutual information and the χ2 test is more apt than it might appear. In fact, over many trials, the distribution of sample estimates of mutual information will approximate a χ2 distribution divided by a factor 2n ln 2, where n is the number of cases in the sample. Our sample of genotypes of A and B above has 20 individuals and 3 possible genotypes per locus, so 40(ln 2)Î(A;B) should be distributed as a χ2 with 4 degrees of freedom.
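For the worked example, the conversion looks like this (for 4 degrees of freedom the χ2 survival function happens to have a closed form, so no statistics library is needed):

```python
# Converting the worked example's 0.44 bits to the chi-square scale:
# the statistic 2n(ln 2) I(A;B) is approximately chi-square distributed
# with (3 - 1)(3 - 1) = 4 degrees of freedom under independence.
from math import exp, log

n = 20           # individuals in the sample
I_bits = 0.438   # estimated mutual information from the worked example

statistic = 2 * n * log(2) * I_bits

# For 4 degrees of freedom, the chi-square survival function has the
# closed form exp(-x/2) * (1 + x/2).
p_value = exp(-statistic / 2) * (1 + statistic / 2)

print(round(statistic, 1))  # 12.1
print(round(p_value, 3))    # 0.016
```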

Reflecting on the three ways we might estimate the entropy of a locus, above, this statistic is twice the mutual information calculated from the sample entropy, measured in nats instead of bits. Even though we are interested in estimating a population characteristic, the genotypic entropy, the significance of our estimate can only be evaluated by considering the entropy of the sample.

And indeed, we can perform a randomization of the two sets of genotypes and show the correspondence. Here, I’ve done ten thousand permutations of the genotypes in A with respect to those in B:

Simulation outcome

The bars are the histogram representing the 10,000 permuted samples; the curve is the χ2 density function with 4 degrees of freedom. You can see that the permutations show significant clumpiness: with only 20 sampled individuals, some combinations come up quite often while others are impossible. Also, the permutations are more biased than the χ2 would predict; there are too few small values of Î(A;B). This is another small-sample effect. Generally, the χ2 approximation fails when fewer than 5 observations are expected in a cell, and shouldn’t really be trusted with fewer than 10. Here, we have nine cells and only 20 total observations. But our value of 0.44 bits, equal to roughly 12.1 on the χ2 scale of the figure, is significant according to the (uncorrected) χ2 approximation with p = 0.016, and according to the permutation test with p = 0.014. If we are very concerned about the deviation from the χ2 distribution, we might decide to use Fisher’s Exact Test on the underlying contingency table.

With more observations, the data do tend to converge to a χ2 distribution. For example, here I have run 10,000 permutations of a sample of 10,000 individuals, for two loci, each locus with 10 alleles at equal frequencies. The curve is a χ2 distribution with 81 degrees of freedom:

Simulation outcome

Very nice.

This comparison suggests a couple of things. First, we can always do a permutation test of the hypothesis of independence. Take a sample of paired genotypes, shuffle one of them, and see if the observed pairs share higher mutual information than a large fraction (say, 95%) of the permuted sets. (Naturally, if at this point you are already thinking of a genome-wide survey, you will need to consider ways to correct for multiple comparisons.)
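The permutation test might be sketched like this, reusing the genotype vectors from the worked example (the permutation count and random seed are arbitrary choices of mine):

```python
# A sketch of the permutation test: shuffle B relative to A and count how
# often the permuted mutual information meets or exceeds the observed value.
import random
from collections import Counter
from math import log2

random.seed(42)  # arbitrary seed for reproducibility

A = [0,1,0,2,0,1,1,0,1,0,1,0,2,1,0,0,1,0,1,0]
B = [2,2,2,0,1,1,2,2,0,2,1,1,0,1,2,1,2,2,1,2]
n = len(A)

def mi(xs, ys):
    def H(symbols):
        return -sum(c / n * log2(c / n) for c in Counter(symbols).values())
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

observed = mi(A, B)
permutations = 10_000
shuffled = B[:]
hits = 0
for _ in range(permutations):
    random.shuffle(shuffled)
    if mi(A, shuffled) >= observed:
        hits += 1

p_value = hits / permutations
print(p_value)  # small; the text's permutation test gave p = 0.014
```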

Second, we can get approximate results for mutual information using a χ2 test. All we do is multiply the mutual information estimate Î(A;B) by 2n ln 2 and compare the result to the appropriate significance level of the χ2 distribution with the appropriate number of degrees of freedom. This approximation will be poor for small samples, including the HapMap samples. Again, if we were testing the hypothesis of independence in those cases, we would likely want to use Fisher’s Exact Test instead.

But in what follows, we will generally not be testing the hypothesis that two loci are independent; we will be testing the hypothesis that they are linked under neutrality. In that context, these statistical tests of independence will be useful for quickly weeding out genetic regions where linkage is negligible. Then, we can employ different tests for regions where linkage appears to be more substantial, tests that make more effective use of the properties of mutual information.

Next: Genetic drift reduces mutual information
