This is the second in a series on information theory and tests for recent selection. The first entry, "Information theory: a short introduction," reviewed the basic concepts of information measures and their background.
The International HapMap is a massive project to determine the genotypes for up to 3 million single nucleotide polymorphisms (SNPs) in samples of people from 11 populations around the world. The current data release (Phase 3) includes genotypes for a subset of over 1.5 million SNPs in 1,115 people. The 11 population samples comprise people of African ancestry from the US Southwest, Utah residents of Northern and Western European ancestry, Han Chinese from Beijing, people of Chinese ancestry from Denver, people in the Houston Gujarati Indian community, Japanese people from Tokyo, Luhya and Maasai people from Kenya, people of Mexican ancestry from Los Angeles, Italians in Tuscany, and Yoruba from Ibadan, Nigeria.
As impressive as this effort is, we may wonder why exactly SNP genotyping of so many people is a valuable enterprise in itself. The project’s homepage includes this short statement:
The goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. By making this information freely available, the Project will help biomedical researchers find genes involved in disease and responses to therapeutic drugs.
There are theoretical and practical objections to this simple explanation (as I discussed here last month). However, what no one involved with the project seems to have expected is the extent to which the data would demonstrate the importance of recent adaptive evolution in human populations.
Here, I am describing some of the ways that we can test hypotheses about natural selection by using the SNP genotypes from the HapMap. This is a theory-centric description, with some digression into practical aspects of handling the genotype data. First, I consider how we might use information theoretic concepts to test the hypothesis of independence between two genetic loci.
The data from the HapMap consist of an array of biallelic genotypes for each population. The size of this array is different for each population; we may consider it to have m rows, each row corresponding to a single SNP locus, and n columns, each corresponding to a person. The entries are genotypes: AA, AT, CG, and so on. Each SNP is biallelic, so it hardly matters what we call the two alleles—we can arbitrarily label them a and b. Thus, the three possible genotypes may be labeled aa, ab and bb.
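To make this arbitrary relabeling concrete, here is a minimal Python sketch; the helper function and the raw genotype strings are my own invention for illustration, not the actual HapMap file format:

```python
# Minimal sketch: map one SNP's genotype strings to arbitrary a/b labels.
# The raw genotypes below are invented for illustration.

def relabel(genotypes):
    """Map biallelic genotype strings (e.g. 'AT') to 'aa', 'ab', 'bb'."""
    alleles = sorted(set("".join(genotypes)))   # the two observed alleles
    code = {alleles[0]: "a", alleles[1]: "b"}   # arbitrary a/b assignment
    return ["".join(sorted(code[x] for x in g)) for g in genotypes]

print(relabel(["AA", "AT", "TT", "AT"]))        # -> ['aa', 'ab', 'bb', 'ab']
```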
Of course, from a sample of genotypes, we can readily estimate the frequencies of the two alleles in the population. The sample allele frequency p̂(a) = p̂(aa) + ½p̂(ab), where the estimate p̂(aa) is the number of aa individuals in the sample divided by the sample size n. It is worth expressing some statistical caution about estimates. Although the HapMap SNPs are biased toward common alleles, nevertheless some of them are rare indeed. And as we stretch down the chromosome to consider multilocus haplotypes, many may be vanishingly rare or even singular in the population, even though they happen to occur in the sample.
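As a quick check of this formula, here is a minimal Python sketch; the function name is my own, chosen for illustration:

```python
# Minimal sketch: estimate the sample allele frequency from genotype counts.

def allele_freq(n_aa, n_ab, n_bb):
    """Return p(a) = p(aa) + 0.5 * p(ab) from counts of the three genotypes."""
    n = n_aa + n_ab + n_bb          # sample size (number of individuals)
    return (n_aa + 0.5 * n_ab) / n  # each heterozygote carries one copy of a

# Genotype counts matching p(aa) = 0.16, p(ab) = 0.48, p(bb) = 0.36 with n = 100
print(allele_freq(16, 48, 36))      # -> 0.4
```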
Now, how shall we estimate the entropy of a single locus? It seems there are several ways we might look at the question.
- Allelic entropy: We might use the entire sample of genotypes at the locus to estimate the allele frequencies. In that case, the entropy of the SNP locus would be estimated by

  Ĥ = −p̂(a) log₂ p̂(a) − p̂(b) log₂ p̂(b)

  For a SNP locus with empirical genotype frequencies p(aa) = 0.16, p(ab) = 0.48 and p(bb) = 0.36, this estimate of allelic entropy would be 0.97 bits.
- Genotypic entropy: Or, we might consider the genotype frequencies as the essential elements of the system. In this case, the entropy would be estimated by

  Ĥ = −p̂(aa) log₂ p̂(aa) − p̂(ab) log₂ p̂(ab) − p̂(bb) log₂ p̂(bb)

  For the same genotype frequencies listed above, this genotypic entropy would be estimated as 1.46 bits.
- Sample entropy: Or, we might consider the sample of genotypes as a series of n repeated trials. That would make the sample entropy in the example 1.46n bits. We might also calculate a sample entropy considering the gametes instead of the genotypes, but this would work out to be the same, as long as we don't know which gametes came from which parents of heterozygotes. (A minimal Python sketch of all three estimates follows this list.)
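Here is that sketch. The function names are my own, chosen for illustration; the numbers reproduce the examples above:

```python
# Minimal sketch of the three entropy estimates described above.
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def allelic_entropy(p_aa, p_ab, p_bb):
    p_a = p_aa + 0.5 * p_ab               # sample allele frequency of a
    return entropy([p_a, 1.0 - p_a])

def genotypic_entropy(p_aa, p_ab, p_bb):
    return entropy([p_aa, p_ab, p_bb])

def sample_entropy(p_aa, p_ab, p_bb, n):
    """Entropy of a sequence of n independent genotypes."""
    return n * genotypic_entropy(p_aa, p_ab, p_bb)

print(allelic_entropy(0.16, 0.48, 0.36))     # ~0.97 bits
print(genotypic_entropy(0.16, 0.48, 0.36))   # ~1.46 bits
print(sample_entropy(0.16, 0.48, 0.36, 20))  # ~29.2 bits for n = 20
```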
To understand this last point, imagine that the sample is a deck of shuffled cards. Each card may be red with probability p(a) or black with probability p(b). If we drew cards one at a time, we could record a sequence (red, red, black,…) of 2n cards. Each card drawn thereby has an exact position in the sequence. If p(a) = 0.4 and p(b) = 0.6 as above, then this entropy will be 1.94n bits. Now, imagine instead that we draw the cards two at a time, and record only these pairs (homozygote red, homozygote black, heterozygote,…). We will have a sequence half as long, and this sequence does not include the exact position of the red and black cards, only their position as part of a pair. The entropy of this sequence of n genotypes is only 1.46n bits. This gives some insight into the nature of the sample entropy as defined above: It is the entropy of a sequence of genotypes.
By contrast, the allelic entropy is the uncertainty associated with drawing a single copy of the SNP at random from the sample. The genotypic entropy is the uncertainty associated with drawing a genotype (two SNP copies) from the sample. Again, these are empirical estimates derived from the sample; they represent the underlying population just to the extent the sample does.
Which of these estimates will be useful to us? The sample entropy tells us how many bits of hard drive space we need to store the HapMap data under an efficient coding scheme. That’s interesting, but not necessarily what we have in mind. The other two are sample estimates of an underlying population characteristic—either allele or genotype frequencies. In what follows, we will be interested in quantifying how much our uncertainty about one individual’s SNP genotype is reduced by knowing that same individual’s genotype for some other SNP. It would seem that the appropriate measure for this problem will be the genotypic entropy.
The joint entropy represented by two SNP loci may be less than the sum of their individual entropies. That is to say, there may be mutual information between the two loci:

I(A;B) = H(A) + H(B) − H(A,B)

Mutual information can arise between SNPs for several reasons. Most obviously, if the sample of individuals actually consists of members of two distinct populations with different allele frequencies, then two distinct SNPs may have mutual information because of this hidden population structure. This source of mutual information is exploited by admixture mapping, which takes people with recent ancestry from two or more populations and searches for genomic regions linked to a trait, exploiting the fact that little recombination has yet had time to reshuffle haplotypes that may be locally common but globally rare. Or two loci may share mutual information because they are physically linked.
As an example of mutual information estimation, consider two sets of genotypes (the rows of the following table):
A = 0 1 0 2 0 1 1 0 1 0 1 0 2 1 0 0 1 0 1 0
B = 2 2 2 0 1 1 2 2 0 2 1 1 0 1 2 1 2 2 1 2
Each SNP locus has three genotypes, coded 0, 1 and 2; twenty people are represented in the sample, and each column gives the genotypes of a single individual. The empirical estimate of genotypic entropy for the first SNP locus (Ĥ(A)), based on 10 zeros, 8 ones and 2 twos, is 1.36 bits per individual. The same estimate for the second SNP locus (Ĥ(B)), based on 3 zeros, 7 ones, and 10 twos, is 1.44 bits per individual. The sum of the individual entropies for the two SNP loci is 2.80 bits. But the joint entropy of the two loci (Ĥ(A,B)), based on three (0,1), seven (0,2), one (1,0), four (1,1), three (1,2), and two (2,0) joint genotypes, is only 2.36 bits. Hence, the two SNP loci share an estimated 0.44 bits of mutual information (that is, Î(A;B) = 0.44 bits).
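For readers who want to check these numbers, here is a minimal Python sketch that recomputes the plug-in estimates directly from the two genotype rows above; the helper name is my own, and nothing here is HapMap-specific:

```python
# Minimal sketch: empirical (plug-in) entropies and mutual information
# for the two genotype rows in the example above.
from collections import Counter
from math import log2

A = [0,1,0,2,0,1,1,0,1,0,1,0,2,1,0,0,1,0,1,0]
B = [2,2,2,0,1,1,2,2,0,2,1,1,0,1,2,1,2,2,1,2]

def entropy_of(observations):
    """Empirical entropy in bits of a sequence of outcomes."""
    n = len(observations)
    counts = Counter(observations)
    return -sum((c / n) * log2(c / n) for c in counts.values())

H_A  = entropy_of(A)                  # ~1.36 bits
H_B  = entropy_of(B)                  # ~1.44 bits
H_AB = entropy_of(list(zip(A, B)))    # joint entropy, ~2.36 bits
I_AB = H_A + H_B - H_AB               # mutual information, ~0.44 bits
print(H_A, H_B, H_AB, I_AB)
```

As with any plug-in estimate from a sample of only 20 individuals, these values carry sampling noise, which is worth remembering in light of the statistical caution expressed above.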
What does this mean? Consider the following contingency table, based on the genotypes above: