
A short introduction to information theory


I lectured this week in my Biology of Mind course about information theory, and in particular the concept of Shannon entropy. I’ve typed up a few notes for my students, and I’m cross-posting them on my own blog because they are relevant to another topic I’ll be writing about: discovery and testing of natural selection in the human genome. You see, the kind of data presently being collected as part of the International HapMap Project, single nucleotide polymorphisms (SNPs), lend themselves naturally to information theoretic measures. So first, it may help to define the essential concepts of information theory.

Many readers will have heard of the concept of entropy in connection with the thermodynamics of physical systems. Indeed, one common statement of the Second Law of Thermodynamics is that a closed system must increase in entropy over time. Entropy is a statistical characteristic of a system, related to the probabilities that the particles of a system will be found in given states at any given time. In thermodynamic terms, a system’s entropy is related to our ability to extract work from the system. The Second Law implies that work cannot endlessly be extracted from a closed system without the addition of energy from outside. By 1927, Leó Szilárd (probably my favorite physicist) had shown that the entropy of a physical system can be naturally defined in terms of information. In other words, one way of looking at entropy is in terms of uncertainty about the state of any given particle in a system, and one might apply energy to a system in order to reduce this uncertainty (for example, by concentrating particles in one part of the system).

Claude Shannon developed the concept of entropy as applied to communication systems. By doing so, he established the field of information theory. Shannon developed several fundamental theorems, including a derivation of the relation between channel capacity (bandwidth) and noise, studies of optimal encoding strategies, and a means of treating continuous as well as discrete communications. His most basic definition is that of information entropy, which has also come to be called the Shannon entropy. This definition treats entropy as a measure of our uncertainty about the state of a system—in particular, as applied to information, a system of signs.

Shannon published “A mathematical theory of communication” in 1948, describing his theoretical work. Along with this article, the mathematician Warren Weaver (formerly of UW-Madison) wrote a popular treatment of Shannon’s work, titled “Recent contributions to the mathematical theory of communication.” I mention these articles because it is very hard to improve upon them; they are clear in their exposition and notation. The two can be found together in the 1949 book, The Mathematical Theory of Communication, which has been reprinted several times. I can’t recommend this book highly enough.

Keeping that in mind, I won’t be reiterating the essentials of information theory here; I just want to give a basic understanding that can be applied to other problems—particularly with respect to SNP datasets.

Entropy and probability of signs

Suppose we have an electron in a box. Electrons have a property known as spin, which, when measured, takes one of two values; call them left and right, each equally probable for our electron. We haven’t looked in the box, so we are uncertain about the spin of the electron. How can we measure the uncertainty?

One way of looking at the question: We are just as uncertain about the electron as we would be about the flip of a fair coin. If we want to define a measure of our uncertainty, the electron and coin should yield the same value for the measurement—even though the two physical systems are in other respects totally different from each other. That is, our measure of uncertainty should not depend on the physical characteristics of the system, but only on the probabilities of the signs within the system.

If we were certain about the outcome—if we already knew what was going to happen—then the probability of one outcome would be 1 and the other would be zero. For instance, if we had a two-headed coin, we would be absolutely certain of flipping heads. Naturally, a definition of uncertainty should assign zero to the case in which we already know the outcome. But for a fair coin (with one side heads and the other tails), we have a probability of 0.5 for one outcome, and a probability of 0.5 for the other. Our measure of uncertainty should have a positive value in this case, whatever unit we may choose.

Now suppose we have a nucleotide of DNA. It may be adenine, guanine, cytosine or thymine, each with probability 0.25 (1/4). How many coin flips would give us the same amount of uncertainty? The answer is two: Two flips have four possible outcomes (0,0; 0,1; 1,0; 1,1) with equal probability (0.25) for each. Again we have an equivalence between two systems in the amount of uncertainty about the outcome. However, in this case we can see that it takes two trials of one system to attain the same uncertainty as one trial of the other system. It would seem that we should be twice as uncertain about the nucleotide as we are about a single coin flip. Indeed, three nucleotides of DNA (a codon) will give us 64 possible outcomes—the same as six coin flips.

The number of possible outcomes of a set of trials grows exponentially with the number of trials. So, for example, 5 coin flips yield 2^5 = 32 possible outcomes; 10 coin flips yield 2^10 = 1024 possible outcomes. If the coin is fair, then every outcome is equally probable, meaning that any particular sequence of 10 coin results will be observed with probability 1/1024. A consideration of this system over a slightly larger scale will give some idea of the power of encoding. With 10 coin flip results, we might choose a bed in a 1024-bed hospital at random. With 100 coin flips, we may describe a system of about 1.27 × 10^30 elements—enough to randomly choose a point on the Earth’s surface to within a millionth of an inch.
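As a quick check on this arithmetic, here is a minimal Python sketch (my own illustration, not from the original post) confirming the counts of outcomes and the number of flips needed to pick one item out of them:

```python
from math import ceil, log2

# Number of equally likely outcomes produced by n fair coin flips
print(2 ** 5, 2 ** 10)   # 32 and 1024
print(2 ** 100)          # about 1.27e30 possible outcomes from 100 flips

# Flips needed to single out one item among N equally likely items
print(ceil(log2(1024)))  # 10 flips pick one bed in a 1024-bed hospital
```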

If we are uncertain where we have hidden our microdot, a hyper-GPS could communicate its location anywhere in the world to us with a string of 100 heads or tails. The information that will remove our uncertainty is related to the logarithm of the number of outcomes. This leads to a mathematical definition of uncertainty in terms of logarithms. In particular, for a system X with possible outcomes x_1, x_2, …, x_n, the information entropy H(X) is:

$$ H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1} $$
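As a minimal sketch of equation (1) in code (the function name is my own choice, not something from the article), assuming Python:

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy H(X) in bits, given the probabilities of a system's outcomes."""
    # Outcomes with probability 0 contribute nothing to the sum
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # one nucleotide: 2.0 bits
```

Using the base-2 logarithm, discussed next, is what makes the answers come out in bits.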

The logarithm is conventionally taken as a base-2 logarithm, so that the unit of entropy is the binary digit, or bit. A single coin flip has two possible outcomes, each with probability 0.5. The equation gives us:

$$ H(\text{coin flip}) = -[\,0.5 \log 0.5 + 0.5 \log 0.5\,] = 1 \text{ bit} \tag{2} $$

This equation allows us also to handle systems in which the probabilities of different outcomes are not all equal. For example, suppose there are two outcomes, with probability 0.9 and 0.1. What is the uncertainty?

$$ H(X) = -[\,0.9 \log 0.9 + 0.1 \log 0.1\,] = 0.47 \text{ bits} \tag{3} $$
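The same kind of one-liner verifies the arithmetic of equation (3); this is an illustrative check in Python, not anything from the original article:

```python
from math import log2

probs = [0.9, 0.1]
H = -sum(p * log2(p) for p in probs)
print(round(H, 3))   # 0.469 bits, the 0.47 of equation (3)
```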

Here we are less than half as uncertain as in the case of a fair coin—and indeed, that is the point. If we had a coin that consistently gave only 10% tails, we would on average be considerably more certain about the outcome. With the right kind of encoding, we can communicate two flips of our unfair coin using, on average, only about one bit. The point for us is that a system with much less uncertainty than a coin flip can be specified with less information than a coin flip.
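To make “the right kind of encoding” concrete, here is a sketch (my own illustration in Python; Huffman coding stands in for whatever optimal code one might use) that codes pairs of flips of the 90/10 coin. Coding pairs already brings the average down from 2 bits to about 1.29 bits per pair, roughly 0.65 bits per flip; coding longer and longer blocks would approach the entropy of about 0.47 bits per flip, or about one bit for every two flips.

```python
import heapq
import itertools
from math import log2

def huffman(probs):
    """Build a prefix code (symbol -> bit string) for a dict of probabilities."""
    # Heap entries: (probability, tie-breaker, {symbol: code-so-far})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

flip = {"H": 0.9, "T": 0.1}
pairs = {a + b: flip[a] * flip[b] for a, b in itertools.product(flip, repeat=2)}

code = huffman(pairs)
avg_bits_per_flip = sum(pairs[s] * len(code[s]) for s in pairs) / 2
entropy_per_flip = -sum(p * log2(p) for p in flip.values())

print(code)                # the common 'HH' gets a 1-bit codeword, rarer pairs get 2-3 bits
print(avg_bits_per_flip)   # about 0.645 bits per flip when coding pairs
print(entropy_per_flip)    # about 0.469 bits per flip: the theoretical floor
```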

Mutual information

Now, suppose we have two distinct events. If these events are independent, then the joint entropy represented by both is simply the sum of their individual entropies:

$$ H(X,Y) = H(X) + H(Y) \tag{4} $$

Why is this? Consider two coin flips. In the combined system including both flips, we have four possible outcomes (0,0; 0,1; 1,0; 1,1). If our two flips are independent, then the probability of each combined outcome is the product of the probabilities of the individual outcomes. That is, p(0,1) = p(0)p(1), and log[p(0)p(1)] = log p(0) + log p(1). Equation 4 can then be derived from equation 1 by the algebraic manipulation spelled out below.
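Writing p(x, y) = p(x)p(y) for independent events, and using the fact that each set of probabilities sums to 1, the steps are:

$$
\begin{aligned}
H(X,Y) &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\log\big[p(x)\,p(y)\big] \\
       &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\big[\log p(x) + \log p(y)\big] \\
       &= -\sum_{x} p(x)\log p(x) \;-\; \sum_{y} p(y)\log p(y) \\
       &= H(X) + H(Y)
\end{aligned}
$$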

But if the two events are not independent—that is, if the outcome of one depends on the outcome of the other—then their joint entropy must be less than the sum of their individual entropies:

$$ H(X,Y) \leq H(X) + H(Y) \tag{5} $$

And the difference between the joint entropy and the sum of the individual entropies is a measure of the correlation between the two events. We call this difference the mutual information of the two events, and we define it mathematically:

$$ I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{6} $$
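Equation (6) translates directly into code. The sketch below (a Python illustration of my own; the function names are not from the article) computes mutual information from a joint distribution by building the two marginal distributions and applying equation (6):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal distribution of X
        py[y] = py.get(y, 0.0) + p   # marginal distribution of Y
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Two independent fair coins share no information
independent = {(a, b): 0.25 for a in "HT" for b in "HT"}
print(mutual_information(independent))   # 0.0 bits

# Two perfectly correlated coins share a full bit
correlated = {("H", "H"): 0.5, ("T", "T"): 0.5}
print(mutual_information(correlated))    # 1.0 bit
```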

Consider a game of Blackjack at a casino. Most people play the probabilities as if every card were equally likely to be dealt in every hand. Under this scenario, the house has a consistent edge—this is, after all, how casinos make money. But in fact every card is not equally likely to be dealt. In particular, there is a serial correlation among the cards dealt in a Blackjack game. If the King of Hearts is dealt, it is rather less likely to appear again very soon. In other words, there is mutual information between the dealing of a card and the outcomes of later hands: Once the King of Hearts is observed, its probability of being dealt in later hands declines. A clever player with a good memory may make use of this mutual information to guide his bets: putting down more money when the house is less likely to draw face cards, for instance. Players who can “count cards” in this way would be the doom of the casinos if allowed to continue. To prevent it, the casinos reduce the extent of serial correlation by dealing cards from shoes of four or more decks, and they eject players suspected of counting.

Redundancy

Moreover, a system composed of two distinct events may include redundancy. Consider a familiar system in which two independent events, each with two possible outcomes, give rise to a combined system with only three, not four, outcomes. Take a Mendelian gene A with two alleles, a1 and a2. When two heterozygotes (a1, a2) mate, their offspring will have one of three genotypes: (a1, a1), (a1, a2), or (a2, a2). If we want to communicate the genotypes of their children, we will need an average of 1.5 bits for each.

This seems counter-intuitive, considering that each gamete from the parents is a1 or a2 with equal probability. That is, p(a1) = p(a2) = 0.5. The entropy represented by each gamete is a coin flip’s worth—one bit. It might seem that, by combining two gametes, each derived independently from a different parent, we should get 2 bits of entropy, not only 1.5.

Yet it is a simple matter to show that we can transmit information about the children’s genotypes using only 1.5 bits each. Let’s encode a heterozygote as “0”, an a1 homozygote as “10”, and an a2 homozygote as “11”. In this encoding, a single bit communicates whether the child is a heterozygote or a homozygote, and, if a homozygote, a second bit communicates which of the two alleles. Now, if our couple of heterozygotes have eight children, four of whom are heterozygotes and two of each homozygote, we can transmit their family’s genotypes as “000010101111”. Eight children. Twelve bits. That’s 1.5 bits per genotype. In some unlikely cases (all homozygotes) we will use more bits, in others (all heterozygotes) we will use fewer.
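A short Python sketch (my own illustration, using hypothetical labels like "a1a2" for the genotypes) confirms both the 1.5-bit entropy of the Mendelian genotype distribution and the 12-bit message for this family of eight:

```python
from math import log2

# Mendelian expectations for the offspring of an (a1, a2) x (a1, a2) cross
genotype_p = {"a1a1": 0.25, "a1a2": 0.5, "a2a2": 0.25}

# Shannon entropy of the genotype distribution
H = -sum(p * log2(p) for p in genotype_p.values())
print(H)   # 1.5 bits per child

# The prefix code from the text: heterozygote -> "0", homozygotes -> "10" / "11"
code = {"a1a2": "0", "a1a1": "10", "a2a2": "11"}
family = ["a1a2"] * 4 + ["a1a1"] * 2 + ["a2a2"] * 2   # the eight children
message = "".join(code[child] for child in family)
print(message)                     # 000010101111
print(len(message) / len(family))  # 1.5 bits per genotype
```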

In this case, two different outcomes from the point of view of inheritance are in fact not distinguished in the genotype of each child. “Heterozygotes” include two distinct classes: those who inherit a1 from the mother (and a2 from the father) and those who inherit a2 from the mother (and a1 from the father). The system that gives rise to the genotypes is relevant to the probability that each genotype will occur (as Mendel discovered), but it is not very relevant to the way we describe those genotypes. All we know about heterozygotes is that they have two different alleles. When we want to predict phenotypes, we may not care which allele came from which parent. When we collect genotypes (on an Affy or Illumina chip, for example), we are in a poor position to determine which allele came from which parent. Thus, even though two bits’ worth of gametes went into the physical system creating the genotypes, only 1.5 bits are sufficient to communicate them.
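The half-bit gap can be made explicit. In the sketch below (again my own Python illustration, using hypothetical labels), the ordered maternal/paternal gamete combinations carry 2 bits, while the unordered genotypes carry only 1.5; the missing 0.5 bits is exactly the uncertainty about parental origin that remains once we know a heterozygote’s genotype:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Ordered (maternal, paternal) gamete combinations: four equally likely outcomes
gametes = {("a1", "a1"): 0.25, ("a1", "a2"): 0.25,
           ("a2", "a1"): 0.25, ("a2", "a2"): 0.25}

# Unordered genotypes collapse the two heterozygote classes into one
genotypes = {"a1a1": 0.25, "a1a2": 0.5, "a2a2": 0.25}

print(entropy(gametes.values()))    # 2.0 bits went into making each child
print(entropy(genotypes.values()))  # 1.5 bits is all the genotype retains
# The 0.5-bit difference is the redundancy: the uncertainty about which
# allele came from which parent that the genotype alone cannot resolve.
```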

What is the import of this redundancy? Clearly, if we are interested in children’s genotypes, then a system that tracks their parents’ gametic contributions is a poor way of encoding the information. That system includes redundant information, with respect to genotypes. But to put the problem another way, knowing the children’s genotypes leaves us with uncertainty about their parents’ gametic contributions. If we want to devise an accurate paternity test, we will sometimes need to know more than the child’s genotype.

Next: Information theory and mutual information between genetic loci
