john hawks weblog

paleoanthropology, genetics and evolution

snps

  • How widespread is Denisovan ancestry today?

    Tue, 2011-11-01 00:32 -- John Hawks

    Last month, David Reich and colleagues [1] reported on estimates of Denisovan ancestry for island and mainland Asian populations. Their most memorable conclusion was that they could find no substantial sign of Denisovan ancestry anywhere on the Asian mainland, or indeed on any island that had ever been connected by land to Asia.

    The distribution was stark, as illustrated by the map from the paper:

    I wrote about the paper when it was released ("Denisovan DNA in the islands, and an Australian genome"), noting:

    Notice the apparent lack of Denisovan ancestry in anyone who lives anywhere that was once connected by land with mainland Asia. I say "apparent" deliberately: Abi-Rached and colleagues reported last month on the widespread distribution of Denisovan HLA types among today's Asian populations, and those may well be products of Denisovan genes that were later selected. I've already identified a handful of other loci that seem to reflect Denisovan ancestry in mainland Asian people. According to the comparisons by Reich and colleagues, such loci must be exceptions.

    Abi-Rached and colleagues [2] had argued that HLA alleles found in the Denisovan genome are presently common in some parts of Asia, and likely reflect local adaptive introgression. Substantial introgression of a small number of genes would not be enough to create a strong genome-wide appearance of Denisovan ancestry. Still, it was a little odd that the first genes anybody looked closely at would provide strong evidence of introgression.

    Now, Pontus Skoglund and Mattias Jakobsson [3] say that Denisovan ancestry is widespread across China and Southeast Asia.

    That conclusion contradicts Reich and colleagues, so why do the studies come to such different results?

    Skoglund and Jakobsson suggest that they have succeeded in finding introgression where others failed because their model accounts for ascertainment bias in the available datasets. SNP data come from genotyping chips, which have been designed using known polymorphisms. Five years ago, we knew much more about polymorphisms in Europe than other parts of the world, and so the HGDP, and HapMap to a lesser extent, do a good job of sampling rare alleles in Europe but miss many rare alleles in Africa and other populations. This is the ascertainment bias.

    Some of the most obvious signs of introgression today are cases where rare alleles are shared with an archaic genome. If ascertainment bias causes you to miss the rare alleles, you'll miss the introgression.

    But that explanation isn't really sufficient to explain the differences between these papers. For one thing, Reich and colleagues [1] also worked hard to account for ascertainment biases in their SNP samples. For another, whole genome comparisons between East Asian samples and the Denisova genome have not yielded evidence of Denisovan ancestry, even though whole genomes have no ascertainment bias. The number of whole genomes so far compared is very small, and so the statistical ability to detect introgression is lower, but Skoglund and Jakobsson actually replicate that null result in their current paper.

    Probably most important, it's not clear that Skoglund and Jakobsson's result can actually be explained by rare alleles. Here is Figure 1e from their paper:

    Figure 1e from Skoglund and Jakobsson (2011). Original caption: Interpolated spatial distribution of the frequency of Denisova alleles at SNPs where Denisova is different from chimpanzee and Neandertal. Sample localities are indicated with rectangles.

    This map represents a clever comparison. It is a heat map of the mean local frequency of the subset of alleles that are present in Denisova but absent from chimpanzees and Neandertals. These are presumptively derived alleles relative to the chimpanzee. The SNPs here are all known to vary in human populations, because they are all included in the HGDP sample. So the map does not represent all the Denisova derived mutations in humans today, only a particular subset that is especially likely to be informative.
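    To make the filtering step concrete, here is a minimal sketch of how such a mean frequency could be computed. The records, allele calls, and population labels below are invented for illustration; they are not data from Skoglund and Jakobsson, and the function names are my own assumptions rather than anything from the paper.

```python
from statistics import mean

# Hypothetical toy records, one per SNP. "freq" holds the frequency of the
# Denisova allele in each (made-up) population sample.
snps = [
    {"chimp": "A", "neandertal": "A", "denisova": "G",
     "freq": {"north": 0.02, "south": 0.04}},
    # Excluded: the Neandertal carries the same allele as Denisova.
    {"chimp": "C", "neandertal": "T", "denisova": "T",
     "freq": {"north": 0.50, "south": 0.50}},
    # Excluded: Denisova matches the chimpanzee (presumed ancestral) allele.
    {"chimp": "G", "neandertal": "G", "denisova": "G",
     "freq": {"north": 1.00, "south": 1.00}},
    {"chimp": "T", "neandertal": "T", "denisova": "C",
     "freq": {"north": 0.00, "south": 0.02}},
]


def is_informative(snp):
    """Keep SNPs where Denisova differs from both chimpanzee and Neandertal."""
    return (snp["denisova"] != snp["chimp"]
            and snp["denisova"] != snp["neandertal"])


def mean_denisova_frequency(snps, population):
    """Mean frequency of the Denisova allele over the informative SNPs."""
    informative = [s for s in snps if is_informative(s)]
    return mean(s["freq"][population] for s in informative)


for population in ("north", "south"):
    print(population, round(mean_denisova_frequency(snps, population), 4))
```

    Only two of the four toy SNPs survive the filter, which is the point: the map averages over a selected subset, not over all variable sites.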

    Given that the sites have been picked in a special way, we need to examine carefully how strong the pattern really is. Notice the scale of the heat map. The difference between the orange area in south China and the green area in north China is around 0.001, or a tenth of a percent in mean frequency. The actual values are reported in the online supplement, in Table S3. An exception is the Yizu sample in south China, which has a mean frequency around 0.006 higher than its neighbors. The Yizu sample includes only 10 individuals (9 males, 1 female). The paper does not report the number of SNPs included in this comparison, but it must be a very small set relative to the total, because only a small fraction of human SNPs are known to be derived in Denisova and ancestral in Neandertals.

    With this very small difference in frequencies, I would not rule out the hypothesis that the zone of high Denisova derived frequencies in south China is caused entirely by frequency enrichment of a small number of loci. A handful of genes like the HLA loci observed by Abi-Rached and colleagues might be enough to create this very slight elevation in the average. Hence, the best case is that the data here simply provide greater sensitivity to small amounts of introgression. The worst case is that the pattern may be dominated by the Yizu sample, which is really too small to carry this kind of load.
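    A rough back-of-the-envelope sketch shows why a handful of loci could be enough. The SNP count and the size of the enrichment below are purely hypothetical, since the paper does not report how many SNPs went into the comparison; the function names are mine.

```python
def mean_shift(n_snps, n_enriched, enrichment):
    """Change in the mean frequency across n_snps sites when n_enriched of
    them each rise in frequency by `enrichment` and the rest stay put."""
    return n_enriched * enrichment / n_snps


def enriched_needed(n_snps, target_shift, enrichment):
    """Number of enriched sites needed to move the mean by target_shift."""
    return target_shift * n_snps / enrichment


# Hypothetical: 10,000 informative SNPs, 20 of them (perhaps a few linked
# regions) at a frequency 0.5 higher in one region than in another.
print(mean_shift(10_000, 20, 0.5))         # shift in the regional mean
print(enriched_needed(10_000, 0.001, 0.5))  # sites needed for a 0.001 shift
```

    Under these made-up numbers, 0.2% of the sites account for the entire 0.001 difference. Whether anything like that is happening in the real data is exactly what the map alone cannot tell us.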

    The strongest evidence presented in the paper is a comparison of north and south East Asian regions directly. Although the comparison of south China against other regions of the world (Africa, Europe) does not yield significant evidence of Denisovan similarity in this paper, south China differs from north China in essentially the same way that the Oceanian people do from other regions. And the Oceanian populations (here, Papua New Guinea and Bougainville) differ from other regions because of their Denisovan ancestry. So Skoglund and Jakobsson infer that the north/south comparison reflects Denisovan ancestry as well.

    I think this comparison is sound, and the question is, how much introgression would this pattern require? The paper answers that question in this way:

    Quantitative estimation of the precise fraction of Denisova-related ancestry in Southeast Asian populations based on genotype data are unfortunately sensitive to ascertainment bias and genetic drift, and such estimates will require genome sequence data that are currently unavailable. However, both the PCA results (Fig. 1B) and the approximately six times lower absolute values of the D statistic in tests between Northeast Asians and Southeast Asians compared with tests between Northeast Asians and Oceanians (Table S4) indicate a relatively low fraction of Denisova-related ancestry. Thus, the fraction is likely to be smaller than both the ~5% fraction of Denisova-related ancestry present in Oceanians and the ~2.5% fraction of Neandertal ancestry present in non-Africans (23, 24), perhaps around 1%.

    One percent is an amount that whole genome comparisons at present do not rule out, and I think it's a reasonable guess. I would not have thought we could rule out a one percent contribution from other, non-Denisovan archaic people, for example.

    We aren't very far from a more definitive answer to this question, as the data continue to accumulate every day. What I find interesting is the way that models can generate these 1% differences in ancestry proportions, depending on sampling and the pattern of migration assumed to have happened in the past. Two estimates that differ by less than a percent are not really different. This paper provides the suggestion of a more widespread Denisovan legacy, and I accept that as a possibility.

    I should mention: less than one percent of a half billion people is still a very large number, added to five percent of the indigenous population of New Guinea and Australia, and smaller fractions of other island populations. The total amount of Denisovan legacy present in living people probably exceeds the population of Earth at the time the Denisovans lived.


    References

    Synopsis: 
    A new paper contradicts earlier work by suggesting a widespread Denisovan legacy in south China
  • Information theory: a short introduction

    Fri, 2008-09-26 21:37 -- John Hawks

    I lectured this week in my Biology of Mind course about information theory, and in particular the concept of Shannon entropy. I’ve typed up a few notes for my students, and I’m cross-posting them on my own blog because they are relevant to another topic I’ll be writing about: discovery and testing of natural selection in the human genome. You see, the kind of data that are presently being collected as part of the International HapMap Project, single nucleotide polymorphisms (SNPs), are naturally treated by information theoretic measures. So first, it may help to define the essential concepts of information theory.

    Many readers will have heard of the concept of entropy in connection to the thermodynamics of physical systems. Indeed, one common statement of the Second Law of Thermodynamics is that a closed system must increase in entropy over time. Entropy is a statistical characteristic of a system, related to the probabilities that the particles of a system will be found in given states at any given time. In thermodynamic terms, a system’s entropy is related to our ability to extract work from the system. The Second Law implies that work cannot endlessly be extracted from a closed system without the addition of energy from outside. By 1927, Leó Szilárd (probably my favorite physicist) had shown that the entropy of a physical system can be naturally defined in terms of information. In other words, one way of looking at entropy is in terms of uncertainty about the state of any given particle in a system, and one might apply energy to a system in order to reduce this uncertainty (for example, by concentrating particles in one part of the system).

    Claude Shannon developed the concept of entropy as applied to communication systems. By doing so, he established the field of information theory. Shannon developed several fundamental theorems, including a derivation of the relation between channel capacity (bandwidth) and noise, studies of optimal encoding strategies, and a means of treating continuous as well as finite communications. His most basic definition is that of information entropy, which has also come to be called the Shannon entropy. This definition places entropy as a measure of our uncertainty about the state of a system—in particular, as applied to information, a system of signs.

    Shannon published “A mathematical theory of communication” in 1948, describing his theoretical work. Along with this article, the (UW-Madison) mathematician Warren Weaver wrote a popular treatment of Shannon’s work, titled “Recent contributions to the mathematical theory of communication.” I mention these articles because it is very hard to improve upon them; they are clear in their exposition and notation. The two can be found together in the 1949 book, The Mathematical Theory of Communication, which has been reprinted several times. I can’t recommend this book highly enough.

    Keeping that in mind, I won’t be reiterating the essentials of information theory here; I just want to give a basic understanding that can be applied to other problems—particularly with respect to SNP datasets.

    Entropy and outcomes

    Suppose we have an electron in a box. Its spin may be “left” or “right”, with equal probability. What is our uncertainty about the electron’s spin? One way of looking at the question: We are just as uncertain about the electron as we would be about the flip of a fair coin. The uncertainty in both cases has the same quantity, even though the systems are in other respects totally different from each other. Hence, it is desirable that our definition of uncertainty not depend on the actual physical characteristics of a system, but only upon the probabilities of signs within the system.

    If we were not uncertain at all, the probability of one outcome would be 1 and the other would be zero. For instance, if we had a two-headed coin, we would be absolutely certain of flipping heads. Naturally, a definition of uncertainty should assign zero to the case in which we already know the outcome. But for a fair coin, we have a probability of 0.5 for one outcome, and a probability of 0.5 for the other. We are uncertain, and our measure of uncertainty should have a positive value in this case, whatever unit we may choose.

    Now suppose we have a nucleotide of DNA. It may be adenine, guanine, cytosine or thymine, each with probability 0.25 (1/4). How many coin flips would give us the same amount of uncertainty? The answer is two: Two flips have four possible outcomes (0,0; 0,1; 1,0; 1,1) with equal probability (0.25) for each. Again we have an equivalence between two systems in the amount of uncertainty about the outcome. However, in this case we can see that it takes two trials of one system to attain the same uncertainty as one trial of the other system. It would seem that we should be twice as uncertain about the nucleotide as we are about a single coin flip. Indeed, three nucleotides of DNA (a codon) will give us 64 possible outcomes—the same as six coin flips.

    The number of possible outcomes of a set of trials grows exponentially with the number of trials. So for example, 5 coin flips yield 2^5 = 32 possible outcomes; 10 coin flips yield 2^10 = 1024 possible outcomes. If the coin is fair, then every outcome is equally probable, meaning that any particular sequence of 10 coin results will be observed with probability 1/1024. A consideration of this system over a slightly larger scale will give some idea of the power of encoding. With 10 coin flip results, we might choose a room in a 1024-bed hospital at random. With 100 coin flips, we may describe a system of about 1.3 × 10^30 elements, enough to randomly choose a point on the Earth’s surface to within a millionth of an inch.
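    For readers who like to check the arithmetic, here is a small sketch of the counting argument. The function names are just labels I have chosen for illustration.

```python
import math


def outcomes(n_symbols, n_trials):
    """Number of distinct sequences of n_trials draws from n_symbols symbols."""
    return n_symbols ** n_trials


def equivalent_coin_flips(n_outcomes):
    """Fair coin flips needed to pick one of n_outcomes equally likely cases."""
    return math.log2(n_outcomes)


print(outcomes(2, 5))                # 32
print(outcomes(2, 10))               # 1024
print(outcomes(4, 3))                # 64 possible codons
print(equivalent_coin_flips(64))     # 6.0 coin flips per codon
print(f"{outcomes(2, 100):.3e}")     # 1.268e+30
```

    The logarithm undoes the exponent: the number of outcomes multiplies with each added trial, while the number of coin flips needed to name one outcome merely adds.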

    If we are uncertain where we have hidden our microdot, a hyper-GPS could communicate its location anywhere in the world to us with a string of 100 heads or tails. The information that will remove our uncertainty is related to the logarithm of the number of outcomes. This leads to a mathematical definition of uncertainty in terms of logarithms. In particular, for a system X with possible outcomes x_1, x_2, …, x_n, the information entropy (H(X)) is:

    $$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1}$$

    The logarithm is conventionally taken as a base-2 logarithm, so that the measure of entropy is the binary digit, or bit. A single coin flip has two possible outcomes each with probability 0.5. The equation gives us:

    $$H(\text{coin flip}) = -[0.5 \log 0.5 + 0.5 \log 0.5] = 1 \text{ bit} \tag{2}$$

    This equation allows us also to handle systems in which the probabilities of different outcomes are not all equal. For example, suppose there are two outcomes, with probability 0.9 and 0.1. What is the uncertainty?

    $$H(X) = -[0.9 \log 0.9 + 0.1 \log 0.1] = 0.47 \text{ bits} \tag{3}$$

    Here we are less than half as uncertain as in the case of a fair coin, and indeed, that is the point. If we had a coin that consistently gave only 10% tails, we would on average be considerably more certain about the outcome. On average, we can communicate two flips of our unfair coin with only one bit, provided we use the right kind of encoding. The point for us is that a system with much less uncertainty than a coin flip can be specified with less information than a coin flip.
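    Here is a minimal sketch of both ideas: the entropy calculation itself, and one way an encoding can approach it. I have used Huffman coding over blocks of flips as the example encoding; that is my choice for illustration, not something the argument above depends on, and the function names are my own.

```python
import heapq
import itertools
import math


def shannon_entropy(probabilities):
    """Entropy in bits; outcomes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def huffman_bits_per_flip(p_heads, block_size):
    """Average bits per flip when blocks of flips are Huffman-coded."""
    # Probability of every possible block of flips (True stands for heads).
    block_probs = [
        math.prod(p_heads if flip else 1 - p_heads for flip in block)
        for block in itertools.product([True, False], repeat=block_size)
    ]
    # Repeatedly merge the two least probable nodes. The expected code
    # length equals the sum of the probabilities of all merged nodes.
    heap = list(block_probs)
    heapq.heapify(heap)
    expected_length = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        expected_length += merged
        heapq.heappush(heap, merged)
    return expected_length / block_size


print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy([0.25] * 4))    # one nucleotide: 2.0 bits
print(shannon_entropy([0.9, 0.1]))    # unfair coin: about 0.47 bits
for n in (1, 2, 4, 8):
    print(n, huffman_bits_per_flip(0.9, n))
```

    Coding one flip at a time still costs a full bit, and coding pairs costs about 0.645 bits per flip. The figure of one bit for two flips is the limit that longer and longer blocks approach, which is why the choice of encoding matters.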

    Mutual information

    Now, suppose we have two distinct events. If these events are independent, then the joint entropy represented by both is simply the sum of their individual entropies:

    $$H(X,Y) = H(X) + H(Y) \tag{4}$$

    Why is this? Consider two coin flips. In the combined system including both flips, we have four possible outcomes (0,0; 0,1; 1,0; 1,1). If our two flips are independent, then the probability of each combined outcome is the product of the probabilities of the individual outcomes. That is, p(0,1) = p(0)p(1), and log p(0)p(1) = log p(0) + log p(1). Then, equation 4 can be derived from 1 by algebraic manipulation.

    But if the two events are not independent, that is, if the outcome of one depends on the outcome of the other, then their joint entropy must be less than the sum of their individual entropies.

    $$H(X,Y) \le H(X) + H(Y) \tag{5}$$

    And the difference between the joint entropy and the sum of the individual entropies is a measure of the correlation between the two events. We call this difference the mutual information of the two events, and we define it mathematically:

    $$I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{6}$$
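    As a minimal sketch, mutual information can be computed directly from a table of joint probabilities. The three joint distributions below are toy examples I made up to show the range of possible values; the function names are likewise my own.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a dict mapping (x, y) to p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        # Marginal probabilities are sums over the other variable.
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return entropy(p_x.values()) + entropy(p_y.values()) - entropy(joint.values())


# Two independent fair coin flips: no mutual information.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# The second flip always copies the first: one full bit is shared.
copied = {(0, 0): 0.5, (1, 1): 0.5}
# The flips agree 80% of the time: somewhere in between.
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

for name, joint in [("independent", independent), ("copied", copied),
                    ("correlated", correlated)]:
    print(name, round(mutual_information(joint), 3))
```

    The independent case recovers equation 4, since the difference is zero; any dependence between the two events shows up as a positive value.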

    Consider a game of Blackjack at a casino. Most people play probabilities as if every card were equally likely to be dealt in every hand. Under this scenario, the house has a consistent edge; this is, after all, how casinos make money. But in fact every card is not equally likely to be dealt. In particular, there is a serial correlation among the cards dealt in a Blackjack game. If the King of Hearts is dealt, it is rather less likely to occur again very quickly. In other words, there is mutual information between the dealing of a card and the outcomes of later hands: Once the King of Hearts is observed, its probability of being dealt in later hands declines. A clever player with a good memory may make use of this mutual information to guide his bets: putting down more money when the house is less likely to draw face cards, for instance. Players who can “count cards” in this way would be the doom of the casinos if the casinos allowed it to continue. To prevent it, casinos reduce the extent of serial correlations by dealing cards from shoes of four or more decks, and they eject players suspected of counting.

    Redundancy

    Moreover, a system composed of two distinct events may include redundancy. We may consider a very prominent system in which two independent events, each with two outcomes, give rise to a combined system with only three, not four outcomes. Consider a Mendelian gene A with two alleles a1 and a2. When two heterozygotes (a1a2) mate, their offspring will have one of three genotypes: a1a1, a1a2, or a2a2. If we want to communicate the genotypes of their children, we will need an average of 1.5 bits for each.

    This seems counter-intuitive, considering that each gamete from the parents is a1 or a2 with equal probability. That is, p(a1) = p(a2) = 0.5. The entropy represented by each gamete is a coin flip’s worth—one bit. It might seem that combining two gametes, each derived independently from a different parent, we should have 2 bits of entropy, not only 1.5.

    Yet it is a simple matter to show that we can transmit information about the children’s genotypes using only 1.5 bits. Let’s encode a heterozygote as a “0”, an a1 homozygote as “10” and an a2 homozygote as “11”. In this encoding, a single bit communicates whether the child is a homozygote or heterozygote, and for a homozygote it is followed by a second bit communicating which of the two alleles the child carries. Now, if our couple of heterozygotes have eight children, four of whom are heterozygotes and two of each homozygote, we can transmit their family’s genotypes as “000010101111”. Eight children. Twelve bits. That’s 1.5 bits per genotype. In some unlikely cases (all homozygotes) we will use more bits, in others (all heterozygotes) we will use fewer.
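    The encoding just described is easy to write down as a short sketch. The genotype labels and function names here are my own; the code table is the one given in the text.

```python
# Prefix code from the text: heterozygote "0", homozygotes "10" and "11".
CODE = {"a1a2": "0", "a1a1": "10", "a2a2": "11"}


def encode(genotypes):
    """Concatenate the code words for a list of genotypes."""
    return "".join(CODE[g] for g in genotypes)


def decode(bits):
    """Read genotypes back from a bit string produced by encode()."""
    genotypes = []
    i = 0
    while i < len(bits):
        if bits[i] == "0":
            # A single 0 is a heterozygote.
            genotypes.append("a1a2")
            i += 1
        else:
            # A 1 announces a homozygote; the next bit says which allele.
            genotypes.append("a1a1" if bits[i + 1] == "0" else "a2a2")
            i += 2
    return genotypes


# Eight children in the expected 1:2:1 Mendelian proportions.
family = ["a1a2"] * 4 + ["a1a1"] * 2 + ["a2a2"] * 2
message = encode(family)
print(message, len(message) / len(family))

# Expected bits per child under Mendelian probabilities.
expected_bits = 0.5 * len(CODE["a1a2"]) + 0.25 * len(CODE["a1a1"]) + 0.25 * len(CODE["a2a2"])
print(expected_bits)
```

    Because no code word is the beginning of another, the string can be read back without separators between children.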

    In this case, two different outcomes from the point of view of inheritance are in fact not distinguished in the genotype of each child. “Heterozygotes” include two distinct classes: those who inherit a1 from the mother (and a2 from the father) and those who inherit a2 from the mother (and a1 from the father). The system that gives rise to the genotypes is relevant to the probability that each genotype will occur (as Mendel discovered), but it is not very relevant to the way we describe those genotypes. All we know about heterozygotes is that they have two different alleles. When we want to predict phenotypes, we may not care which allele came from which parent. When we collect genotypes (on an Affy or Illumina chip, for example), we are in a poor position to determine which allele came from which parent. Thus, even though two bits of gametes went into the physical system creating the genotypes, only 1.5 bits is sufficient to communicate them.

    What is the import of this redundancy? Clearly, if we are interested in children’s genotypes, then a system that tracks their parents’ gametic contributions is a poor way of encoding the information. That system includes redundant information, with respect to genotypes. But to put the problem another way, knowing the children’s genotypes leaves us with uncertainty about their parents’ gametic contributions. If we want to devise an accurate paternity test, we will sometimes need to know more than the child’s genotype.
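    The half bit that goes missing can be counted directly. This is a minimal sketch assuming a cross between two heterozygotes, with each parent transmitting a1 or a2 with equal probability; the variable names are mine.

```python
import math
from collections import Counter
from itertools import product


def entropy(probabilities):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Every (maternal, paternal) gamete pair is equally likely: probability 1/4.
gamete_pairs = list(product(["a1", "a2"], repeat=2))

# A genotype ignores which parent supplied which allele, so sort each pair.
genotype_counts = Counter(tuple(sorted(pair)) for pair in gamete_pairs)

h_gametes = entropy([1 / len(gamete_pairs)] * len(gamete_pairs))
h_genotypes = entropy([count / len(gamete_pairs)
                       for count in genotype_counts.values()])

# The genotype is fully determined by the gamete pair, so the difference is
# the uncertainty about parental origin that remains once we know the genotype.
remaining_uncertainty = h_gametes - h_genotypes

print(h_gametes, h_genotypes, remaining_uncertainty)
```

    The remaining half bit is carried entirely by the heterozygotes: half of the children are heterozygous, and for each of them one bit would be needed to say which parent contributed which allele.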

    Next: Information theory and mutual information between genetic loci

