Ancestry testing: what, me worry?

This post dissects my opinion about the DNAPrint AncestrybyDNA tests. The post does explicate many aspects of the test correctly, but it completely misses the points I was hoping to make. This may be my mistake, so I'll try to respond clearly.

1. While Hawks does mention the issue of statistical significance, he doesn't follow through on it to completion. If a low level of affiliation (e.g., "Native American" in Greeks) is below the level of statistical significance and the "confidence intervals" overlap zero, then that result is statistically equivalent to zero. Given that tests using continuously distributed alleles will have a low error rate, and given that levels of statistical significance are given on the company's website, these low level results should not be taken too seriously.

This is certainly true: insignificant results are meaningless. Except that the company says that these insignificant results can be important. But even this is not my main concern. My problem is that people are paying for these results. Remember the low, low price of $219? Is it fair to say, as the post does, "that some people misunderstand and/or misuse the test does not logically invalidate the worth of the test itself"? Who is misusing the test? For one, sociology professors trying to make their students shed their "traditional" notions of race. Who is misunderstanding it? Presumably, most people who read the company's informational materials, particularly those who have paid for the test.

To me, that's bad. Doesn't make the test invalid. Just bad.

A serious misunderstanding, and one that DNAP itself may have some responsibility for because their explanations of the tests are rather poor, is the idea that DNAPrint's categories somehow represent the specific populations they are named after in a direct fashion, and, in so doing, the categories represent direct descent from these "pure populations" -- which Hawks sees as a serious flaw in the assumptions built into the tests.

The problem is that the categories obviously do not directly represent any modern populations since, for example, real-life South Asians and Middle Easterners are NOT 100% South Asian or Middle Eastern when tested with DNAP. Furthermore, DNAP admits, for example, that they really are not sure what their Euro 1.0 categories exactly represent; one cannot say that DNAPrint's categories represent ancient pure populations, either. The straw man of racial purity is irrelevant here. There needs to be no pure populations as an assumption for any of these tests.

Important - What the categories represent are those sets of gene frequencies that characterize the predominant distinctive ancestral genetic component of particular population groups. (emphasis in original)

The test assays alleles. Period. The result of the test is a genotype for each of the surveyed loci.

So where do populations come into this? The company has allele frequencies taken in many different populations. Let's consider what they could do with these comparative samples:

  1. They could tell people which of the surveyed populations they are most like. The problem with this: a person may have ancestors in different populations, and wants to know which.
  2. They could tell a person if he has one or more alleles that are very rare in his population, but common in some other population. These might constitute strong evidence of some ancestry from that population. This is not a proportion; it is a threshold. The problem: a person may already know he has ancestors in different populations, and really wants to know what proportion have come from these populations. African-Americans are a good example, since it may often be the proportion of European admixture they want to know, rather than the mere fact of it.
  3. They could tell a person all the possible combinations of populations that would be likely to produce his genotypes. The problem: there might be an indefinitely large number of combinations. The data are not sufficient to test combinations of many populations; the allele frequencies do not differ enough for most of the loci among most populations.

So DNAPrint, like most genetic testing companies, does none of these things. Instead, they markedly reduce the necessary degrees of freedom by reducing the number of "populations" they compare. If you do this, you have to choose which populations to use. What better populations for sorting your samples than the Linnaean races? Caucasian (European + West Asian), East Asian, African, Native American.
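To make the proportion idea concrete, here is a minimal sketch of how an admixture proportion might be estimated once the comparison is cut down to two reference groups. It is not DNAPrint's actual method: the allele frequencies are invented, the loci are assumed to be independent and in Hardy-Weinberg proportions, and the names `FREQ_A`, `FREQ_B`, `log_likelihood`, and `estimate_admixture` are hypothetical. The interval is a rough likelihood-based one (all proportions within 1.92 log-likelihood units of the maximum), not a formal confidence interval.

```python
import math

# Hypothetical frequencies of one allele at five loci in two reference groups.
# These numbers are invented for illustration; they are not DNAPrint data.
FREQ_A = [0.90, 0.80, 0.15, 0.70, 0.05]
FREQ_B = [0.10, 0.30, 0.85, 0.20, 0.60]


def log_likelihood(m, genotype):
    """Log-likelihood of a genotype given a proportion m of ancestry from group A.

    genotype[i] is the number of copies (0, 1, or 2) of the allele at locus i.
    Assumes independent loci and Hardy-Weinberg proportions (an assumption).
    """
    total = 0.0
    for k, pa, pb in zip(genotype, FREQ_A, FREQ_B):
        # Expected allele frequency for an individual with mixed ancestry.
        p = m * pa + (1.0 - m) * pb
        total += (math.log(math.comb(2, k))
                  + k * math.log(p)
                  + (2 - k) * math.log(1.0 - p))
    return total


def estimate_admixture(genotype, steps=100):
    """Grid-search estimate of m, plus a rough likelihood-based interval."""
    grid = [i / steps for i in range(steps + 1)]
    scores = [log_likelihood(m, genotype) for m in grid]
    best = max(scores)
    m_hat = grid[scores.index(best)]
    # Keep every m whose log-likelihood is within 1.92 units of the maximum.
    supported = [m for m, s in zip(grid, scores) if best - s <= 1.92]
    return m_hat, min(supported), max(supported)


# A genotype typical of group A, then one that is mostly typical of group B.
for genotype in ([2, 2, 0, 2, 0], [1, 1, 2, 0, 2]):
    m_hat, low, high = estimate_admixture(genotype)
    print(genotype, "estimate:", m_hat, "interval:", (low, high))
```

With these made-up numbers, the second genotype gets a point estimate slightly above zero, yet its interval still reaches zero. That is the situation at issue for the low-level "affiliations": a nonzero percentage on the report that cannot be distinguished from none at all.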

Does this mean that DNAPrint actually assumes that these races existed as separate populations in the past? No. In fact, they must know these "pure races" didn't, because they are the ones who lumped the data to create the groups.

So what is my problem? It is this: why did the company choose the Linnaean races as their comparative groups? Clearly it is because those group names already mean something, and they want to present their data in that interpretive framework.

Now suppose that someone receives a test result telling him he is 58 percent Caucasian, 25 percent East Asian, 10 percent African, and 7 percent Native American. How would you expect him to interpret that result?

Compare to this hypothetical result, based on alleles only without any reference to Linnaean taxonomy. The person is told he has 89 alleles that are common worldwide, 35 that are common in Europe but rare elsewhere, 4 that are very common in East Africa and moderately common in the Near East, 10 that are very common in China and Thailand, moderately common in India and Pakistan, and present but less common in the Near East, and 2 alleles that are very high frequency in Native Americans, but also present in Siberia, Caucasus, the Near East, and Greece.

Neither of these explanations is sufficient to reconstruct the person's ancestry. More genealogical information would be necessary for that. But which summary gives more information? Which leads to an interpretation of ancestry from four different continents? Which leads to an interpretation of ancestry in one or two geographic locations?

My point is that the method of presentation determines the interpretation. In this case, the geneticists should know better than to present their comparative data as if pure races had existed and people were mixtures of them. It naturally leads to a false interpretation that would be avoided by presenting the information differently.

I don't quite understand Hawks' problem with the South Asian data. South Asia is at the geographical/racial crossroads between the Caucasian and Mongoloid worlds. Furthermore, a variety of genetic studies (which I believe I've already mentioned on this blog) demonstrate the unique, mixed nature of South Asian populations and, in particular, East Asian influences (HLA studies, as well as the Ray et al. Alu work). Looking at populations of northern and eastern India, Nepal, Bangladesh, Burma etc, I fail to see why it is so impossible that South Asians are a Caucasian/Mongoloid mix, with the former of course predominating, while certain SE Asian populations may contain a certain Caucasian influx from South Asians. I fail to see as well why the people of Hawaii would not test out as predominantly in the East Asian/Pacific Islander category, why East Africans are not a genetic mix of African/Caucasian, and I thought as well that the Asian influences in Madagascar are well known, along with the obvious African components.

These results are exactly what I would expect from the test. We don't disagree on how the test would likely classify people from different regions.

But in what sense, exactly, does it mean anything to say that East Africans are a mix of Africans and Caucasians? Or that South Asians are a mix of Caucasians and Mongoloids? Indeed the question presupposes a hypothesis of racial history that genes do not support. Is South Asia "at the geographical/racial crossroads"? Why "of course" should Caucasian genes have predominated if this were true? Do the allele frequencies provide any evidence of this ancient mixture of populations? What, exactly, were these ancient "Caucasians" and "Mongoloids" if they were not the "pure races" the test supposedly is not assuming existed?

This may sound like a word game. But these words have a specific meaning and history. It seems to me much clearer and more accurate to say that some alleles have geographic distributions that include Greeks and Native Americans than to say that a Greek has a "Native American affiliation". The former has the advantage of being true, while the latter is quite obviously false -- unless you are a geneticist classifying people by race.

But, that's the point, the data are NOT saying that there must be direct descent from those particular groups. Instead what the data simply show is that the predominant distinctive genetic (ancestral) components in Middle Easterners and South Asians are also found as minor components in the gene-pool of the Irish. Why is not known, but it does not logically require direct descent from one group into another - simply a sharing of allele frequencies.

Then why are we talking about "distinctive genetic components" at all? Why don't we just talk about allele frequencies? That would have the added advantage of not foreclosing evolutionary explanations for them.

I'll say again, we don't disagree about the biology here. What we disagree about is whether the comparisons provided by the test carry relevant true or false information, and whether they lead paying customers to interpretations that are biologically false. I think the comparisons would mislead any non-knowledgeable person to believe that he actually had genetic input from different races, and that these races are meaningful categories representing ancient human groups. And to the extent that a person goes to the trouble to understand detailed deviations from the racial categories, such as the high "Native American affiliation" of Greeks, he is learning biological nonsense.