What do we know about early hominin relationships?

Ten years ago I published a paper on the failure of cladistics to resolve questions of early hominin relationships. My study used computer simulation to produce a very large number of small “fossil” samples drawn from populations that evolved entirely under random genetic drift, with every anatomical character accurately measured and independent of every other character. This scenario was unrealistically good in many respects compared to the real fossil record, where characters are not independent, can be distorted by postdepositional processes, and often evolve in parallel under natural selection. What I found is that many of the small samples in the hominin fossil record are not good enough to test hypotheses about their phylogeny.

The tests in this paper show that parsimony recovers a correct phylogeny in nearly 100% of cases where either sample sizes or the number of independent characters are large. But for the foreseeable future, most hominid taxa will be known from only very small samples, and there can only be a very limited number of independent characters observable on fossil skeletal remains. This paper shows that simple parsimony in such cases will often fail to obtain correct results, and the lack of statistical tests for sample adequacy in phylogenetics has meant that until now, paleoanthropologists have not commonly known to what extent their phylogenetic models are falsely influenced by the factors examined here. Many paleoanthropologists may understand that small sample size, correlations among characters, heterogeneity of samples, and other issues pose barriers to phylogenetic research, but nevertheless may feel that cladistics analyses of fossil hominids provide successively better approximations of the truth. However, the results of this paper show that the output of parsimony analyses does not follow the innate statistical instincts that most researchers may have developed in other analytical contexts; indeed, they can be paradoxical, as discussed below.

Large samples work. Small samples mislead. Worse, including small samples in a study with large samples can lead to incorrect arrangements of the large samples.
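A toy simulation can illustrate the first two of these points. The sketch below is not the parsimony procedure from the 2004 paper: it swaps in a simpler comparison of distances between sample means, and the function name, branch lengths, and within-population spread are all assumed values chosen for illustration. Three populations with a true tree ((A,B),C) drift independently for each character, and the sketch asks how often A and B come out as the closest pair when each taxon is known from `sample_size` individuals.

```python
import random
from statistics import fmean


def simulate_accuracy(sample_size, n_chars=10, trials=300, seed=1):
    """Fraction of trials in which the true sister pair (A, B) is recovered.

    Hypothetical illustration: a distance comparison on sample means,
    not the parsimony analysis used in the 2004 paper.
    """
    rng = random.Random(seed)
    # Assumed values: drift variance accumulates in proportion to branch length.
    internal, terminal, within_sd = 3.0, 1.0, 4.0
    correct = 0
    for _ in range(trials):
        dist = {"AB": 0.0, "AC": 0.0, "BC": 0.0}
        for _ in range(n_chars):
            # Each character drifts independently along the tree ((A,B),C).
            stem = rng.gauss(0.0, internal ** 0.5)
            pop_mean = {
                "A": stem + rng.gauss(0.0, terminal ** 0.5),
                "B": stem + rng.gauss(0.0, terminal ** 0.5),
                "C": rng.gauss(0.0, (internal + terminal) ** 0.5),
            }
            # A "fossil sample" is a few individuals scattered around the population mean.
            obs = {
                taxon: fmean([rng.gauss(mu, within_sd) for _ in range(sample_size)])
                for taxon, mu in pop_mean.items()
            }
            for pair in dist:
                dist[pair] += (obs[pair[0]] - obs[pair[1]]) ** 2
        # The inferred sister pair is the one with the smallest summed squared distance.
        if min(dist, key=dist.get) == "AB":
            correct += 1
    return correct / trials


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, simulate_accuracy(n))
```

Under these assumed parameters, single-specimen samples should recover the true pair far less often than samples of 100, though the exact rates depend on the seed and on the values chosen. The sketch does not address the third point, that small samples can disturb the arrangement of large ones.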

This positivist outlook is reflected by a common, but fallacious, perception: that phylogenetic research has been converging on the “correct” answers, with the “problem” preventing stable evolutionary trees to be drawn being the continual appearance of new specimens and species. **But if the samples available to test hominid phylogenetic hypotheses were statistically sufficient, then analyses would be very unlikely to change when new specimens or species were added.** Recent discoveries of early hominids confirm the substantial possibility of change in the current most parsimonious phylogenetic hypotheses. For example, the possible addition of _Kenyanthropus_ (Leakey et al., 2001) as a sister taxon to _H. rudolfensis_ would either remove _H. rudolfensis_ from the _Homo_ clade or it would remove the _Homo_ clade as a sister to _Australopithecus_. In any event, the topology of basal nodes in the phylogeny (including relationships that are in complete consensus among pre-1999 cladistics studies) could be completely rearranged. That this might occur on the basis of the few apparently derived similarities between two specimens, KNM-WT 40000 and KNM-ER 1470, is strong proof of the statistical weakness of the data. It also implies that even the interrelationships of relatively large samples such as those assigned to _A. afarensis_ and _H. habilis_ may be contingent on the most parsimonious arrangement of other quite small samples. We can expect that other new hominid taxa, including _Orrorin_, _Ardipithecus_, _Australopithecus garhi_, and possibly _Sahelanthropus_, will therefore further disrupt our previous understanding. With the addition of each new taxon, the number of possible hominid phylogenies grows exponentially greater, and with this number grows the number of ways that phylogenies may be in error.
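To give a sense of scale for that last point, here is a minimal sketch that counts the possible trees for a given number of taxa. It assumes fully bifurcating, rooted trees with labeled tips, for which the standard count is the double factorial (2n − 3)!!; the function name is illustrative only.

```python
def rooted_tree_count(n_taxa):
    """Number of distinct rooted, fully bifurcating trees on n labeled taxa.

    Uses the standard result (2n - 3)!! = 3 * 5 * ... * (2n - 3).
    """
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, rooted_tree_count(n))
```

Three taxa allow 3 trees, four allow 15, five allow 105, and ten allow 34,459,425, so each added taxon multiplies the number of alternatives that the same limited set of characters has to discriminate among.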

Since 2004, many paleoanthropologists have done better at acknowledging the weaknesses of parsimony analysis. Most substantial discoveries (for example, Ardipithecus ramidus in 2009 and Australopithecus sediba in 2010) have been published with cladograms placing them among known hominin samples. But results have been very cautiously discussed in these cases, emphasizing the drawbacks of other, less-complete specimens in earlier studies of hominin phylogeny. In these cases, specimens that preserve both cranial and postcranial remains have shown how biased the study of purely cranial characteristics can be.

What these examples do not present – at least not yet – is more than one or two specimens for most characters. So they underrepresent the variability within species, preventing us from telling which characters are fixed and which vary. Small sample size remains a severe constraint on our ability to test hypotheses of relationships. What we know about early hominins depends disproportionately on the Hadar, Sterkfontein and Swartkrans samples – and the attendant assumption that each of these samples mostly represents a single species assemblage. In each case, the variability represented is very extensive, showing us the limits of understanding mixed-species assemblages like that represented in the Turkana basin between 2 million and 1.5 million years ago.

Our knowledge of large hominin samples is very good, and we can be fairly confident about their relationships. But even in those cases, there is ambiguity. For example, is A. africanus closer to Homo than A. afarensis? That depends on how we constitute the samples and which characters we include. The latter question seems deceptively simple – include everything! But the more we include, the more we must rely on singular specimens.

At an extreme, we turn to features like the upper-to-lower limb proportion. This would seem to have strong adaptive relevance, and the lower limb is clearly relatively longer in humans and Homo erectus compared to earlier hominins. Many scholars have argued that the upper-to-lower limb length ratio in AL 288-1 (Lucy) is more humanlike than in several later australopithecine skeletal specimens (including OH 62, often attributed to Homo habilis). But until recently AL 288-1 was the only skeleton with both upper and lower limb elements sufficiently preserved to estimate length. To compare other “species”, researchers were forced to compare the dimensions of joint surfaces, or to estimate bone length based on regressions from joint dimensions or small portions of bone shafts. The discussion of OH 62 has been particularly protracted, with some scholars arguing for humanlike proportions and others for more apelike proportions, from the same bone fragments. In other words, the question comes down to “character analysis” – the detailed consideration of how the character develops, how it varies within samples, and how it should be scored on fossil specimens. As long as we are counting characters independently in our cladistic study, without considering sample sizes for those characters or the confidence in the character analysis for those characters, our comparisons will be limited by the accuracy of the smallest samples.

People have often argued that previously unknown results are credible if a study replicates other results on which prior work largely agrees. That is, credibility can be judged as a function of consistency with earlier work. I considered this issue in my 2004 paper:

It was no surprise, for example, that _A. robustus_ and _A. boisei_ were grouped as sisters in most cladistic analyses, or that _A. afarensis_ was an outgroup to later hominids, or that _H. habilis_ and _H. rudolfensis_ were often grouped with later _Homo_. The original descriptions of the fossils pointed out the derived resemblances in each of these cases, and there has been relatively little disagreement on any of these points since the fossils were unearthed. Although the inclusion of these well-documented sister-group statements may be a minimum standard of credibility for a cladogram, they convey no necessary confidence in the results of the method for new, unknown, or disputed relationships. Different cladistic analyses of fossils do not sample different possible worlds in the same way as the simulations presented in this paper; they apportion a single set of observations in different ways. Because the observations are the same, the results must agree—absent differences in character analysis or parsimony assumptions—and we should expect unanimity of analyses even if they are statistically inadequate.

Small samples may lead to wrong results, and they are likely to lead to the same wrong results no matter how many times we look at them. The only way to do better is to increase the sizes of samples.

None of this means we shouldn’t use parsimony approaches. But we should pay much more attention to the results from analysis of larger samples. And we should be very critical of the composition of those samples. Particularly bad are the surface lag deposits representing landscapes that may have had multiple species on them. In these cases, each specimen may have hundreds of thousands of years of uncertainty in its provenience, and may be attributed to a “species” based on nothing more than the local abundance of dental remains across a half-million-year span.

References:

Hawks J. 2004. How much can cladistics tell us about early hominid relationships? American Journal of Physical Anthropology, 125(3), 207-219. doi:10.1002/ajpa.10280