Scaling data, some hints in dimension reduction methods

PLoS Computational Biology has a very helpful article by Lan Huong Nguyen and Susan Holmes meant to help people with statistical visualizations: “Ten quick tips for effective dimensionality reduction”.

Commonly, people examining large datasets with many dimensions will present their results with figures that show only two dimensions. In genetics, most will use principal component analysis (PCA) to reduce thousands of dimensions to two. In morphology, PCA is also very common, although some specialists may use Procrustes fitting or other methods. This paper by Nguyen and Holmes runs through several common misconceptions and errors, both in choosing a method to reduce dimensions and in displaying the results.

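One of the biggest: The aspect ratio of a PCA plot should reflect the variances of the plotted components, not an arbitrary scale. In practice, one unit along PC1 should be drawn the same length as one unit along PC2, so the axis that carries more variance comes out longer. Otherwise, data that are really normally distributed may look anything but.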
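
For readers who make their own plots, here is a minimal sketch of that tip in Python, assuming NumPy and matplotlib are available. It is not code from Nguyen and Holmes: the function names `pca_scores` and `plot_pca`, the output file name, and the simulated two-cluster data are all made up for illustration. The one line that matters is `ax.set_aspect("equal")`, which asks matplotlib to draw one data unit at the same length on both axes.

```python
import numpy as np


def pca_scores(X):
    """Project centered data onto its principal components.

    Returns the scores (one column per component) and the variance
    along each component, sorted from largest to smallest.
    """
    Xc = X - X.mean(axis=0)                    # center each variable
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                             # coordinates on the PCs
    variances = s ** 2 / (X.shape[0] - 1)      # eigenvalues of the covariance
    return scores, variances


def plot_pca(scores, variances, labels, path="pca.png"):
    """Scatter PC1 against PC2 with equal unit lengths on both axes."""
    import matplotlib
    matplotlib.use("Agg")                      # render to a file, no display
    import matplotlib.pyplot as plt

    share = 100 * variances / variances.sum()  # percent of total variance
    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="coolwarm", s=8)
    ax.set_aspect("equal")                     # the key line for this tip
    ax.set_xlabel(f"PC1 ({share[0]:.1f}% of variance)")
    ax.set_ylabel(f"PC2 ({share[1]:.1f}% of variance)")
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Hypothetical example data: two Gaussian clusters in 10 dimensions.
rng = np.random.default_rng(0)
n, p = 200, 10
centers = np.zeros((2, p))
centers[1, :3] = 3.0                           # second center is shifted
labels = np.repeat([0, 1], n)
X = centers[labels] + rng.normal(size=(2 * n, p))

scores, variances = pca_scores(X)

# Relative spread of PC2 to PC1 that an equal-unit plot will show.
height_to_width = np.sqrt(variances[1] / variances[0])
print(f"PC2/PC1 spread ratio: {height_to_width:.2f}")
```

Calling `plot_pca(scores, variances, labels)` writes the figure to disk. Dropping the `set_aspect` line lets matplotlib stretch both axes to fill the frame, which is the kind of arbitrary scaling the tip warns against.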

Figure 2 from Nguyen and Holmes, 2019, showing the effects of different aspect ratios on visualizations of PCA results. These charts all show the same data, which were generated by drawing normally distributed (Gaussian) random points around two centers. The two clusters are red and blue in the final frame, which has an aspect ratio based on the variance in the data. The others show incorrectly scaled data, which are easily misinterpreted. I would add that morphological datasets are based on much smaller samples, and give rise to false interpretations even more easily.

It's a frequent irritation to me that for data visualizations we are so often at the mercy of people who write up papers but do not share original data. So in presentations or for secondary work you're left relying upon someone else's PCA plot. These are almost always composed with bad choices of colors, unreadable fonts, and weird scales that make no sense. Don't get me wrong, there are some beautiful data visualizations out there. But the average paper in morphology or genetics is full of stinkers. And it would be so easy to just provide the original data so that those of us who re-use data in other contexts can make your results look better. Share!