john hawks weblog

paleoanthropology, genetics and evolution

bioinformatics

  • "Brittle techniques"

    Mon, 2013-01-28 00:03 -- John Hawks

    I was pointed to a rant from early last year written by Fred Ross: "A farewell to bioinformatics".

    Like any good rant, it is extreme and I don't endorse it, but like all good rants it has kernels of truth.

    This all seems an inauspicious beginning for a field. Anything so worthless should quickly shrivel up and die, right? Well, intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%. Thus the thread of failures can be stretched out from years to decades, hidden by the cloak of incompetence.

    Data structures in bioinformatics should be designed for robustness and ease of re-use by different research teams. But that won't happen unless grant money to support data collection requires it. Open access to data is wonderful, but it is only the first step toward open science.

  • Without the code, it's hand-waving

    Mon, 2012-09-03 23:03 -- John Hawks

    A new post by C. Titus Brown is worth reading: "Anecdotal science"

    I'm starting to notice that a lot of bioinformatics is anecdotal.

    People publish software that "works for them." But it's not clear what "works" means -- all too often either the exact parameters or the specific evaluation procedure is not provided (and yes, there's a double standard here where experimental methods are considered more important than computational methods).

    This means that their result is not an example of computational science. It's an anecdote.

    He gives an example and discusses the real cost, which is that a published advance really doesn't advance anything, because everyone else has to spend so much time trying to get the code to work for their projects.

    Time after time I'm reminded of my conversation with the big data astronomer, who reflected that his friends who are biologists complain that students are all being trained in computer programming instead of biology. Compared to astronomy, he said, biologists don't have a data problem at all.

    Clearly, bioinformatics isn't taking seriously the need to really engineer software, with documentation and standard programming interfaces.

  • Sequencing is outpacing computing

    Wed, 2011-11-30 23:36 -- John Hawks

    The New York Times notices DNA sequencing's Malthusian trap: "DNA sequencing caught in deluge of data."

    That is a decline [in sequencing costs] by a factor of more than 800 over four years. By contrast, computing costs would have dropped by perhaps a factor of four in that time span.

    The lower cost, along with increasing speed, has led to a huge increase in how much sequencing data is being produced. World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high, according to Michael Schatz, assistant professor of quantitative biology at the Cold Spring Harbor Laboratory on Long Island.
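    To put the two quoted factors on a common scale, here is a back-of-the-envelope sketch that converts each into a halving time. It assumes a constant exponential decline across the four years, which the article does not claim, and the function name is only illustrative.

```python
import math

def halving_time_years(fold_decline, span_years):
    """Years needed for cost to halve, assuming a constant exponential decline."""
    return span_years * math.log(2) / math.log(fold_decline)

sequencing = halving_time_years(800, 4)  # factor of 800 over four years
computing = halving_time_years(4, 4)     # factor of 4 over four years

print(f"sequencing cost halves every {sequencing * 12:.1f} months")
print(f"computing cost halves every {computing * 12:.1f} months")
```

    Under that assumption, sequencing cost halves roughly every five months while computing cost halves every two years.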

    I have spoken with several scientists in other fields, like astronomy and particle physics, who deal with truly big datasets. Until now, biology data has actually been pretty small potatoes compared with the sheer amount pumped out by large projects in other fields. But that's changing. The Times article points out a unique aspect of the data problem in genetics: There are now thousands of labs that can generate large datasets, many of which have no special plan for data archiving or availability.

    “Google has enough capacity to do all of genomics in a day,” said Dr. Schatz of Cold Spring Harbor, who is trying to apply Google’s techniques to genomics data. Prodded by Senator Charles E. Schumer, Democrat of New York, Google is exploring cooperation with Cold Spring Harbor.

    Google’s venture capital arm recently invested in DNAnexus, a bioinformatics company. DNAnexus and Google plan to host their own copy of the federal sequence archive that had once looked as if it might be closed.

    I don't see Google as a deus ex machina for this one -- although I do observe that several other big data projects are sponsored by large Microsoft investors or founders.

  • The problems of computer-aided biologists, 1

    Wed, 2010-03-17 18:53 -- John Hawks

    On the subject of modeling in genetics, John Timmer of Ars Technica has been running an excellent series on the challenges of computer models in biology. I'll devote a few words to some of these articles in the next several days.

    An article from earlier this winter, "Keeping computers from ending science's reproducibility," discusses the problems with replicability. Data from genomes and genotyping platforms go through frequent revisions, so that the same methods may lead to different results depending on the version of the dataset. The results are not replicable, in other words, and it may be very hard to track down exactly why slight differences persist. It's also hard to verify that the methods are working the same way when the same results aren't found -- this is not like the problem of significant digits in measurement.

    That problem is compounded when it comes to analytical methods:

    An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.

    "Getting it to work" is too often the major goal in human genetics, where in-house development of population history models is the norm. Rigorous validation of these models is beyond any single lab's purview; to be published, it is enough to cite prior art.

    The end of the article includes some reporting on possible solutions, including this:

    Even if we solve the legal and computational portions of the problem, however, we're going to run into issues with the fact that many of the people who use computational tools understand what they do, but don't feel compelled to learn the math behind them. That's where a paper in the latest edition of Science comes in. Its author, Jill Mesirov of the Broad Institute, describes how many biologists aren't well versed in computational analysis, but are increasingly reliant on tools created by those who are; she then goes on to describe one type of solution, called GenePattern, that she and her colleagues put together with the help of Microsoft Research.

    The idea is to "embed" the actual bioinformatic research methods into the paper, as one would embed a spreadsheet into a Word document. That way, anyone who reads the paper could just run an active version of the methods, to verify the results were accurate, and (potentially) play with the parameters.

    Not a bad idea for the toy example, but for simulations that take days or more to run, it isn't going to be practical. What we need is people to learn the math, not people to dumbly click buttons in a paper.

    The specific idea of an interactive workflow is implemented fairly well in the Galaxy bioinformatics platform. There are definite strengths to that approach -- most importantly, for simple operations it can be incredibly useful to have a running record of what you've done, so that you can get it again yourself. But an equivalent record can fairly easily be accomplished using Python, Perl or any other scripting language. A risk of an online system is that it runs into the versioning problem very quickly -- interactive downloads may bring inconsistent datasets that use different genome draft assemblies, for example.
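    As a rough illustration of the kind of running record a script can keep, here is a minimal sketch. Everything in it is hypothetical: the function names, the step name `filter_snps`, the parameter `min_maf` and the assembly label are placeholders, not part of Galaxy or any real pipeline. The checksums are there so that a later run can tell whether an input dataset has silently changed versions.

```python
import hashlib
import json
import sys
import time

def sha256_of(path):
    """Checksum of an input file, so a later run can tell if the data changed."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_step(log, name, params, inputs=()):
    """Append one analysis step, its parameters and input checksums to the log."""
    log.append({
        "step": name,
        "params": params,
        "inputs": {path: sha256_of(path) for path in inputs},
        "python": sys.version.split()[0],
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })

if __name__ == "__main__":
    log = []
    # Hypothetical step and parameter values, for illustration only.
    record_step(log, "filter_snps", {"min_maf": 0.05, "assembly": "hg19"})
    print(json.dumps(log, indent=2, sort_keys=True))
```

    Writing the log out as JSON next to the results gives anyone rerunning the analysis the parameters, the interpreter version, and a fingerprint of each input file.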

    In any event, much pain can be circumvented with a little math, in many cases. We should make it a priority to get students a common-sense understanding of how genetic parameters relate to each other.

    UPDATE (2010-03-18): Another section of the article is worth discussion. Along the lines of my post from earlier this year regarding the importance of code sharing and transparency ("The bugs will out"), Timmer wrote:

    "You need the code to see what was done," [Victoria Stodden] told Ars. "The myriad computational steps taken to achieve the results are essentially unguessable—parameter settings, function invocation sequences—so the standard for revealing it needs to be raised to that of when the science was, say, lab-based experiment." This sort of openness is also in keeping with the scientific standards for sharing of more traditional materials and results. "It adheres to the scientific norm of transparency but also to the core practice of building on each other's work in scientific research," she said. But the same worries that apply to more traditional data sharing—researchers may have a competitor use that data to publish first—also apply here. In the slides from her talk, she notes that a survey she conducted of computational scientists indicates that many are concerned about attribution and the potential loss of publications in addition to legal issues. (The biggest worry is the effort involved to clean up and document existing code.)

    A lot of the code we use is really rather simple. The coalescent can be implemented in a few lines, and most common alterations of it can be handled with 10-line subroutines. A forward-time simulation can be done in a single line of Python, and again the common alterations don't take too much to implement.
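    To give a sense of scale, here is a minimal sketch of both kinds of simulation. It is not the code from any particular lab: it assumes the standard neutral coalescent with constant population size (time in units of 2N generations) and a neutral Wright-Fisher model with no mutation or selection, and the function names are mine.

```python
import random

def coalescent_times(n, rng=random):
    """Waiting times between coalescent events for a sample of n lineages.

    While k lineages remain, the wait to the next coalescence is
    exponential with rate k*(k-1)/2, in units of 2N generations.
    """
    return [rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)]

def wright_fisher(p0, two_n, generations, rng=random):
    """Forward-time neutral drift: allele frequency in each generation.

    Each generation draws two_n gene copies, each carrying the allele
    with probability equal to the previous generation's frequency.
    """
    freqs = [p0]
    for _ in range(generations):
        copies = sum(rng.random() < freqs[-1] for _ in range(two_n))
        freqs.append(copies / two_n)
    return freqs

rng = random.Random(1)                    # fixed seed so the run can be repeated
times = coalescent_times(10, rng)
print("time to most recent common ancestor:", sum(times))
print("final allele frequency:", wright_fisher(0.5, 200, 50, rng)[-1])
```

    Averaged over many replicates, the summed waiting times should approach the theoretical expectation 2(1 - 1/n), which is a quick sanity check on the coalescent function.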

    There are radically more complicated models in use, and we should direct more attention to making these human-readable, separating modular elements so that they can be run with different simulation engines, and making clear distinctions between functional code, parameters, and data. I've been doing this long enough to know how simple it can be to hard-wire your parameters into the code, undocumented, so that nobody can figure out what is going on but the author. That's not where you want to be.
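    One way to keep parameters out of the functional code is sketched below. The class, the field names and the default values are all illustrative assumptions, not values from any published model; the calculation is Watterson's standard expectation for the number of segregating sites, chosen only because it is short.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelParams:
    """Every tunable value lives here, named and documented (illustrative defaults)."""
    pop_size: int = 10000         # diploid effective population size
    mutation_rate: float = 1e-8   # per site per generation
    sequence_length: int = 1000   # number of sites
    sample_size: int = 10         # sampled chromosomes

def load_params(text):
    """Build parameters from a JSON string; unknown keys raise TypeError."""
    return ModelParams(**json.loads(text))

def expected_segregating_sites(params):
    """Watterson's expectation: theta * sum_{i=1}^{n-1} 1/i, with theta = 4*N*mu*L."""
    theta = 4 * params.pop_size * params.mutation_rate * params.sequence_length
    harmonic = sum(1 / i for i in range(1, params.sample_size))
    return theta * harmonic

params = load_params('{"pop_size": 20000, "sample_size": 20}')
print(params)
print(expected_segregating_sites(params))
```

    Because the parameters arrive as data, the same function can be rerun with a different JSON file, and the file itself documents what was used.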

  • Microsoft tries to patent the comparative method

    Sun, 2009-08-09 11:31 -- John Hawks

    Elizabeth Pennisi writes in Science about a case where computer scientists and their lawyers are bumbling through biology:

    Patent 20090030925 was filed by Microsoft researcher Stuart Ozer, an expert in databases, in July 2007. Ozer says he wanted to apply database technologies to complex problems in biological sciences: "I saw an opportunity to create a new approach in analyzing sequence data when phylogenetic information was available," he says.

    The patent application describes a way to use biological data that has been organized according to evolutionary relatedness. It includes methods for counting evolutionary events and grouping positions within molecules. However, "this patent is written in such broad language that it appears to swallow up any activity that involves understanding biodiversity through phylogenetics," says William Piel, a phylogeneticist at Yale University. He points out that such analyses date back to Charles Darwin, who sketched the first evolutionary tree; today, more than 350 phylogeny software packages are available on the Web. "Microsoft might as well patent the multiplication tables," Piel says.

    The only novelty in this case is that systematists are likely to take it personally. Biomedical researchers already work in a patent-rich environment; museum researchers in taxonomy do not -- at least, not yet. But with genomics and bioinformatics, the field is ripe for colonization by enterprising software developers whose companies' lawyers will be looking to protect their time. And let's not forget that universities have been stomping into the patent game.

    Seems to me that somebody needs to fund a few systematists to outline prior art in the area, so that the field is protected from overly broad patents that might stifle the development of new research methods.

    Maybe there's a bright side: There's no way that Microsoft could develop anything worse than the ICZN.

    References:

    Pennisi E. 2009. Systematics researchers want to fend off patents. Science 325:664. doi:10.1126/science.325_664
