john hawks weblog

paleoanthropology, genetics and evolution

software

  • Without the code, it's hand-waving

    Mon, 2012-09-03 23:03 -- John Hawks

    A new post by C. Titus Brown is worth reading: "Anecdotal science"

    I'm starting to notice that a lot of bioinformatics is anecdotal.

    People publish software that "works for them." But it's not clear what "works" means -- all too often either the exact parameters or the specific evaluation procedure is not provided (and yes, there's a double standard here where experimental methods are considered more important than computational methods).

    This means that their result is not an example of computational science. It's an anecdote.

    He gives an example and discusses the real cost, which is that a published advance really doesn't advance anything, because everyone else has to spend so much time trying to get the code to work for their projects.

    Time after time I'm reminded of my conversation with the big data astronomer, who reflected that his friends who are biologists complain that students are all being trained in computer programming instead of biology. Compared to astronomy, he said, biologists don't have a data problem at all.

    Clearly, bioinformatics isn't taking seriously the need to really engineer software, with documentation and standard programming interfaces.
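
    One way to make "works for them" checkable is to save the exact parameters alongside every run. The following is a minimal Python sketch and not anything from Brown's post: the function names, the tool name and the parameter names are all hypothetical, and a real pipeline would need to record more (dependency versions, random seeds, the evaluation procedure itself).

```python
import hashlib
import json
import platform


def build_manifest(tool_name, tool_version, params, input_bytes):
    """Collect what someone else would need to repeat this run.

    All arguments are illustrative: `params` is whatever dictionary of
    settings the analysis used, `input_bytes` is the raw input data.
    """
    return {
        "tool": tool_name,
        "version": tool_version,
        "python": platform.python_version(),
        "params": dict(sorted(params.items())),
        # A checksum pins down exactly which input file was analyzed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


def manifest_json(manifest):
    """Serialize the manifest so it can be published with the results."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical tool and settings, for illustration only.
    manifest = build_manifest(
        "example-assembler", "0.1", {"k": 31, "min_coverage": 5}, b"ACGT\n"
    )
    print(manifest_json(manifest))
```

    Publishing a file like this with the paper would not make the software well engineered, but it would at least turn the anecdote into something another lab could rerun.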

  • Mailbag: Graphing software

    Thu, 2012-03-08 01:01 -- John Hawks

    Re: graphics

    I've enjoyed reading your blog for a while now as I like the anthropological take on genomic data. A post back in February ( http://johnhawks.net/weblog/reviews/neandertals/neandertal_dna/1000-geno... ) was accompanied by some of the more attractive bar plots I've seen (nice alpha, great fonts) -- can you divulge what software you used?

    Thanks for the kind words!

    These and most of my graphs are done with Mathematica. The fonts are in the PT Sans family, which are free from Google Fonts. The color scheme is stock. I composite almost all my graphs in Illustrator, and in particular I add nearly all the data labels that way. I could do the labels programmatically, but I find it easier to just label by hand.

    This post on heritability has some xy plots also from Mathematica:

    http://johnhawks.net/explainer/stats/heritability-and-stature

  • Tweets will find a way

    Tue, 2010-09-21 16:24 -- John Hawks

    A Twitter virus emerged within the 140 character limit:

    The exploit was fairly simple, but remarkably effective. Somebody found a bug in the Twitter.com website that allowed them to insert simple bits of JavaScript – a programming language that lets people add interactivity to web pages – into messages or Tweets sent on the service. The code was able to detect when the user's mouse passed over the tweet, and trigger a retweet. By hijacking user input in this way, the Twitter hack code was able to replicate itself. And so a new artificial life form of tenuous sorts was born.

    Given the transmission by retweeting, it would be interesting to see how the networks of followers facilitated or impeded its spread. There are many hub individuals with hundreds or thousands of followers. But the most widely followed may not themselves do much reading of tweets, and so may not have been very susceptible to spreading the virus. Smaller networks of frequent readers and retweeters would be better vectors. Still, eventually this bug got to the big twits:

    On Twitter, the spread of the worm to a highly connected person or people may have been enough to tip infection rates over that threshold and allow it to break out into the wider world. It may not be a coincidence that around the time the second peak was building Sarah Brown was infected, retweeting the bug to her 1.1m followers like a virtual Typhoid Mary.
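
    The point about hubs versus frequent readers can be explored with a toy simulation. This is a rough Python sketch under made-up assumptions, not a model of the actual worm: the follower graph, the account names and the per-account "read probability" (the chance that someone actually mouses over an infected tweet) are all invented for illustration.

```python
import random


def simulate_worm(followers, read_prob, seed_user, rng):
    """Spread a self-retweeting tweet through a follower graph.

    followers[u] lists the accounts that see u's tweets.
    read_prob[u] is the assumed chance that u mouses over an
    infected tweet when one shows up in their timeline.
    """
    infected = {seed_user}
    frontier = [seed_user]
    while frontier:
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, []):
                if follower in infected:
                    continue
                # One exposure per infected account the follower follows.
                if rng.random() < read_prob.get(follower, 0.0):
                    infected.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return infected


def toy_network(hub_audience=100):
    """Three frequent readers who follow each other, plus one big hub."""
    fans = [f"fan{i}" for i in range(hub_audience)]
    followers = {
        "a": ["b", "c"],
        "b": ["c", "hub"],  # the hub follows b
        "c": ["a"],
        "hub": fans,
    }
    return followers, fans


if __name__ == "__main__":
    followers, fans = toy_network()
    rng = random.Random(0)
    readers = {name: 1.0 for name in ["a", "b", "c", *fans]}
    # Hub never reads its timeline: the worm stays in the small cluster.
    print(len(simulate_worm(followers, {**readers, "hub": 0.0}, "a", rng)))
    # Hub does read: one infection reaches the whole audience.
    print(len(simulate_worm(followers, {**readers, "hub": 1.0}, "a", rng)))
```

    With the hub's read probability set to zero the infection stays among the three frequent readers; set it to one and the hub's whole audience is exposed. Intermediate probabilities and a real follower graph would be needed to say anything about the actual outbreak.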

  • Online toolkits -- the good and the frustrating

    Sun, 2010-02-21 14:51 -- John Hawks

    In pursuit of my DIY genomics posts, I've been playing around with the Galaxy bioinformatics web tools. The team responsible for the South African genomes published the data to Galaxy, and their uploads are easy to get -- either to download, or to work with the online Galaxy platform.

    Working with a resource like this helps to illustrate both how tremendously useful bioinformatics tools can be and how frustrating it can be to figure them out. Some things are a breeze, although others are completely obscure. Documentation for the uploads is skimpy so far -- one thing that drove me up the wall is that SNPs are listed by genome, but without indicating genotypes -- is the individual a homozygote or a heterozygote? The paper by Schuster and colleagues describes their genotype calling procedure, but the results turn out not to be posted along with their other data. I'm sure they'll become available as the data are updated, but I did waste some time figuring out how the releases correspond to descriptions in the online supplementary material from the paper.

    Despite occasional frustrations, we seem to be heading in the direction of all-in-one online bioinformatics toolkits. Galaxy, for example, lists several advantages on a promo page. A couple of entries:

    Now your results are reproducible! | When publishing results, replace “the data were analyzed using a collection of in-house scripts” with a URL pointing to Galaxy’s history. Your reviewers will have no further questions. That’s reproducible genomics!

    ...

    No tools for new datatypes | Some datatypes generated by high throughput genomics are so new that there are no tools to analyze them. For example, how do you extract sequences of coding exons from the latest 28-way alignments of vertebrate genomes or analyze quality scores from 454/Solexa/SOLiD? With Galaxy.

    I live at the mathematical end of this stuff. I work with models of populations and assume that sequences are known, you know, as if we looked at them and read off the ACGT's. But in reality, a lot of complexity lies between models and the biochemistry. Going from sequencing reads to genomes, and aligned genomes, involves a lot of analysis. Many of the details differ entirely between different sequencing platforms. As we continue to move toward whole-genome analyses of populations and other species, it's really important to have an abstraction that allows for different underlying sequencing models, while allowing replication of the population genetics modeling.
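
    As a minimal sketch of what such an abstraction might look like, here is some Python in which the population-level calculations only ever see genotypes, while the platform-specific work hides behind a common interface. Everything here is an assumption for illustration: the class names, the site label, and especially the naive read-count genotype caller, which is not how any real sequencing pipeline (or the Schuster paper) calls genotypes.

```python
from collections import Counter
from typing import Dict, Iterable, List, Protocol, Tuple

Genotype = Tuple[str, str]  # two alleles at one site, e.g. ("A", "G")


class GenotypeSource(Protocol):
    """Anything that can report genotypes for a site, one per individual."""

    def genotypes(self, site: str) -> Iterable[Genotype]: ...


class CalledGenotypes:
    """Hypothetical adapter for data where genotypes were called upstream."""

    def __init__(self, calls: Dict[str, List[Genotype]]):
        self._calls = calls

    def genotypes(self, site: str) -> Iterable[Genotype]:
        return self._calls[site]


class ReadCountCaller:
    """Hypothetical adapter that calls genotypes from allele read counts.

    Deliberately naive: an individual is a heterozygote if the second
    most common allele makes up at least `min_fraction` of the reads.
    """

    def __init__(self, counts: Dict[str, List[Dict[str, int]]],
                 min_fraction: float = 0.2):
        self._counts = counts
        self._min_fraction = min_fraction

    def genotypes(self, site: str) -> Iterable[Genotype]:
        for read_counts in self._counts[site]:
            ranked = Counter(read_counts).most_common(2)
            total = sum(read_counts.values())
            first = ranked[0][0]
            if len(ranked) == 2 and ranked[1][1] / total >= self._min_fraction:
                a, b = sorted((first, ranked[1][0]))
                yield (a, b)
            else:
                yield (first, first)


def allele_frequency(source: GenotypeSource, site: str, allele: str) -> float:
    """Frequency of `allele` among all chromosomes sampled at `site`."""
    alleles = [a for genotype in source.genotypes(site) for a in genotype]
    return alleles.count(allele) / len(alleles)


def observed_heterozygosity(source: GenotypeSource, site: str) -> float:
    """Fraction of individuals carrying two different alleles at `site`."""
    gts = list(source.genotypes(site))
    return sum(a != b for a, b in gts) / len(gts)


if __name__ == "__main__":
    reads = ReadCountCaller({"site1": [
        {"A": 30}, {"A": 14, "G": 16}, {"G": 25, "A": 1}, {"A": 10, "G": 10},
    ]})
    print(allele_frequency(reads, "site1", "A"))
    print(observed_heterozygosity(reads, "site1"))
```

    The two population functions never need to know which platform produced the data, so the same modeling can be rerun when the underlying genotype calls are revised.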

    The disadvantage of a single widely used tool is that it can limit creativity and lock people into a certain way of processing data. Locked-in assumptions sometimes lead to wrong conclusions -- as we've seen in human genetics many times over the years. But the advantage is that it allows everybody access to the same methods and data, so that results can be replicated and augmented with new observations.

  • R profiled in NY Times

    Thu, 2009-01-08 07:45 -- John Hawks

    If you do much statistics and haven't worked with R, you should try it out. The NY Times profiled the software yesterday:

    R is similar to other programming languages, like C, Java and Perl, in that it helps people perform a wide variety of computing tasks by giving them access to various commands. For statisticians, however, R is particularly useful because it contains a number of built-in mechanisms for organizing data, running calculations on the information and creating graphical representations of data sets.

    ...

    What makes R so useful — and helps explain its quick acceptance — is that statisticians, engineers and scientists can improve the software’s code or write variations for specific tasks. Packages written for R add advanced algorithms, colored and textured graphs and mining techniques to dig deeper into databases.

    The graphs are pretty, and it's free software. The article describes it as a "lingua franca" for grad students. Maybe not, but I wouldn't invest my time learning anything less powerful.

  • Product recommendation: PDF to Keynote

    Tue, 2008-09-16 21:29 -- John Hawks

    I've never endorsed a product before, but I have to tell you that for the past couple of weeks I have loved, loved, loved PDF to Keynote. It does just what it says -- it takes a PDF presentation and makes a Keynote presentation out of it. It's free software.

    For whatever reason, Keynote doesn't import multi-page PDFs. I've been making Beamer presentations for one of my classes, and create PDF output. So now I can translate these into Keynote presentations to use the presenting tools there, with the Beamer-generated slides and outlines. It's a simple tool, and it does one thing well. Pretty cool.

    I should mention, although I like Keynote for presentations, it's never available at conferences. I always show my conference presentations as PDFs. PowerPoint always messes up something -- graphics don't import, colors change, fonts are the wrong size so text falls off the edge of the slide. This happens whether you're going from Keynote to PowerPoint or from one version of PowerPoint to another. Those computers always have Acrobat installed, and you can show a PDF of a presentation full screen. Nothing will change; it will look just like it did where you created it.

