High school genomics

Ronald Bailey writes in the January Reason about his experiences with personal genomics (“I’ll Show You My Genome. Will You Show Me Yours?”). He’s a booster, and much of the article is a review of basic objections (privacy concerns, weakness of gene-phenotype associations, imprecision) and some replies to them. He has several passages worth quoting, including this one:

Some time before the end of this decade, kids are going to be running gene scans and maybe even whole genome sequencing experiments in their ninth-grade biology classes, just the way some of us did blood typing experiments back in the mid-20th century. Then they are going to share that information with their friends on whatever social media follow Facebook and Twitter, and they’ll do it without parental consent. Nerdy high school sweethearts might swap DNA profiles and run them through computer programs designed to predict what their potential children might look like. In the process, of course, they will also be sharing information about their parents’ genes.

We’re just starting a new decade, I had to remind myself. Gene chips probably will be cheap enough then to run in high school labs. Is there anyone who’s thinking about the need to teach high school kids about factor analysis? Bayesian inference? Because I find a twisted appeal in the idea that postdocs now are doing what high school science projects will be about in ten years.

I’ve thought for a long time that most of the basic analysis of genomes is undergraduate-level work. Most of the effort is learning how to use software, which is not mathematically demanding but does take time.

Writing the software is a different issue. But as we apply the same techniques to more and more organisms, there will be no new software to write for most analyses. Plug in your data, assuming that you’ve been sensible enough to define an appropriate sampling strategy, and the software will give you an answer.

Consider a time when genotyping can be done for $2 a chip in bulk. Each year, a new chip design is distributed to high schools across a state. One year, it may be dandelions. The kids sample yards across the state, collect plant phenotype data, and submit data to a common pool. Dispersal patterns, flowering time, and other phenotypes are all possible targets of study. A structured population enables them to stratify their sample, exploit linkage due to historical events, and study traits linked to biological invasion.
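
To make the stratification point concrete, here is a minimal sketch of one thing such software might do with the pooled data. Everything in it is an assumption for illustration: the record layout (region, allele count at a single marker, flowering day), the made-up numbers, and the function name `within_stratum_slope`. It estimates how much flowering time shifts per allele copy, comparing plants only within the same region so that regional differences in both allele frequency and climate don't masquerade as a genetic effect.

```python
from collections import defaultdict

def within_stratum_slope(records):
    """Pooled within-stratum least-squares slope of phenotype on genotype.

    records: iterable of (stratum, genotype, phenotype) tuples, where
    genotype is an allele count (0, 1 or 2) at one hypothetical marker.
    Equivalent to a regression with a separate intercept per stratum.
    """
    by_stratum = defaultdict(list)
    for stratum, genotype, phenotype in records:
        by_stratum[stratum].append((genotype, phenotype))

    sxy_total = 0.0  # summed within-stratum cross-products
    sxx_total = 0.0  # summed within-stratum genotype sums of squares
    for pairs in by_stratum.values():
        n = len(pairs)
        mean_g = sum(g for g, _ in pairs) / n
        mean_p = sum(p for _, p in pairs) / n
        sxy_total += sum((g - mean_g) * (p - mean_p) for g, p in pairs)
        sxx_total += sum((g - mean_g) ** 2 for g, _ in pairs)

    if sxx_total == 0:
        raise ValueError("no genotype variation within any stratum")
    return sxy_total / sxx_total

# Invented pooled submissions: (region, allele count, flowering day of year).
# The allele is commoner in the north, where everything flowers later anyway.
records = [
    ("north", 1, 112), ("north", 2, 114), ("north", 2, 114),
    ("south", 0, 100), ("south", 0, 100), ("south", 1, 102),
]

# Ignoring structure: treat the whole state as a single stratum.
naive = within_stratum_slope([("all", g, p) for _, g, p in records])
# Respecting structure: compare plants only within their own region.
stratified = within_stratum_slope(records)

print("naive slope (days per allele copy):", naive)
print("stratified slope (days per allele copy):", stratified)
```

With these invented numbers the naive estimate comes out at about 7 days per allele copy, while the within-region estimate is about 2. The difference is entirely the north-south gap. A real analysis would need many markers, standard errors, and a model of relatedness, none of which is attempted here.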

For the price of one R01 grant, kids across a whole state might develop a new model organism, learn the principles of genomics and produce the data equivalent of dozens of research papers.

(via Razib)

UPDATE (2010-12-15): A reader writes quizzically:

I can't figure out what you are saying here. That it's all so simple that high school kids will understand it without any training in statistics? That all possible analyses of genomic data have already been devised, and all that's left is to turn the crank? Maybe I'm just dense, but I think you need to describe the twisted appeal you're experiencing, not just report it. What good will the data be from gillions of dandelion gene chips, if the kids don't have the time to measure umpteen different dandelion phenotypes to correlate with the gene data? Whose judgment will decide which traits to consider, and will high school teachers have that judgment? Etc. Are you saying the software already exists to correlate (or fail to do so) the mountains of new human gene chip data with all of the subjects' medical and life history data? Or are you saying that this is exactly the problem? I'm honestly not sure if you are in frank trans-humanist pro-technocracy mode, or if you are ironically alluding to its liabilities.

Never assume a blog post has a well-formed point.

I think the potential study I describe is one with enormously more power than anything being done today on plant dispersal, and with power at least equal to the best work on gene-phenotype associations in model organisms (setting aside developmental biology).

Kids in school aren’t statisticians, but thousands of them do have brute force on their side. I don’t see any obvious reason why software can’t be written to spit out these answers. Naturally that software will have to make lots of assumptions, which means that somebody is going to have to design a sensible sampling scheme that can be carried out by students, allowing for their lack of training. It’s an educational challenge, but I’d say it’s doable.

This means, of course, that in 10 years the statistics that support this kind of study won’t be interesting to the kinds of people who write such software. The science progresses. I hope that in 10 years the real scientists will be doing something a little better than what the software will be able to spit out.

At the same time, I think we have to acknowledge that most of what today’s genomics postdocs are doing is exactly the kind of analysis that I’m describing for high school kids in 2020, except with much smaller, poorly designed samples. What makes this Ph.D.-level work is that our current software is not very good at it – in large part because the current software is mostly written by postdocs with little training in systems design.