Data mining

IBM and Google want students to ditch their laptops and pick up some big iron:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow. “If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about,” said Jim Spohrer, a director at I.B.M.’s Almaden Research Center.

I love that analogy – like they’re cute little baby ducks learning that their computers are mama.

Meanwhile, this is all about teaching students how to deal with data-mining software. IBM and Google believe that the future of science lies in being able to use these immense datasets, from sources like genomics and high-throughput astronomy.

“It sounds like science fiction, but soon enough, you’ll hand a machine a strand of hair, and a DNA sequence will come out the other side,” said Jimmy Lin, an associate professor at the University of Maryland, during a technology conference held here last week. The big question is whether the person on the other side of that machine will have the wherewithal to do something interesting with an almost limitless supply of genetic information.

There’s some truth to this. On the other hand, I don’t see how this explosion of data is going to create a raft of new jobs for scientists. Sure, IBM and Google want to recruit the best; in their position, who wouldn’t? Maybe we’ll need fewer clinicians and techs to prep samples, and that will shift some jobs to data analysis. But what they’re talking about here are software development jobs to support science, not the science itself.

Yes, geneticists will need to deal with larger datasets, but larger datasets mean more instances of small data features, and those will empower them to test certain hypotheses that would have been untestable before. The scientist’s job is to think of those hypotheses, work out the logic by which data may refute them, and root the inquiry in existing theory.

There’s a practical aspect to this, where working with large datasets helps to train students to think about data and theory. But the tools we’re using now to access datasets will be different in four years, and ten years down the line – the times when today’s beginning students will be entering graduate school, or finishing Ph.D.’s. Those little ducklings are going to need to swim on their own.