john hawks weblog

paleoanthropology, genetics and evolution

data access

  • Goodall record digitization

    Mon, 2011-03-28 22:05 -- John Hawks

    Jason Goldman covers the acquisition of Gombe chimpanzee records from the Jane Goodall Institute by Duke University ("Digitizing Jane Goodall's legacy at Duke").

    Now, researchers at Duke University are taking more than twenty file-cabinets full with fifty years of check-sheets, longhand narratives in both English and Swahili, hand-drawn maps, videos, and photos, and carefully digitizing everything. This will allow researchers to construct searchable life-histories of the chimpanzees of Gombe, for the first time. The word "archives" is a bit misleading, though. The new Jane Goodall Institute Research Center at Duke is continuing to receive new data from Gombe, which will all become digitized and included in the collection as well.

    The move toward digitizing and making primate field records available has been a major challenge for primatology. Different research teams have legacies of partially incompatible records, which complicates the process of comparing data from different sites and different species. My UW-Madison colleague Karen Strier, together with many of the leading figures in primate field research, has been involved for several years in an effort to bring life history records from different primate species together. One of the first tangible results of the collaboration is a paper that appeared earlier this month in Science by Anne Bronikowski and colleagues [1].

    Seems to me that this kind of archiving is absolutely essential to our ability to study primate behavior in the future. Not least, data archives will be necessary to document the effect of range contractions and habitat fragmentation on primate behavior. Openness is difficult to negotiate in these contexts, because of the long-term effort put into data collection. But in thirty years, these archives will not be useful unless they are extended and put into accord with formats that are widely used. Goldman describes the idiosyncrasies of Goodall's data, and many other field projects have similar traditions that differ from each other. Without building a larger community capable of understanding these records, the data may be as useful as WordStar files from 1981.


    References

  • Genomes unzipped, unzipped

    Mon, 2010-10-11 14:49 -- John Hawks

    Genomes Unzipped has finally unzipped:

    From today, we’ll be making all of our raw genetic data and the reports generated from these tests freely available online. As the project proceeds, we aim to obtain data from an ever larger array of tests – ultimately extending to whole-genome sequencing – and release it openly. Right now you can freely download the 23andMe data from everyone in the project from this website.

    It's a great project, putting personal stories and reactions together with a scientific view on genotype data. It's also the perfect topic for a blog -- just the right amount of navel-gazing. It's worth doing just to make you figure out how to use the browser software.

    What I wonder is, how much will personal genomics be like nude beaches? I mean, it's been a long time since the first nude beaches, but most people don't take advantage of the opportunity. Clearly, there's variation in different countries! But most people neither feel compelled to see others' data nor feel comfortable sharing their own.

    Well, they used the word unzipped, not me!

  • NSF to require data access plan

    Thu, 2010-05-06 12:14 -- John Hawks

    Science Insider reports that the National Science Foundation is going to make a "data management plan" a requirement of every grant application.

    NSF's current policy requires grantees to share their data within a reasonable length of time so long as the cost is modest. "That's nice, but it doesn't have much teeth," said Seidel. Under the new policy, which is expected to be unveiled this fall, a researcher would submit a data management plan as a two-page supplement to any regular grant proposal. That would make it an element of the merit review process.

    NSF wants to avoid a one-size-fits-all approach to the issue, Seidel explained, because each discipline has its own culture about data-sharing. "A scientist might say that my plan is that I don't need one, because I don't save my data," he told the board committee, which has just formed a task force on data policy. "The important thing is that it puts people on notice that they have to think about it, maybe for the first time."

    It sounds to me like it still doesn't "have much teeth." The kind of scientist he describes, who "doesn't need a plan", doesn't need any federal money, either.

    I mean, seriously -- they're going to "put people on notice that they have to think about it"? Give me a break.

  • Online toolkits -- the good and the frustrating

    Sun, 2010-02-21 14:51 -- John Hawks

    In pursuit of my DIY genomics posts, I've been playing around with the Galaxy bioinformatics web tools. The team responsible for the South African genomes published the data to Galaxy, and their uploads are easy to get -- either to download or to work with on the online Galaxy platform.

    Working with a resource like this helps to illustrate both how tremendously useful bioinformatics tools can be and how frustrating it can be to figure them out. Some things are a breeze, although others are completely obscure. Documentation for the uploads is skimpy so far. One thing that drove me up the wall is that SNPs are listed by genome, but without indicating genotypes: is the individual a homozygote or a heterozygote? The paper by Schuster and colleagues describes their genotype calling procedure, but the results turn out not to be posted along with their other data. I'm sure they'll become available as the data are updated, but I did waste some time figuring out how the releases correspond to descriptions in the online supplementary material from the paper.
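
    To make the missing piece concrete, here is a minimal sketch of what a genotype column would let a reader do. The record layout, the SNP identifiers, and the function name are all made up for illustration; they are not the format of the Galaxy uploads or the calling procedure in the paper.

```python
# Hypothetical sketch: the record format is an assumption, not the layout
# of the Galaxy uploads discussed in the post.

def zygosity(genotype: str) -> str:
    """Classify a two-letter diploid genotype call such as 'AG' or 'CC'."""
    alleles = genotype.strip().upper()
    if len(alleles) != 2 or any(a not in "ACGT" for a in alleles):
        raise ValueError(f"not a diploid SNP genotype: {genotype!r}")
    # Two identical alleles make a homozygote; two different ones a heterozygote.
    return "homozygote" if alleles[0] == alleles[1] else "heterozygote"


# Made-up calls keyed by made-up SNP identifiers.
calls = {"snp_1": "AA", "snp_2": "AG", "snp_3": "TT"}
summary = {snp: zygosity(gt) for snp, gt in calls.items()}
print(summary)
```

    With only a list of variant positions per genome, this classification is impossible, which is the gap described above.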

    Despite occasional frustrations, we seem to be heading in the direction of all-in-one online bioinformatics toolkits. Galaxy, for example, lists several advantages on a promo page. A couple of entries:

    Now your results are reproducible! | When publishing results, replace “the data were analyzed using a collection of in-house scripts” with a URL pointing to Galaxy’s history. Your reviewers will have no further questions. That’s reproducible genomics!

    ...

    No tools for new datatypes | Some datatypes generated by high throughput genomics are so new that there are no tools to analyze them. For example, how do you extract sequences of coding exons from the latest 28-way alignments of vertebrate genomes or analyze quality scores from 454/Solexa/SOLiD? With Galaxy.

    I live at the mathematical end of this stuff. I work with models of populations and assume that sequences are known, you know, as if we looked at them and read off the ACGT's. But in reality, a lot of complexity lies between models and the biochemistry. Going from sequencing reads to genomes, and aligned genomes, involves a lot of analysis. Many of the details differ entirely between different sequencing platforms. As we continue to move toward whole-genome analyses of populations and other species, it's really important to have an abstraction that allows for different underlying sequencing models, while allowing replication of the population genetics modeling.
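
    One way to picture that kind of abstraction is sketched below, under my own assumptions. The class and function names are hypothetical and do not come from Galaxy or any sequencing toolkit. The platform-specific work hides behind a small interface, and the population genetics calculation only ever sees finished genotype calls.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class GenotypeSource(ABC):
    """Hypothetical interface: hides how a platform turns reads into calls."""

    @abstractmethod
    def alt_allele_counts(self) -> Iterable[int]:
        """Yield 0, 1, or 2 copies of the alternate allele per individual."""


class PrecalledGenotypes(GenotypeSource):
    """Stand-in for any platform-specific pipeline; it just stores calls."""

    def __init__(self, counts: List[int]) -> None:
        self._counts = list(counts)

    def alt_allele_counts(self) -> Iterable[int]:
        return iter(self._counts)


def expected_heterozygosity(source: GenotypeSource) -> float:
    """Return 2pq at one biallelic site, without knowing the platform."""
    counts = list(source.alt_allele_counts())
    if not counts:
        raise ValueError("no individuals")
    p = sum(counts) / (2 * len(counts))  # frequency of the alternate allele
    return 2 * p * (1 - p)


# Four made-up diploid individuals at a single site.
site = PrecalledGenotypes([0, 1, 1, 2])
print(expected_heterozygosity(site))
```

    Swapping in a different sequencing pipeline would mean writing another subclass, while the modeling code stays the same and stays replicable.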

    The disadvantage of a single widely used tool is that it can limit creativity and lock people into a certain way of processing data. Locked-in assumptions sometimes lead to wrong conclusions -- as we've seen in human genetics many times over the years. But the advantage is that it allows everybody access to the same methods and data, so that results can be replicated and augmented with new observations.

  • NAS president calls for data sharing

    Sat, 2010-02-06 22:58 -- John Hawks

    Science has a one-page editorial by National Academy of Sciences President Ralph Cicerone. He alludes to the climate change scandals of the last few months, and points to a significant loss of public confidence in science as a result:

    In the wake of the [University of East Anglia] controversy, I have been contacted by many U.S. and world leaders in science, business, and government. Their assessments and those from various editorials, added to results from scattered public opinion polls, suggest that public opinion has moved toward the view that scientists often try to suppress alternative hypotheses and ideas and that scientists will withhold data and try to manipulate some aspects of peer review to prevent dissent. This view reflects the fragile nature of trust between science and society, demonstrating that the perceived misbehavior of even a few scientists can diminish the credibility of science as a whole.

    Cicerone argues that scientists need to shape up. The only way to maintain confidence in the scientific enterprise is to establish "clarity and transparency":

    Clarity and transparency must be reinforced to build and maintain trust—internal and external—in science. Scientists are taught to describe experiments, data, and calculations fully so that other scientists can replicate the research. Last year, the Committee on Science, Engineering, and Public Policy (COSEPUP) of the National Academy of Sciences (NAS), National Academy of Engineering, and Institute of Medicine put forth a framework for dealing with research data,* emphasizing that "Research data, methods and other information integral to publicly reported results should be publicly accessible." Some journals have established policies that require the sharing of materials and data. However, post-publication complaints regarding data sharing persist. Despite many efforts, the scientific community has failed to uniformly integrate these standards into their practices.

    Access to data may not be enough. In the case of climate research, open access to models and software is equally important -- otherwise, results are not replicable. This means greater support must be given from grant agencies for public accessibility and publication of research methods, including software archives.

    It also means that data sharing policies must have some teeth in them. At a minimum, funding renewal should be contingent on meeting the guidelines for data sharing proposed in grant applications. In 2010, there is no reason in the world why the data cannot be downloaded freely from third parties, so that the scientists do not feel "harassed" by requests for information.

    References:

    Cicerone RJ. 2010. Ensuring integrity in science. Science 327:624. doi:10.1126/science.1187612

  • Data mining

    Mon, 2009-10-12 11:15 -- John Hawks

    IBM and Google want students to ditch their laptops and pick up some big iron:

    For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

    “If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about,” said Jim Spohrer, a director at I.B.M.’s Almaden Research Center.

    I love that analogy -- like they're cute little baby ducks learning that their computers are mama.

    Meanwhile, this is all about teaching students how to deal with data-mining software. They believe that the future of science is in being able to use these immense datasets, from sources like genomics and high-throughput astronomy.

    “It sounds like science fiction, but soon enough, you’ll hand a machine a strand of hair, and a DNA sequence will come out the other side,” said Jimmy Lin, an associate professor at the University of Maryland, during a technology conference held here last week.

    The big question is whether the person on the other side of that machine will have the wherewithal to do something interesting with an almost limitless supply of genetic information.

    There's some truth to this. On the other hand, I don't see how this explosion of data is going to create a raft of new jobs for scientists. Sure, IBM and Google want to recruit the best; in their position, who wouldn't? Maybe we'll need fewer clinicians and techs to prep samples for data analysis, and that will shift some jobs to data analysis. But what they're talking about here are software development jobs to support science, not the science itself.

    Yes, geneticists will need to deal with larger datasets, but that means that more instances of small data features will empower them to test certain hypotheses that would have been untestable before. The scientist's job is to think of those hypotheses, work out the logic by which data may refute them, and root the inquiry in existing theory.

    There's a practical aspect to this, where working with large datasets helps to train students to think about data and theory. But the tools we're using now to access datasets will be different in four years, and ten years down the line -- the times when today's beginning students will be entering graduate school, or finishing Ph.D.'s. Those little ducklings are going to need to swim on their own.

  • Open access and fossil reconstruction

    Thu, 2009-10-08 14:30 -- John Hawks

    I would love to be able to say that the Ardipithecus pelvic and cranial reconstructions were open access.

    The reconstruction of fragmentary fossils has in the past been more of an art than a science. An anatomical expert can eliminate some possible morphological configurations based on the remains themselves. But for many, she has only her knowledge of variation in extant species as a reference. The bones she has studied might or might not be representative of anatomical variations; variations within extant taxa might or might not be relevant to ancient species. Working with casts of a fragile specimen is fraught with problems. Missing parts or uncertain joins in the fossil material can be shored up with plasticine. This process poses obvious drawbacks: the resulting reconstruction may present the appearance of features that are in fact completely sculpted out of clay.

    I describe this as an “art” for one important reason: there are many barriers to replicating a reconstruction. The reconstruction ends up including the quirks of other specimens used as reference material, and may have fragments misplaced due to uncertain identifications.

    The pelvic reconstruction of Sts 14 is one example of how the implicit assumptions of a reconstruction can affect the interpretation of a fossil. After the specimen's discovery in 1947, John Robinson reconstructed the distorted os coxae and partial sacrum with a rounded pelvic inlet, more or less like humans. After the discovery of Lucy’s pelvis (AL 288-1), it was clear that A. afarensis had a very broad pelvis, flattened from front to back -- different from Sts 14’s apparently rounded pelvic shape. The original reconstruction of Sts 14 was revisited in the 1990’s, when it was found that a flatter, more Lucy-like shape is consistent with the specimen’s anatomy (Abitbol 1995). The point is not that Robinson’s reconstruction was wrong -- any reconstruction will be wrong in some details. Nor is the point that the reconstruction was not replicable in principle -- at any time, anybody could have sawn apart a few casts of Sts 14 into the component bones, and then built them back in a different shape. The point is that replicating the reconstruction would have been expensive and difficult, so for forty years nobody made the effort.

    With digital scanning, all the expense and difficulty go into producing the initial scans. After that, the only limit on testing a model reconstruction is the time that someone is willing to spend studying the anatomy.

    In principle, this is wonderful. A whole team of researchers can easily share digital models, working on the specimen with an explicitly shared referent. The digital model can be instantly transported anywhere in the world, allowing direct comparisons with original material housed in museums. The existence of such virtual 3-d images of fossil and model allows independent scholars to apply their own models, testing a model’s assumptions without needing to handle and possibly damage the original specimens.
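
    As a rough illustration of how cheap it is to test an alternative placement once scans exist, the sketch below repositions a scanned fragment by rotating and shifting its surface points. Everything here is an assumption for illustration: the coordinates are made up, the function name is hypothetical, and real reconstruction software works on full surface meshes with rotations about arbitrary axes.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def reposition(points: List[Point], angle_deg: float, shift: Point) -> List[Point]:
    """Rotate points about the z axis by angle_deg, then translate by shift."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    dx, dy, dz = shift
    # Standard rotation in the x-y plane followed by a translation.
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in points]


# Two made-up surface points standing in for a scanned fragment.
fragment = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# One candidate placement: turn the fragment 90 degrees and raise it 2 units.
moved = reposition(fragment, 90.0, (0.0, 0.0, 2.0))
print(moved)
```

    Trying a second hypothesis is just another call with different numbers, and the original specimen is never touched.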

    But many of these benefits of the technology depend on scholars being able to access the scans. Today they can’t.

    I’m hopeful that in the future we’ll be able to make full use of the technology — not only enabling a single reconstruction, but multiple reconstructions and widespread comparison of digital models.

  • Mailbag: The Ardipithecus wait

    Sun, 2009-10-04 12:08 -- John Hawks

    I sense a touch of criticism regarding the grand unveiling of Ardi after 15 years' wait. Now I've completed that sentence it makes sense. A large team spent 15 years on one species with a limited number of remains. How long did it take one man, Darwin, to develop, test, communicate regarding and write one of the world's most important books? Remember without cars, email or phones any travel or questions/opinions would take considerably longer than today.

    S

    Well, to be fair, Darwin did correspond and communicate with a much broader range of people, including many critics. This helped him identify weaknesses and errors in his thinking (and others') more quickly....

  • Whoa, who stole the data?

    Sat, 2009-10-03 10:59 -- John Hawks

    OK, as you know I do this thing where I read the supplementary information in papers. I hate doing it; I think they should put the stuff in the actual paper where it belongs, but well, that's life, right?

    Sooooo...I'm reading through the 73 pages of Supplementary Information for the Ardipithecus dental paper...

    Supplementary table S1 from Suwa et al. 2009

    Now let me just explain what's going on here. This is a spreadsheet of all the dental specimens they studied, and all the dental elements that they could measure. And they've entered an "m" in the table if they could measure the specimen, and an "f" if it was too fragmentary to measure. Fair enough.

    But wait a minute. There aren't any measurements. IT'S A DATA TABLE WITHOUT ANY DATA.

    What kind of rinky-dink journal is this?

    They give us descriptive statistics for each tooth, and print the canine measurements necessary to replicate their sex assignment bootstrap program, but they include no other measurements and no plots of measurements that aren't multiplied or divided by others.

    I understand why the authors don't want the numbers published. Without them, there's nothing you can do to compare individuals in the dataset to other samples of fossils. The summary statistics are enough to compare species with A. ramidus tooth by tooth, but not enough to study the relation of different teeth to each other. Some of the authors must want to do this themselves.
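
    A toy example shows why per-tooth summaries can't stand in for the individual measurements. The numbers below are invented, not Ardipithecus data. The two molar samples have identical means and standard deviations, yet their relationship to the canine measurements is exactly opposite.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of measurements."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))


# Invented measurements (mm) for four individuals; not real fossil data.
canine = [8.0, 9.0, 10.0, 11.0]
molar_a = [10.0, 11.0, 12.0, 13.0]  # big canines go with big molars
molar_b = [13.0, 12.0, 11.0, 10.0]  # big canines go with small molars

# The published-style summaries are indistinguishable...
print(mean(molar_a), pstdev(molar_a))
print(mean(molar_b), pstdev(molar_b))

# ...but the relation between teeth differs completely.
print(pearson(canine, molar_a))
print(pearson(canine, molar_b))
```

    Only the specimen-by-specimen table lets anyone tell these two cases apart.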

    What I don't understand is the journal. I mean, it's like some kind of government agent blacked out all the information. It's not like anyone can say it's appropriate to hold the data for a monograph -- there are SEVENTY-THREE PAGES here. It's not even like many of them are secret -- the ones discovered by 1994 have the measurements reported in White and colleagues' 1994 paper. In the current supplement, much of the information presented is valuable, and includes multiples of many of the measurements. I'd expect that any journal would include the measurements, and I routinely require it when I edit papers. That way, other scientists can use the data in comparisons of their own samples, and outsiders can replicate the study's conclusions.

    Don't get me started on the scans....

  • Fossil access editorial

    Mon, 2009-08-24 22:12 -- John Hawks

    The editors of Scientific American offer arguments for greater data and public access to fossils in their current (September 2009) issue: "Fossils for All: Science Suffers by Hoarding". The editorial hits on several issues that I've discussed here over the years:

    In 2005 the National Science Foundation took steps toward setting limits, requiring grant applicants to include a plan for making specimens and data collected using NSF money available to other researchers within a specified time frame. But paleoanthropologists assert that nothing has really changed. And according to Leslie Aiello of the Wenner-Gren Foundation, a major source of private funding for anthropological research, both public and private funding agencies typically lack the resources to enforce access policies, if they have them at all.

    Ultimately, the adoption of open-access practices will depend in large part on paleoanthropologists themselves and the institutions that store human fossils—most of which originate outside the U.S.—doing the right thing. But the NSF, which currently considers failure to make data accessible just one factor in deciding whether to fund a researcher again, should take a firmer stance on the issue and reject without exception those repeat applicants who do not follow the access rules. The agency could also create a centralized database to which researchers could contribute measurements, observations, high-resolution photographs and CT scans—a GenBank for paleoanthropology. And journals could require that authors submit their data prior to publication, as they do with authors of papers containing new genetic sequences.

    The editorial also discusses the ongoing "Lucy" exhibition:

    As for the public display of these fragments of our shared heritage, surely taxpayers, who finance much of this research, deserve an occasional glimpse of them. Irreplaceable objects are routinely transported and displayed. And in countries such as the U.S., where a staggering proportion of the population does not believe in evolution, scientists should embrace the opportunity to share with laypeople the hard evidence for humankind’s ancient roots. The future of science education may depend on it.

    I went cruising back through my archives looking for other posts that might be informative. I highly recommend my essay from the very beginning of the data access rules at NSF, "NSF and data access." Here's a sample:

    If the new policy is to be a success, then the proof of it cannot wait for ten to thirty years. It needs teeth. It needs two or three high-profile grants to be declined because of data access issues. And it needs those cases to be made public, so that everyone can have confidence in the openness of the process. This doesn't mean that the names of the applicants and their alleged sharing violations should be dragged through the press. It does mean that NSF should publish the number of grants (and their proposed funding amounts) declined for failings in the data access plan.

    But more importantly, it needs replication among other granting agencies. A large set of molecular anthropologists have just shown their willingness to completely forego public funding, in order to maintain certain kinds of controls (in this case ethical ones) over their research (See Genographic Project). Will paleoanthropologists do the same? It would be helpful if some of the important private foundations, such as the National Geographic Society, the Leakey Foundation, Wenner-Gren, and others would establish data access provisions also.

    Another helpful idea would be for one of these foundations to establish a data bank. Notice what is missing in the NSF policy is any discussion of a data archive. Other areas of NSF and NIH have such archives and maintain policies of mandatory deposition of data. This is most prominent for genetics, with the GenBank archive and journal publication of most results conditional on mandatory submission of data to the archive. Thus, there is no logical impediment to the creation of such a resource by a federal agency. The fact that they chose not to implement such a policy, I find significant.

    Four years later, I think it's fair to give a synopsis of the results. All NSF grant applications do now include a mandatory section detailing how results will be shared with the public. To my knowledge in paleoanthropology, no grant renewal or follow-up application has been declined for failure to comply with a data access plan. NSF has funded at least one workshop on data sharing in paleoanthropology. There are no CT scans of fossil hominids available for free public download. None.

    The European Union and a number of European institutions have made some good progress toward data availability and database sharing. The NESPOS cooperative is a wonderful step toward CT scan availability. It is not as open as I would like -- this is not a site that your science-fair-inclined high school students can access. But at least professionals can download useful primary data from the site. The University of Vienna's CT archive is also a good (if limited) source. Several European institutions and regional or national projects have databases online -- covering everything from faunal species lists to high-resolution photographs of stone tools.

    Yet, there is nothing to alter what I wrote four years ago:

    The real problem is that twenty to thirty years after many fossils are uncovered, there is no cast availability, little public data access, few financial accommodations to make such access possible. Specialists like me often find ways around these barriers. But I do not think it would be overstating the problem to suggest that perhaps half the people teaching human evolution in four-year universities have never touched a cast of a Hadar fossil. I would be delighted to be proved wrong, but I don't think I am. Our field is educating students into a world in which A. afarensis is unknown in the laboratory and poorly represented in our textbooks. I'm not talking about new specimens, here, I'm talking about fossils that were found in the mid-1970's and monographed in 1982. Nor is this problem limited to early hominids. What proportion of people teaching about the modern human origins problem do you suppose have seen a cast of any "early modern" fossil other than Skhul 5?

    I'm not picking on Ethiopia; the problem is the same for many regions and time periods -- even those with relatively open access to original fossil collections.

    More recently, I looked at the impact of those data access rules, along with the prospect that they might be removed by new legislation: "Congress to repeal open access science provisions?" I don't think that we'll see that action in this session, but it's obvious that a policy with no record of success is always in danger of being rolled back.
