Full frontal genomes


In Erika Check's Nature article on celebrity genomes, she includes a passage in which Francis Collins points out a problem with public access to private genomes:

But it's not clear that all of the genome pioneers are acting altruistically. Watson said at the Cold Spring Harbor meeting on 10 May that he has not asked either of his grown sons for permission to publish his genome sequence, which 454 has said will be publicly posted in some form. That has raised questions about the responsibility of sequenced individuals to family members who share their DNA.
"This will be a challenging question, because if you're planning to put this information in a truly open database, there are issues of risk not just to you, but to your relatives," Collins says. "Jim clearly felt those risks were not such as to cause him to take action on them."

Putting your genome information online is not only about you: it reveals half the genome of each of your children, half the genome of each of your parents, on average a fourth of the genome of each of your grandchildren, nieces, and nephews, and so on.
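As a rough sketch of where those fractions come from: the expected shared fraction between two relatives is the sum of (1/2)^n over every line of descent connecting them, where n is the number of parent-child steps on that line. The function name and the relative labels below are my own, chosen for illustration, and the niece/nephew entry assumes a full sibling. Apart from the parent-child case, these are averages, not exact values.

```python
from fractions import Fraction

def expected_shared_fraction(path_lengths):
    """Sum (1/2)**n over each connecting line of descent.

    path_lengths holds, for every path linking the two relatives,
    the number of parent-child steps on that path.
    """
    return sum((Fraction(1, 2) ** n for n in path_lengths), Fraction(0))

# Illustrative labels (not from any standard library or dataset).
RELATIVES = {
    "child": [1],
    "parent": [1],
    "grandchild": [2],
    # Full sibling assumed: one 3-step path via each of your two parents.
    "niece or nephew": [3, 3],
}

for name, paths in RELATIVES.items():
    print(name, expected_shared_fraction(paths))
```

A half sibling's child would have only one connecting path, so the same function would give one eighth instead of one fourth.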

I wrote about this problem two years ago, linking to a New Scientist article that described how a young man had tracked down his biological father -- using DNA samples put online by the man's relatives.

The boy paid FamilyTreeDNA.com $289 for the service. His genetic father had never supplied his DNA to the site, but all that was needed was for someone in the same paternal line to be on file. After nine months of waiting and having agreed to have his contact details available to other clients, the boy was contacted by two men with Y chromosomes closely matching his own. The two did not know each other, but the similarity between their Y chromosomes suggested there was a 50 per cent chance that all three had the same father, grandfather or great-grandfather.

OK, so this particular situation must be pretty rare. But it is a good example of a case where a parent and child may have divergent interests with respect to genetic information. On the obvious level, the son wants to discover his father's identity while the father may want to conceal it. On the not-so-obvious level, a grandfather may want to find children that his son may have fathered, irrespective of the father's wishes. The father in question might even be dead, might have specified in his will his wishes for all sperm donations to remain private, but a grandfather can easily circumvent those wishes through the simple expedient of publicizing his DNA profile.

Families with inherited genetic conditions are already dealing with these privacy issues, such as mothers who don't want a Huntington's test and daughters who get it anyway, revealing the mother's status (my post earlier this year, referring to Amy Harmon's NY Times article). Whole-genome scans for most people will not reveal the same tragic level of risk, but will generate hundreds of smaller questions -- like a load of tiny skeletons-in-the-closet.

This week in Science, Collins and coauthor William Lowrance expand on the problem. Their "Policy Forum" article notes existing U.S. federal law and regulations concerning personal data and the problems that genomic information is likely to generate in the current legal context.

Until recently, most genomic research used data and biospecimens obtained fairly directly, from the data subjects themselves or clinical repositories or specialized research collections. This will continue, as it has many advantages. But now, in efforts to increase the range and quantity of data, large-scale research platforms are being built that assemble, organize, and store data, and sometimes biospecimens, and then distribute these to researchers (see figure). The advantages of such platforms, in addition to scale, are that they can be a robust staging-point for screening data quality, fostering uniformity of data format, and facilitating analysis. Some platforms accumulate data directly (as the Framingham Heart Study does); others assemble them from a variety of sources (as The Cancer Genome Atlas, the Genetic Association Information Network, and the Wellcome Trust Case Control Consortium do and U.K. Biobank will) (7). Among the design and governance issues are whether and how to de-identify the data and at what stages to conduct scientific and ethics review.
These new data flows, genomewide analyses, and novel arrangements such as the Informed Cohort scheme recently proposed by Kohane et al. (8) are relatively uncharted territory with respect to human subjects and privacy considerations. Precedent doesn't provide sufficient guidance. For example, the Human Genome and HapMap Projects have genotyped DNA from only a few hundred carefully selected people who prospectively consented to the analysis and to open publication after thorough explanation, discussion, and community consultation. The projects have been scrutinized closely all along. But when the data relate to more people (by orders of magnitude) or to retrospective analysis of biospecimens, then for pragmatic reasons such painstaking selection, consent negotiation, and scrutiny can't generally be achieved (Lowrance and Collins 2007:600).

The article does not really arrive at any conclusions about what should be done -- Lowrance and Collins limit themselves to a fairly dry listing of potential problems and conditions leading to them. Throughout, they emphasize the reliance of the current regulations on "de-identification" -- that is, the removal of most identifying information from sequences or samples. Under today's U.S. guidelines, data that have had identifying information removed may be used quite broadly without further consideration of human subjects protections:

Construal of genomic "human subject." If data have been de-identified but include large amounts of genetic information, are the individuals still considered "human subjects"? The answer has important implications for consent, ethics review, and safeguards. McGuire and Gibbs have urged that "genomic sequencing studies should be recognized as human-subjects research and brought unambiguously under the protection of existing federal legislation" (22), but this could be unnecessarily extreme. In the United States, the Office of Human Research Protections considers that data or biospecimens collected for one purpose but then key-coded and used secondarily for research are not "individually identifiable," and therefore the research is not human-subjects research (7). This is a strong incentive to support de-identification and to de-identify data (Lowrance and Collins 2007:602).

Lowrance and Collins mention that "de-identification" is by no means simple when applied to substantial parts of genomes, particularly when accompanied by phenotypic data such as redacted medical histories. Routine data-mining techniques would be sufficient to identify individuals within medical research studies; an individual genome profile may be matched to a name without any need for a "key" if the information is unique enough.
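A back-of-the-envelope sketch shows how quickly "unique enough" arrives. It assumes, unrealistically, that each marker has three equally likely genotypes and that markers are independent; real markers are neither, so the counts below are a floor, not an estimate. The function name and the population figures are my own illustrative choices.

```python
def min_markers_for_uniqueness(population, genotypes_per_marker=3):
    """Smallest n for which genotypes_per_marker**n exceeds population.

    Idealized assumption: independent markers, equally likely genotypes.
    """
    n = 0
    combos = 1
    while combos <= population:
        combos *= genotypes_per_marker
        n += 1
    return n

# Hypothetical round numbers, for illustration only.
print(min_markers_for_uniqueness(100_000))        # a large study cohort
print(min_markers_for_uniqueness(7_000_000_000))  # roughly everyone alive
```

Under those assumptions, a couple of dozen markers already yield more possible profiles than there are people, which is why stripping the name from a genome-scale record does so little.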

I favor the protection of individual privacy over greater research access to research data, particularly since DNA sampling and data retention by governmental agencies have become increasingly routine. In a post directly before her Personal Genome Project Q&A, Hsien-Hsien Lei wrote "Police want to collect abandoned DNA from everyone," noting that UK police will soon have authority to collect DNA with the same legal standing as trash -- if you throw it away, it's not private. We have to assume that governments will keep multiple databases of DNA barcodes for people, that these will include other personal information, and that they will be insecure. One may argue that most of the privacy threat actually comes from these other databases, and that personal genome information adds relatively little. Nevertheless, it would be better to add nothing at all, or to generate new models accentuating security.

Since I've been thinking about information theory a lot lately, I can't help but think that some kind of cryptographic solution should be applied -- so that nobody can read a person's sequence data without her private key. A person might choose to opt-in to research studies or other projects that require genotyping data, but still the sequence would be secured by encryption.
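To make the idea concrete, here is a toy sketch of a sequence that is unreadable without its owner's key. Several things here are assumptions of mine: it uses a single symmetric secret key as a stand-in for the private key, and because Python's standard library ships no vetted cipher, it builds a demonstration-only stream cipher from SHA-256 in counter mode with an HMAC tag. A real system would use a reviewed cryptographic library, not this.

```python
import hashlib
import hmac
import secrets

def _subkey(key, label):
    # Derive separate keys for encryption and authentication.
    return hashlib.sha256(label + key).digest()

def _keystream(key, nonce, length):
    # Toy keystream: SHA-256 over key, nonce, and a running counter.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def encrypt_sequence(key, sequence):
    """Return nonce + ciphertext + tag for an ASCII sequence string."""
    nonce = secrets.token_bytes(16)
    data = sequence.encode("ascii")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(data))
    cipher = bytes(a ^ b for a, b in zip(data, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + cipher, hashlib.sha256).digest()
    return nonce + cipher + tag

def decrypt_sequence(key, blob):
    """Recover the sequence, or raise ValueError if the key is wrong."""
    nonce, cipher, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered data")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, stream)).decode("ascii")

# The participant holds the key; a database would hold only the blob.
owner_key = secrets.token_bytes(32)
stored = encrypt_sequence(owner_key, "GATTACA")
print(decrypt_sequence(owner_key, stored))
```

Opting in to a study would then amount to handing that study the key, or a copy re-encrypted for it, while everything else in the database stays opaque.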

The objection to such an approach is that large-scale, long-term studies of health attributes require samples of many thousands -- even tens or hundreds of thousands -- of people. Today, these datasets are routinely de-identified and dispersed around the world to researchers involved with many different projects. There is little chance of centralized control over this information after it is dispersed -- and Lowrance and Collins describe the potential problems with changing the system. With so many participants, the genotype data are a tempting target for black-hats. In any very large-scale study, in which hundreds of researchers have access to de-identified data, there are many chances for unscrupulous researchers to steal information or put it in situations where theft by outsiders may occur.

But practices can be implemented to reduce the risk of data loss or theft. For one thing, the main reason those studies need so many participants is that they are waiting for people to have rare adverse health events, and don't want to wait so long for results. So they really only need to know genotype data for the small group of people who have these conditions. If decryption is restricted to such small groups of study participants, the risk of unauthorized data access would be greatly reduced.
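A minimal sketch of that restriction, assuming each participant's record is locked under its own key and that some escrow holds the keys. The function names, the cohort size, and the number of cases are all hypothetical, picked only to show the proportions involved.

```python
import secrets

def enroll(participant_ids):
    """Create one independent 32-byte key per participant (the escrow)."""
    return {pid: secrets.token_bytes(32) for pid in participant_ids}

def release_keys(escrow, case_ids):
    """Hand researchers only the keys of participants who had the event."""
    return {pid: escrow[pid] for pid in case_ids if pid in escrow}

def exposed_fraction(escrow, released):
    """Share of the cohort whose records researchers can actually read."""
    return len(released) / len(escrow)

# Hypothetical cohort: 100,000 enrolled, 500 of whom develop the condition.
escrow = enroll(range(100_000))
released = release_keys(escrow, range(500))
print(exposed_fraction(escrow, released))
```

In this made-up cohort, a researcher who leaks everything they can read exposes half a percent of the participants instead of all of them.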

No system is perfectly safe, but in this case the agglomeration of data from thousands or millions of individuals in single databases leads to risks that scale nonlinearly with database size. So reducing the size of data chunks available to any one person may be a significant protective step.


Lowrance WW, Collins FS. 2007. Identifiability in genomic research. Science 317:600-602. doi:10.1126/science.1147699

Check E. 2007. Celebrity genomes alarm researchers. Nature 447:358-359. doi:10.1038/447358a