The "dark matter" of the cell

Last week's Nature got most of the media attention, with its papers on the chimpanzee genome. But last week's Science was actually the more important of the two -- a full issue devoted to the decidedly less sexy, but far more significant, topic of RNA function. Moreover, the articles in the issue are marked as free (I can't confirm because I have a subscription).

I've been meaning to review some of the new work on microRNA (miRNA) gene regulation for a while. This short introductory article explains well why this is important stuff, and extends the story beyond miRNA to other kinds of noncoding RNA (ncRNA) sequences and their possible functionality.

Small noncoding microRNAs (miRNAs) have been found in such abundance that they have been christened the "dark matter" of the cell, a view reinforced by an analysis of the small RNAs found in Arabidopsis (pp. 1567 and 1525). The role of miRNAs and of their close cousins small interfering RNAs (siRNAs) in RNA silencing is discussed by Zamore and Haley (p. 1519), and illustrated in the poster pullout in this issue and in research showing that miRNAs can repress the initiation of translation (p. 1573) and, intriguingly, can also increase mRNA abundance (p. 1577). [See also this week's online Science of Aging Knowledge Environment (SAGE KE) and Signal Transduction Knowledge Environment (STKE)]. The phrase "dark matter" could well be ascribed to noncoding RNA in general. The discovery that much of the mammalian genome is transcribed, in some places without gaps (so-called transcriptional "forests"), shines a bright light on this embarrassing plentitude: an order of magnitude more transcripts than genes (pp. 1559, 1564, and 1529). Many of these noncoding RNAs (p. 1527) are conserved across species, yet their functions (if any) are largely unknown: A cell-based screen shows one, NRON, to be a regulator of the transcription factor NFAT (p. 1570). Of course, in some cases it is the act of transcription that is the regulatory event, as in the case of the transcriptional regulation of recombination (p. 1581). Finally, even the coding and base-pairing capacity of RNA can be altered--by RNA editing, in which bases in the RNA are changed on the fly. Analysis of editing enzymes (p. 1534) reveals that the cell-signaling molecule IP6 is required for their editing activity (Riddihough 2005:1507).

On the subject of "What do introns do?", John Mattick provides a short review of noncoding RNA genomics that includes this:

It is also clear that the majority of the genomes of animals is indeed transcribed (12), which suggests that these genomes are either replete with largely useless transcription or that these noncoding RNA sequences are fulfilling a wide range of unexpected functions in eukaryotic biology. These sequences include introns (Fig. 1), which account for at least 30% of the human genome but have been largely overlooked because they have been assumed to be simply degraded after splicing. However, it has been shown that many miRNAs and all known small nucleolar RNAs in animals are sourced from introns (of both protein-coding and noncoding transcripts) (13), and it is simply not known what proportion of the transcribed introns are subsequently processed into smaller functional RNAs. It is possible, and logically plausible, that these sequences are also a major source of regulatory RNAs in complex organisms (20) (Mattick 2005:1528, citations in original).

If all you've ever heard of is mRNA and tRNA, the world has changed, my friend. A lot of that junk DNA you've heard about actually does stuff. Roll it over in your mind: "the majority ... is indeed transcribed."

A perspective by Jean-Michel Claverie hits another essential point:

A few months before the publication of the first drafts of the human genome sequence (1, 2), online bids predicting the number of human protein-coding genes ranged from 30,000 to 150,000 [see (3)]. To the surprise of many (4), initial bioinformatic analyses revealed no more than 35,000 human genes, an estimate that has steadily declined to the present 25,000 genes (5). On the other hand, the largest estimates based on the number of distinct polyadenylated transcript 3'-ends identified through the single-pass sequencing of cDNA libraries (6) [i.e., expressed sequence tags (ESTs)] have not followed a diminishing trend. On the contrary, more transcripts keep being discovered, many of which do not correspond to annotated genes [e.g., (7)], in particular when using the serial analysis of gene expression (SAGE) approach (8) (Claverie 2005:1529).

Estimates of the number of genes keep decreasing, while estimates of the number of transcripts keep increasing! The answer? These expressed sequence tags and other transcriptional products are recognized by their poly-A end sequences (that's a chain of adenines at the 3' end of an RNA). You may remember from molecular biology that messenger RNA has these poly-A tails. But now we know that lots of non-protein-coding RNA is polyadenylated too. The transcript identification project covered in this Science issue identified 181,000 transcripts in mice, covering 62 percent of the mouse genome (FANTOM 2005). Much of this amount consists of noncoding RNA, and much consists of RNA sequences that partially overlap with protein-coding genes but do not include full reading frames.

These results provide a solution to the discrepancy between the number of (protein-coding) genes and the number of transcripts: noncoding polyadenylated mRNA contributes to a large fraction of the 3'-EST sequences (and SAGE tags) subsequently clustered or remaining as singletons. Indeed, the noncoding Xist mRNA is abundantly represented in all EST projects. It is thus likely that sequences of noncoding transcripts have been accumulating in EST databases and have for the most part (including singleton and antisense ESTs) been erroneously interpreted as coming from the 3'-untranslated regions of protein-coding transcripts. Noncoding transcripts originating from intergenic regions, introns, or antisense strands have probably been right before our eyes for 8 years without having been discovered!
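To make the poly-A point concrete, here is a minimal Python sketch of why a poly-A-based capture cannot tell coding from noncoding transcripts. Everything in it is an assumption for illustration: the sequences are invented toy strings, the function names are hypothetical, and the 10-adenine threshold is arbitrary. It is not how any real EST pipeline works.

```python
def poly_a_tail_length(rna: str) -> int:
    """Count the consecutive adenines at the 3' end of an RNA sequence."""
    rna = rna.upper()
    return len(rna) - len(rna.rstrip("A"))


def is_polyadenylated(rna: str, min_tail: int = 10) -> bool:
    """Treat a transcript as polyadenylated if its 3' A-run is long enough.

    The min_tail threshold is an arbitrary illustrative value.
    """
    return poly_a_tail_length(rna) >= min_tail


# Toy transcripts (made up): name -> (sequence, is it protein-coding?)
transcripts = {
    "coding_mRNA": ("AUGGCCUUUGGAUAG" + "A" * 20, True),
    "noncoding_RNA": ("GGCUAGCUUAGGCUC" + "A" * 20, False),
    "no_tail_RNA": ("GGCUAGCUUAGGCUC", False),
}

# A poly-A-based capture keeps anything with a tail, coding or not.
captured = [name for name, (seq, _) in transcripts.items()
            if is_polyadenylated(seq)]
noncoding_captured = [name for name in captured if not transcripts[name][1]]

print("captured:", captured)
print("noncoding among captured:", noncoding_captured)
```

The filter only ever looks at the 3' end, so the polyadenylated noncoding transcript lands in the same pile as the mRNA. That is the misreading the quote above describes.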

An interesting side note is that SNP databases were started by finding variants within these ESTs (e.g. Wang et al. 1997). This may mean that today's SNPs are a witches' brew of sites in coding sequences, sequences that end up in noncoding RNA transcripts, functional but nonconserved regulatory RNA, and who knows what else. I wonder whether we know anything about their functional constraints as a set.

It will be a long time before anyone figures out what all this noncoding RNA does. Here's a hint:

In contrast, the promoter regions of ncRNAs are generally more conserved than the promoters of the protein-coding mRNA, not only between human and mouse but also down in the evolutionary scale to chicken (Fig. 3, B to F), and they contain binding sites for known transcription factors (18). We conclude that the large majority of ncRNAs that we analyzed display positional conservation across species. In considering function, one might conclude that the act of transcription from the particular location is either important or a consequence of genomic structure or sequence (for example, enhancers such as that of the globin locus can act as promoters), the transcript may function through some kind of sequence-specific interaction with the DNA sequence from which it is derived, or many noncoding RNAs have other targets but are evolving rapidly (19, 20) (FANTOM 2005:1562).

And another, suggesting that widespread transcription of the antisense strand of protein-coding genes may contribute to gene regulation:

Antisense transcription (transcription from the opposite strand to a protein-coding or sense strand) has been ascribed roles in gene regulation involving degradation of the corresponding sense transcripts (RNA interference), as well as gene silencing at the chromatin level. Global transcriptome analysis provides evidence that a large proportion of the genome can produce transcripts from both strands, and that antisense transcripts commonly link neighboring "genes" in complex loci into chains of linked transcriptional units. Expression profiling reveals frequent concordant regulation of sense/antisense pairs. We present experimental evidence that perturbation of an antisense RNA can alter the expression of sense messenger RNAs, suggesting that antisense transcription contributes to control of transcriptional outputs in mammals (RIKEN 2005:1564).
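For readers who want to see what "the opposite strand" means at the sequence level, here is a small Python sketch. It assumes the simplest possible case, an antisense RNA that is the exact reverse complement of the whole sense transcript. Real sense/antisense pairs usually overlap only partially. The sequence and function names are hypothetical.

```python
# Watson-Crick pairing for RNA: A-U and G-C.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (read 5' to 3')."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def can_pair(sense: str, anti: str) -> bool:
    """True if the two strands are perfectly complementary end to end."""
    return antisense(sense) == anti.upper()


sense_rna = "AUGGCCUUU"            # toy sense transcript (made up)
anti_rna = antisense(sense_rna)    # its idealized antisense partner

print(sense_rna, "->", anti_rna)
print("can pair:", can_pair(sense_rna, anti_rna))
```

Taking the antisense of the antisense returns the original sequence. In this idealized picture the two transcripts can base-pair along their full length, which is the kind of interaction the RNA interference mechanism mentioned in the quote relies on.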

There's much more in there for people willing to wade through the acronyms and jargon. It's pretty clear that the path from DNA to protein has gotten a lot more complicated in the past few years, and we're uncovering the role of noncoding sequences in the genome.

References:

Claverie J-M. 2005. Fewer genes, more noncoding RNA. Science 309:1529-1530. Full text (free)

The FANTOM Consortium. 2005. The transcriptional landscape of the mammalian genome. Science 309:1559-1563. Full text (free)

Mattick JS. 2005. The functional genomics of noncoding RNA. Science 309:1527-1528. Full text (free)

Riddihough G. 2005. In the forests of RNA dark matter. Science 309:1507. Summary

RIKEN Genome Exploration Research Group et al. 2005. Antisense transcription in the mammalian transcriptome. Science 309:1564-1566. Full text (free)

Wang DG et al. 1997. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280:1077-1082.