Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering

Open Access
Abstract
The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77–i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized. 
For a long time, it was believed that the control of processes in living organisms is performed almost exclusively by proteins. Only recently have scientists learned that a further class of molecules, namely special RNAs, plays an important role in cell control. As a consequence, research on such RNAs has received increasing attention over the last few years. These RNAs are called noncoding RNAs (ncRNAs) because, unlike most other RNAs, they do not code for proteins. Thanks to recent advances, many potential new ncRNAs can be predicted by comparing the genomes of related organisms. Technically, comparing such RNAs is challenging and computationally expensive, since related ncRNAs often show only weak similarity at the sequence level but share similar structures. In this paper, we present the new method LocARNA for fast and accurate comparison of RNAs with respect to both their sequence and their structure. Using this method, we define a distance measure between pairs of ncRNAs based on sequence and structure. This measure is then used to cluster the RNAs, so that groups of similar RNAs can be identified in large unorganized sets of RNAs. The final aim of such a comparison is to identify new classes of ncRNAs. We applied our clustering procedure to a previously published set of 3,332 predicted ncRNAs in the C. intestinalis genome. In addition to rediscovering known classes of RNAs, e.g., tRNAs, the method predicts microRNA candidates and suggests several novel, experimentally uncharacterized classes of ncRNAs. For verification, we clustered about 4,000 RNAs from RFAM, a large database of RNAs that are already classified into families. Our results show good performance of the presented structure-based clustering approach.
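The idea of turning pairwise comparisons into a distance measure and then grouping similar RNAs can be illustrated with a minimal sketch. It is not the LocARNA implementation: the pairwise scores are invented toy values rather than alignment output, the score-to-distance conversion (maximum score minus score) and the choice of average-linkage clustering with a fixed cutoff are assumptions made for illustration, and the names `scores_to_distances` and `average_linkage` are hypothetical.

```python
from itertools import combinations


def scores_to_distances(scores):
    """Convert pairwise similarity scores to distances (assumed: d = max - score)."""
    top = max(scores.values())
    return {frozenset(pair): top - s for pair, s in scores.items()}


def average_linkage(names, dist, cutoff):
    """Merge the two closest clusters until the closest pair is farther than cutoff."""
    clusters = [frozenset([n]) for n in names]

    def cluster_distance(a, b):
        # Average distance over all cross-cluster pairs of sequences.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_distance(*ab))
        if cluster_distance(a, b) > cutoff:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Toy pairwise scores (invented for illustration, not real alignment scores).
toy_scores = {
    ("tRNA_1", "tRNA_2"): 90,
    ("mir_1", "mir_2"): 80,
    ("tRNA_1", "mir_1"): 20,
    ("tRNA_2", "mir_1"): 25,
    ("tRNA_1", "mir_2"): 15,
    ("tRNA_2", "mir_2"): 10,
}
names = ["tRNA_1", "tRNA_2", "mir_1", "mir_2"]

distances = scores_to_distances(toy_scores)
clusters = average_linkage(names, distances, cutoff=30)
for cluster in clusters:
    print(sorted(cluster))
```

With these toy values the two tRNA-like entries and the two microRNA-like entries end up in separate clusters. In a real analysis the scores would come from pairwise sequence-structure alignments, and the cutoff (or an alternative way of cutting the cluster tree) would need to be chosen from the data.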