A Computational Pipeline for High-Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes

Open Access

6 July 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 3 (7), e126
https://doi.org/10.1371/journal.pcbi.0030126

Abstract

Noncoding RNAs (ncRNAs) are important functional RNAs that do not code for proteins. We present a highly efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Here, we report the results of applying this pipeline to Firmicute bacteria. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam). Comparing our motif models with Rfam's hand-curated motif models, we achieve high accuracy in both membership prediction and base-pair–level secondary structure prediction (at least 75% average sensitivity and specificity on both tasks). Of the ncRNA candidates not in Rfam, we find compelling evidence that some of them are functional, and analyze several potential ribosomal protein leaders in depth.

For decades, scientists believed that, with a few key exceptions, RNA played a secondary role in the cell. Recent discoveries have sharply revised this simple picture, revealing widespread, diverse, and surprisingly sophisticated roles for RNA. For example, many bacteria use RNA elements called “riboswitches” to switch various gene activities on or off in response to extremely sensitive detection of specific molecules. Discovery of new functional RNA elements remains a very challenging task, both computationally and experimentally. It is computationally difficult largely because of the importance of an RNA molecule's 3-D structure, and the fact that molecules with very different nucleotide sequences can fold into the same shape. In this paper, we propose a computational procedure, based on comparing the genomes of multiple bacteria, for discovery of novel RNAs. Unlike most previous approaches, ours does not require a letter-by-letter alignment of these diverse genomes, making it more applicable to RNA elements whose structure, but not nucleotide sequence, has been preserved through evolution. In an extensive test on the Firmicutes, a bacterial phylum containing well-studied organisms such as Bacillus subtilis and important pathogens such as the anthrax bacterium, we recover most known noncoding RNA elements and make many novel predictions.
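As an illustration of the base-pair–level accuracy figures quoted in the abstract, the following minimal Python sketch scores a predicted secondary structure against a reference structure. It rests on several assumptions that are not taken from the paper: structures are given as simple dot-bracket strings without pseudoknots, "specificity" is computed as positive predictive value (a common convention in RNA structure evaluation, but possibly not the definition the authors use), and the function names `parse_pairs` and `pair_accuracy` are hypothetical.

```python
def parse_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack = []
    pairs = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def pair_accuracy(reference, predicted):
    """Compute (sensitivity, ppv) of predicted base pairs.

    sensitivity = TP / (pairs in reference)
    ppv         = TP / (pairs in prediction)   # assumed meaning of "specificity"
    """
    ref = parse_pairs(reference)
    pred = parse_pairs(predicted)
    true_pos = len(ref & pred)
    sensitivity = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy hairpin: the prediction misses the innermost reference pair.
    sens, ppv = pair_accuracy("((((...))))", "(((.....)))")
    print(f"sensitivity={sens:.2f} ppv={ppv:.2f}")
```

In this toy case the prediction recovers three of the four reference pairs and predicts no spurious ones, so sensitivity is 0.75 and positive predictive value is 1.0. Averaging such per-motif scores over all evaluated families would give figures comparable in kind to those quoted above, under the stated assumptions.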
