Fast Pairwise Structural RNA Alignments by Pruning of the Dynamical Programming Matrix

Open Access

12 October 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 3 (10), e193-1908
https://doi.org/10.1371/journal.pcbi.0030193

Abstract

It has become clear that noncoding RNAs (ncRNA) play important roles in cells, and emerging studies indicate that there might be a large number of unknown ncRNAs in mammalian genomes. There exist computational methods that can be used to search for ncRNAs by comparing sequences from different genomes. One main problem with these methods is their computational complexity, and heuristics are therefore employed. Two heuristics are currently very popular: pre-folding and pre-aligning. However, these heuristics are not ideal, as pre-aligning is dependent on sequence similarity that may not be present and pre-folding ignores the comparative information. Here, pruning of the dynamical programming matrix is presented as an alternative novel heuristic constraint. All subalignments that do not exceed a length-dependent minimum score are discarded as the matrix is filled out, thus giving the advantage of providing the constraints dynamically. This has been included in a new implementation of the FOLDALIGN algorithm for pairwise local or global structural alignment of RNA sequences. It is shown that time and memory requirements are dramatically lowered while overall performance is maintained. Furthermore, a new divide and conquer method is introduced to limit the memory requirement during global alignment and backtrack of local alignment. All branch points in the computed RNA structure are found and used to divide the structure into smaller unbranched segments. Each segment is then realigned and backtracked in a normal fashion. Finally, the FOLDALIGN algorithm has also been updated with a better memory implementation and an improved energy model. With these improvements in the algorithm, the FOLDALIGN software package provides the molecular biologist with an efficient and user-friendly tool for searching for new ncRNAs. The software package is available for download at http://foldalign.ku.dk.

FOLDALIGN is an algorithm for making pairwise structural alignments of RNA sequences. It uses a lightweight energy model and sequence similarity to simultaneously fold and align the sequences. The algorithm can make local and global alignments. The power of structural alignment methods is that they can align sequences where the primary sequences have diverged too much for normal alignment methods to be useful. The structures predicted by structural alignment methods are usually better than the structures predicted by single-sequence folding methods since they can take comparative information into account. The main problem for most structural alignment methods is that they are too computationally expensive. In this paper we introduce the dynamical pruning heuristic that makes the FOLDALIGN method significantly faster without lowering the predictive performance. The memory requirements are also significantly lowered, allowing for the analysis of longer sequences. A user-friendly (though still command-line based) implementation of the algorithm is available at http://foldalign.ku.dk.
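
The pruning idea described above can be illustrated with a much simpler dynamic program than the one FOLDALIGN uses. The sketch below is not the FOLDALIGN recursion: it aligns sequence only, with no structure or energy term, and the function name `pruned_local_alignment`, the scoring values, and the linear threshold `min_score` are all assumptions chosen for illustration. It shows only the general mechanism: a subalignment is stored in the matrix if its score exceeds a minimum that depends on its length, and discarded subalignments can never be extended later.

```python
from typing import Callable, Dict, Tuple


def pruned_local_alignment(
    a: str,
    b: str,
    match: float = 2.0,      # illustrative scores, not FOLDALIGN's
    mismatch: float = -1.0,
    gap: float = -2.0,
    # Hypothetical length-dependent minimum score for a subalignment
    # of n columns; the real threshold in the paper is not shown here.
    min_score: Callable[[int], float] = lambda n: 0.5 * n - 1.0,
) -> Tuple[float, int]:
    """Return (best local score, number of matrix cells kept)."""
    # Sparse matrix: (i, j) -> (score, length) of the best surviving
    # subalignment ending at a[i-1], b[j-1]. Pruned cells are absent.
    cells: Dict[Tuple[int, int], Tuple[float, int]] = {}
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may always start fresh at this cell.
            candidates = [(s, 1)]
            # Extensions are only possible from cells that survived.
            for prev_pos, step in (
                ((i - 1, j - 1), s),
                ((i - 1, j), gap),
                ((i, j - 1), gap),
            ):
                prev = cells.get(prev_pos)
                if prev is not None:
                    candidates.append((prev[0] + step, prev[1] + 1))
            # Pruning: discard candidates that do not exceed the
            # length-dependent minimum score.
            survivors = [c for c in candidates if c[0] > min_score(c[1])]
            if survivors:
                cells[(i, j)] = max(survivors)
                best = max(best, cells[(i, j)][0])
    return best, len(cells)


if __name__ == "__main__":
    seq = "GGGAAACCC"
    print(pruned_local_alignment(seq, seq))
    # Disable pruning by making every score exceed the threshold.
    print(pruned_local_alignment(seq, seq, min_score=lambda n: float("-inf")))
```

Because the constraint is evaluated as the matrix is filled, no prior alignment or folding of the inputs is needed. In this toy setting the saving shows up as a smaller number of stored cells; as with any such heuristic, a threshold that is too strict can discard the subalignment that would have led to the optimal result.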
