Refining motifs by improving information content scores using neighborhood profile search

Open Access

27 November 2006

journal article
research article
Published by Springer Nature in Algorithms for Molecular Biology

Vol. 1 (1) , 23
https://doi.org/10.1186/1748-7188-1-23

Abstract

The main goal of the motif finding problem is to detect novel, over-represented unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome). The most widely used algorithms for finding motifs obtain a generative probabilistic representation of these over-represented signals and try to discover profiles that maximize the information content score. Although these profiles form a very powerful representation of the signals, the major difficulty arises from the fact that the best motif corresponds to the global maximum of a non-convex continuous function. Popular algorithms like Expectation Maximization (EM) and Gibbs sampling tend to be very sensitive to the initial guesses and are known to converge to the nearest local maximum very quickly. In order to improve the quality of the results, EM is used with multiple random starts or any other powerful stochastic global methods that might yield promising initial guesses (like projection algorithms). Global methods do not necessarily give initial guesses in the convergence region of the best local maximum but rather suggest that a promising solution is in the neighborhood region. In this paper, we introduce a novel optimization framework that searches the neighborhood regions of the initial alignment in a systematic manner to explore the multiple local optimal solutions. This effective search is achieved by transforming the original optimization problem into its corresponding dynamical system and estimating the practical stability boundary of the local maximum. Our results show that the popularly used EM algorithm often converges to sub-optimal solutions which can be significantly improved by the proposed neighborhood profile search. Based on experiments using both synthetic and real datasets, our method demonstrates significant improvements in the information content scores of the probabilistic models. The proposed method also gives the flexibility in using different local solvers and global methods depending on their suitability for some specific datasets.

Keywords

This publication has 20 references indexed in Scilit:

A Stability Boundary Based Method for Finding Saddle Points on Potential Energy Surfaces
Journal of Computational Biology, 2006
Assessing computational tools for the discovery of transcription factor binding sites
Nature Biotechnology, 2005
A uniform projection method for motif discovery in DNA sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004
A Dynamical Trajectory-Based Methodology for Systematically Computing Multiple Optimal Solutions of General Nonlinear Programming Problems
IEEE Transactions on Automatic Control, 2004
LOGOS: A MODULAR BAYESIAN MODEL FOR DE NOVO MOTIF DETECTION
Journal of Bioinformatics and Computational Biology, 2004
Finding motifs in the twilight zone
Bioinformatics, 2002
Profile hidden Markov models.
Bioinformatics, 1998
A systematic search method for obtaining multiple local optimal solutions of nonlinear programming problems
IEEE Transactions on Circuits and Systems I: Regular Papers, 1996
Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment
Science, 1993
Terminal repeller unconstrained subenergy tunneling (trust) for fast global optimization
Journal of Optimization Theory and Applications, 1993