Refining motifs by improving information content scores using neighborhood profile search
Open Access
- 27 November 2006
- journal article
- research article
- Published by Springer Nature in Algorithms for Molecular Biology
- Vol. 1 (1) , 23
- https://doi.org/10.1186/1748-7188-1-23
Abstract
The main goal of the motif finding problem is to detect novel, over-represented unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome). The most widely used algorithms for finding motifs obtain a generative probabilistic representation of these over-represented signals and try to discover profiles that maximize the information content score. Although these profiles form a very powerful representation of the signals, the major difficulty arises from the fact that the best motif corresponds to the global maximum of a non-convex continuous function. Popular algorithms like Expectation Maximization (EM) and Gibbs sampling tend to be very sensitive to the initial guesses and are known to converge to the nearest local maximum very quickly. In order to improve the quality of the results, EM is used with multiple random starts or any other powerful stochastic global methods that might yield promising initial guesses (like projection algorithms). Global methods do not necessarily give initial guesses in the convergence region of the best local maximum but rather suggest that a promising solution is in the neighborhood region. In this paper, we introduce a novel optimization framework that searches the neighborhood regions of the initial alignment in a systematic manner to explore the multiple local optimal solutions. This effective search is achieved by transforming the original optimization problem into its corresponding dynamical system and estimating the practical stability boundary of the local maximum. Our results show that the popularly used EM algorithm often converges to sub-optimal solutions which can be significantly improved by the proposed neighborhood profile search. Based on experiments using both synthetic and real datasets, our method demonstrates significant improvements in the information content scores of the probabilistic models. The proposed method also gives the flexibility in using different local solvers and global methods depending on their suitability for some specific datasets.Keywords
This publication has 20 references indexed in Scilit:
- A Stability Boundary Based Method for Finding Saddle Points on Potential Energy SurfacesJournal of Computational Biology, 2006
- Assessing computational tools for the discovery of transcription factor binding sitesNature Biotechnology, 2005
- A uniform projection method for motif discovery in DNA sequencesIEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004
- A Dynamical Trajectory-Based Methodology for Systematically Computing Multiple Optimal Solutions of General Nonlinear Programming ProblemsIEEE Transactions on Automatic Control, 2004
- LOGOS: A MODULAR BAYESIAN MODEL FOR DE NOVO MOTIF DETECTIONJournal of Bioinformatics and Computational Biology, 2004
- Finding motifs in the twilight zoneBioinformatics, 2002
- Profile hidden Markov models.Bioinformatics, 1998
- A systematic search method for obtaining multiple local optimal solutions of nonlinear programming problemsIEEE Transactions on Circuits and Systems I: Regular Papers, 1996
- Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple AlignmentScience, 1993
- Terminal repeller unconstrained subenergy tunneling (trust) for fast global optimizationJournal of Optimization Theory and Applications, 1993