Relation between weight matrix and substitution matrix: motif search by similarity
Open Access
- 28 October 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (7) , 938-943
- https://doi.org/10.1093/bioinformatics/bti090
Abstract
Motivation: The discovery of patterns shared by several sequences that differ greatly is a basic task in sequence analysis, and still a challenge. Several methods have been developed for detecting patterns. Methods commonly used for motif search include the Gibbs sampler, the Expectation-Maximization (EM) algorithm and some intuitive greedy approaches. One cannot guarantee the optimality of the result produced by the Gibbs sampler in a single run. The deterministic EM methods tend to get trapped in local optima. Solutions found by greedy approaches are rarely sufficiently good.
Results: A simple model describing a motif or a portion of a local multiple sequence alignment is the weight matrix model, in which a motif is characterized by position-specific probabilities. Two substitution matrices are proposed to relate sequence similarity to the weight matrix. Combining the substitution matrix and the weight matrix, we examine three typical sets of protein sequences of increasing complexity. At a low score threshold for pair similarity, sliding windows are compared with a seed window to find the score sum, which provides a measure of statistical significance for multiple sequence comparison. Such a similarity analysis reveals many aspects of motifs. Blocks determined by similarity can be used to deduce a primary weight matrix or an improved substitution matrix. The algorithm successfully obtains the optimal solution for the test sets by greedy iteration alone.
Availability: Software and sequence datasets are available on request from the author.
Contact: zheng@itp.ac.cn
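The procedure outlined under Results (compare sliding windows with a seed window, keep those above a score threshold, sum the scores, and deduce a weight matrix from the resulting block) can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not give the two proposed substitution matrices, so a toy match/mismatch score (`identity_sub`) stands in for them. The function names, the choice of one best window per sequence, the pseudocount and the example sequences are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_sub(x, y):
    # Toy substitution score (assumption): 1 for a match, 0 otherwise.
    # The paper's own substitution matrices would be plugged in here.
    return 1 if x == y else 0


def window_score(a, b, sub=identity_sub):
    """Sum of pairwise substitution scores for two equal-length windows."""
    return sum(sub(x, y) for x, y in zip(a, b))


def best_windows(seed, sequences, sub=identity_sub, threshold=0):
    """Slide a window of len(seed) along each sequence and keep the
    best-scoring window if its score reaches the threshold.

    Returns (hits, score_sum), where hits is a list of
    (score, start, window) tuples, one per sequence that passes.
    """
    width = len(seed)
    hits = []
    for seq in sequences:
        best = None
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = window_score(seed, window, sub)
            if best is None or score > best[0]:
                best = (score, start, window)
        if best is not None and best[0] >= threshold:
            hits.append(best)
    return hits, sum(score for score, _, _ in hits)


def weight_matrix(block, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Position-specific probabilities estimated from a block of
    equal-length windows, smoothed with a uniform pseudocount."""
    total = len(block) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(len(block[0])):
        counts = Counter(window[pos] for window in block)
        matrix.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return matrix


# Hypothetical toy sequences, not taken from the paper's test sets.
sequences = ["MKACDEFGH", "TTACDEYGQ", "ACDQFGLLL"]
hits, score_sum = best_windows("ACDEFG", sequences, threshold=3)
pwm = weight_matrix([window for _, _, window in hits])
```

In an iterative scheme of the kind the abstract describes, the weight matrix deduced from the block could in turn be used to rescore windows in the next round. How the paper does this is not stated in the abstract, so that step is left out here.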