Natural similarity measures between position frequency matrices with an application to clustering
Open Access
- 2 January 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 24 (3) , 350-357
- https://doi.org/10.1093/bioinformatics/btm610
Abstract
Motivation: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs.
Results: We show that the asymptotic covariance can be efficiently computed by a two-dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data. It outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons.
Availability: A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.de
Contact: utz.pape@molgen.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
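The abstract refers to score distributions of PFMs under a null model and to combining them by convolution. As a rough illustration of the one-dimensional building block only, the following sketch computes the exact score distribution of a single PFM under an i.i.d. background by convolving one column at a time, and derives the probability of a hit at a given threshold. It is not the authors' implementation and does not reproduce the two-dimensional joint computation for a pair of PFMs. The function names, the pseudocount, the integer scaling of log-odds scores and the toy matrix are all assumptions made for this example.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def pfm_to_int_scores(pfm, background, pseudocount=1.0, scale=10):
    """Turn a PFM (list of columns, each a dict base -> count) into
    integer log-odds scores. Rounding to integers is an assumption of
    this sketch; it keeps the set of reachable scores small."""
    scores = []
    for column in pfm:
        total = sum(column[b] + pseudocount for b in BASES)
        col_scores = {}
        for b in BASES:
            p = (column[b] + pseudocount) / total
            col_scores[b] = round(scale * math.log2(p / background[b]))
        scores.append(col_scores)
    return scores

def score_distribution(scores, background):
    """Exact distribution of the total score of a random word drawn
    from an i.i.d. background, built by convolving column by column."""
    dist = {0: 1.0}
    for col in scores:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for b in BASES:
                nxt[s + col[b]] += p * background[b]
        dist = dict(nxt)
    return dist

def hit_probability(dist, threshold):
    """Probability that a random word scores at or above the threshold."""
    return sum(p for s, p in dist.items() if s >= threshold)

if __name__ == "__main__":
    # Toy three-column PFM, invented for illustration only.
    bg = {b: 0.25 for b in BASES}
    toy = [
        {"A": 8, "C": 0, "G": 0, "T": 0},
        {"A": 0, "C": 0, "G": 8, "T": 0},
        {"A": 4, "C": 0, "G": 0, "T": 4},
    ]
    dist = score_distribution(pfm_to_int_scores(toy, bg), bg)
    print(hit_probability(dist, max(dist)))
```

Under these assumptions, the hit probability at a chosen threshold is the per-position quantity from which expected hit counts follow; a joint treatment of two PFMs at overlapping positions, as the abstract describes, would need the two score distributions to be tracked together rather than separately.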