Pseudocounts for transcription factor binding sites
Open Access
- 23 December 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 37 (3) , 939-944
- https://doi.org/10.1093/nar/gkn1019
Abstract
To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount values are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses.
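The pseudocount scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pwm_with_pseudocount`, the base-2 logarithm, and the uniform default background are all choices made here for the example. The corrected frequency is assumed to take the common form (count + pseudocount × background) / (N + pseudocount), which matches the abstract's wording that a fraction of the pseudocount proportional to the background composition is added to each element.

```python
import math

BASES = "ACGT"

def pwm_with_pseudocount(counts, background=None, pseudocount=0.8):
    """Build a log-likelihood-ratio PWM from per-position base counts.

    counts      : list of dicts, one per position, mapping base -> count
    background  : dict mapping base -> background probability
                  (assumed uniform if omitted; an assumption of this sketch)
    pseudocount : total pseudocount per position; 0.8 is the value the
                  abstract suggests for practical use
    """
    if background is None:
        background = {b: 0.25 for b in BASES}

    pwm = []
    for column in counts:
        n = sum(column.get(b, 0) for b in BASES)  # number of sites at this position
        row = {}
        for b in BASES:
            # Each base receives a share of the pseudocount proportional
            # to its background frequency.
            freq = (column.get(b, 0) + pseudocount * background[b]) / (n + pseudocount)
            # Log likelihood ratio against the background (log base 2 assumed).
            row[b] = math.log2(freq / background[b])
        pwm.append(row)
    return pwm

# Hypothetical counts from eight aligned binding sites, two positions long.
example_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 2, "C": 2, "G": 3, "T": 1},
]
example_pwm = pwm_with_pseudocount(example_counts)
```

Without the pseudocount, the bases that never appear at the first position would have a frequency of zero and an undefined log ratio; with it, every element stays finite, and the corrected frequencies at each position still sum to one.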