Theoretical and empirical quality assessment of transcription factor-binding motifs
Open Access
- 4 October 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 39 (3), 808-824
- https://doi.org/10.1093/nar/gkq710
Abstract
Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability in predicting novel binding sites can be far from optimal, owing to the use of a small number of training sites or to an inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program 'matrix-quality', that combines theoretical and empirical score distributions to assess the reliability of PSSMs for predicting TF-binding sites. We applied 'matrix-quality' to estimate the predictive capacity of matrices for bacterial, yeast and mouse TFs. The evaluation of matrices from RegulonDB revealed some poorly predictive motifs, and allowed us to quantify the improvements obtained by applying multi-genome motif discovery. Interestingly, the method reveals differences between global and specific regulators. It also highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip (bacterial and yeast TFs) and ChIP-seq (mouse TFs) experiments. The method presented here has many applications, including selecting reliable motifs before scanning sequences, improving motif collections in TF databases, and evaluating motifs discovered using high-throughput data sets.
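The abstract contrasts theoretical and empirical score distributions for a PSSM. The idea can be illustrated with a minimal Python sketch. It is not the 'matrix-quality' implementation: the toy sites, the uniform background, the pseudocount of 1, the rounding of weights to integers (constant `SCALE`), and all function names are assumptions made for illustration only. The theoretical distribution is computed exactly for an independent-bases background by dynamic programming over matrix columns, and the empirical one by scanning a sequence.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
SCALE = 100  # assumption: weights are stored as integers in units of 1/SCALE nats


def build_pssm(sites, background, pseudo=1.0):
    """Integer log-odds weights from aligned sites (hypothetical helper).

    The pseudocount is distributed according to the background frequencies.
    """
    n, width = len(sites), len(sites[0])
    pssm = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        column = {}
        for base in ALPHABET:
            freq = (counts[base] + pseudo * background[base]) / (n + pseudo)
            column[base] = round(SCALE * math.log(freq / background[base]))
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum of the column weights selected by the bases of one segment."""
    return sum(column[base] for column, base in zip(pssm, segment))


def theoretical_distribution(pssm, background):
    """P(score = s) for a random segment, assuming independent bases."""
    dist = {0: 1.0}
    for column in pssm:
        nxt = {}
        for s, p in dist.items():
            for base in ALPHABET:
                key = s + column[base]
                nxt[key] = nxt.get(key, 0.0) + p * background[base]
        dist = nxt
    return dist


def theoretical_tail(dist, threshold):
    """Theoretical probability of a score >= threshold at one position."""
    return sum(p for s, p in dist.items() if s >= threshold)


def empirical_tail(pssm, sequence, threshold):
    """Observed fraction of windows in `sequence` scoring >= threshold."""
    width = len(pssm)
    scores = [score(pssm, sequence[i:i + width])
              for i in range(len(sequence) - width + 1)]
    return sum(s >= threshold for s in scores) / len(scores)


# Toy data (assumed, not taken from the paper).
sites = ["TTGACA", "TTGACT", "TTGATA", "TTTACA", "CTGACA"]
background = {base: 0.25 for base in ALPHABET}
pssm = build_pssm(sites, background)
dist = theoretical_distribution(pssm, background)
best = max(dist)  # highest score the matrix can assign
```

Comparing `theoretical_tail(dist, t)` with `empirical_tail(pssm, sequence, t)` over a range of thresholds `t` shows where the two distributions diverge. In this toy setting, an empirical tail well above the theoretical one at high scores would suggest that the scanned sequence is enriched in matches to the matrix.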