A Feature-Based Approach to Modeling Protein–DNA Interactions
Open Access
- 22 August 2008
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 4 (8) , e1000154
- https://doi.org/10.1371/journal.pcbi.1000154
Abstract
Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position-specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also develop a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/.
Transcription factor (TF) protein binding to its DNA target sequences is a fundamental physical interaction underlying gene regulation. Characterizing the binding specificities of TFs is essential for deducing which genes are regulated by which TFs. Recently, several high-throughput methods were developed that measure sequences enriched for TF targets genome-wide. Since TFs recognize relatively short sequences, much effort has been directed at developing computational methods that identify enriched subsequences (motifs) from these sequences. However, little effort has been directed towards improving the representation of motifs. In practice, available motif-finding software uses the position-specific scoring matrix (PSSM) model, which assumes independence between different motif positions. We present an alternative, richer model, called the feature motif model (FMM), that enables the representation of a variety of sequence features and captures dependencies that exist between binding site positions. We show how FMMs explain TF binding data better than PSSMs on both synthetic and real data. We also present a motif finder algorithm that learns FMM motifs from unaligned promoter sequences and show how de novo FMMs, learned from binding data of the human TFs c-Myc and CTCF, reveal intriguing insights about their binding specificities.
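The contrast drawn above between a PSSM and a log-linear feature model can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy two-position motif, the feature weights, and the brute-force normalization (feasible only for very short motifs) are all assumptions made for illustration.

```python
import itertools
import math

ALPHABET = "ACGT"

def pssm_log_prob(seq, pssm):
    """Log-probability under a PSSM: each position contributes independently."""
    return sum(math.log(pssm[i][base]) for i, base in enumerate(seq))

def feature_score(seq, features):
    """Sum the weights of all features present in seq.

    Each feature is (weight, {position: base, ...}); a feature holds only if
    every listed position carries the listed base, so it may span positions.
    """
    return sum(w for w, pattern in features
               if all(seq[pos] == base for pos, base in pattern.items()))

def loglinear_log_prob(seq, features, length):
    """Normalized log-probability, enumerating all 4**length sequences (toy sizes only)."""
    log_z = math.log(sum(math.exp(feature_score("".join(s), features))
                         for s in itertools.product(ALPHABET, repeat=length)))
    return feature_score(seq, features) - log_z

# Hypothetical toy motif: sites are mostly "AA" or "CC", rarely "AC" or "CA".
# A PSSM sees only the per-position frequencies, which are identical for A and C.
toy_pssm = [{"A": 0.45, "C": 0.45, "G": 0.05, "T": 0.05}] * 2

# Two pairwise features (weights chosen arbitrarily) capture the dependency.
toy_features = [(3.0, {0: "A", 1: "A"}), (3.0, {0: "C", 1: "C"})]

pssm_aa = pssm_log_prob("AA", toy_pssm)
pssm_ac = pssm_log_prob("AC", toy_pssm)
fmm_aa = loglinear_log_prob("AA", toy_features, 2)
fmm_ac = loglinear_log_prob("AC", toy_features, 2)
```

With these toy numbers the PSSM assigns "AA" and "AC" the same probability, because it cannot represent the coupling between the two positions, whereas the feature-based model assigns "AA" a higher probability than "AC".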