A Feature-Based Approach to Modeling Protein–DNA Interactions

Open Access

22 August 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 4 (8) , e1000154
https://doi.org/10.1371/journal.pcbi.1000154

Abstract

Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position-specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also develop a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/.

Author Summary

Transcription factor (TF) protein binding to its DNA target sequences is a fundamental physical interaction underlying gene regulation. Characterizing the binding specificities of TFs is essential for deducing which genes are regulated by which TFs. Recently, several high-throughput methods were developed that measure sequences enriched for TF targets genome-wide. Since TFs recognize relatively short sequences, much effort has been directed at developing computational methods that identify enriched subsequences (motifs) from these sequences. However, little effort has been directed towards improving the representation of motifs. In practice, available motif-finding software uses the position-specific scoring matrix (PSSM) model, which assumes independence between different motif positions. We present an alternative, richer model, called the feature motif model (FMM), that enables the representation of a variety of sequence features and captures dependencies that exist between binding site positions. We show how FMMs explain TF binding data better than PSSMs on both synthetic and real data. We also present a motif finder algorithm that learns FMM motifs from unaligned promoter sequences and show how de novo FMMs, learned from binding data of the human TFs c-Myc and CTCF, reveal intriguing insights about their binding specificities.
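The contrast between the two representations can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a two-position motif whose sites are mostly "AA" or "CC", scores it with a PSSM (independent positions), and then with a generic log-linear model in which one feature spans both positions. All names (`pssm_log_prob`, `loglinear_log_prob`, `toy_pssm`, `toy_features`) and all numeric weights are hypothetical, and the normalization constant is computed by brute-force enumeration, which is only practical for very short motifs.

```python
import itertools
import math

ALPHABET = "ACGT"


def pssm_log_prob(seq, pssm):
    """Log-probability under a PSSM: each position contributes independently."""
    return sum(math.log(pssm[i][base]) for i, base in enumerate(seq))


def loglinear_log_prob(seq, features):
    """Log-probability under a log-linear model.

    `features` is a list of (weight, predicate) pairs; a sequence's unnormalized
    log-score is the sum of the weights of the features that fire on it.
    """
    def score(s):
        return sum(weight for weight, fires in features if fires(s))

    # Normalization constant by enumerating every sequence of the same length
    # (an illustrative shortcut; it grows as 4**len(seq)).
    z = sum(
        math.exp(score("".join(chars)))
        for chars in itertools.product(ALPHABET, repeat=len(seq))
    )
    return score(seq) - math.log(z)


# Hypothetical PSSM: A and C are equally likely at both positions.
toy_pssm = [
    {"A": 0.49, "C": 0.49, "G": 0.01, "T": 0.01},
    {"A": 0.49, "C": 0.49, "G": 0.01, "T": 0.01},
]

# Hypothetical features, each spanning both positions of the motif.
toy_features = [
    (3.0, lambda s: s == "AA"),
    (3.0, lambda s: s == "CC"),
]

for site in ("AA", "CC", "AC"):
    print(
        site,
        round(math.exp(pssm_log_prob(site, toy_pssm)), 4),
        round(math.exp(loglinear_log_prob(site, toy_features)), 4),
    )
```

Under these assumed parameters the PSSM cannot distinguish "AC" from "AA", because it only sees per-position frequencies, whereas the log-linear model assigns "AA" and "CC" a much higher probability than "AC". This is the kind of dependency between positions that a multi-position feature is meant to capture.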
