Modeling ChIP Sequencing In Silico with Applications

Open Access

22 August 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 4 (8) , e1000158
https://doi.org/10.1371/journal.pcbi.1000158

Abstract

ChIP sequencing (ChIP-seq) is a new method for genome-wide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.

ChIP-seq is an apt combination of chromatin immunoprecipitation and next-generation sequencing to identify transcription factor binding sites in vivo on the whole-genome scale. Since its advent, this new method has generated much excitement in the field of functional genomics. Proper computational modeling of the ChIP-seq process is needed for both data scoring and determination of adequate sequencing depth, as it provides the computational foundation for analyzing ChIP-seq data. In our study, we characterize ChIP-seq data and present in silico ChIP sequencing, a computational method to simulate the experimental outcome. On the basis of our data characterization, we observed transcription factor binding sites with excessive enrichment of sequence tags. Our simulation results reveal that neither the genomic background nor the binding sites are uniformly distributed. On the basis of our simulation results, we propose a statistical procedure that uses the more realistic genomic background model to identify binding sites in ChIP-seq data.
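The idea of placing tags onto the genome under an assumed background distribution can be illustrated with a small sketch. This is not the authors' implementation: the function names (`simulate_tags`, `dispersion`), the bin-based genome, the gamma shape of 0.5, and the fixed extra weight given to binding-site bins are all assumptions chosen only for illustration. The sketch draws a per-bin sampling weight (constant for a uniform background, gamma-distributed for a nonuniform one), adds weight at hypothetical binding-site bins, and then assigns each tag to a bin in proportion to those weights.

```python
import random
from collections import Counter


def simulate_tags(n_bins, n_tags, background="gamma", shape=0.5,
                  site_bins=(), site_weight=50.0, seed=0):
    """Place n_tags onto n_bins genomic bins and return per-bin tag counts.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    if background == "uniform":
        # Every bin is equally likely to receive a background tag.
        weights = [1.0] * n_bins
    else:
        # Gamma-distributed weights with mean 1 (shape * scale = 1),
        # so some bins attract far more background tags than others.
        weights = [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_bins)]
    # Hypothetical binding sites: add a fixed extra weight to those bins.
    for b in site_bins:
        weights[b] += site_weight
    # Assign each tag to a bin with probability proportional to its weight.
    hits = Counter(rng.choices(range(n_bins), weights=weights, k=n_tags))
    return [hits.get(b, 0) for b in range(n_bins)]


def dispersion(counts):
    """Variance-to-mean ratio of tag counts (about 1 for a Poisson-like background)."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean


if __name__ == "__main__":
    uniform = simulate_tags(2000, 10000, background="uniform")
    gamma = simulate_tags(2000, 10000, background="gamma")
    print("uniform background dispersion:", round(dispersion(uniform), 2))
    print("gamma background dispersion:  ", round(dispersion(gamma), 2))
```

Under these assumptions, the gamma background yields a much larger variance-to-mean ratio than the uniform one, which is the kind of overdispersion a uniform background model cannot reproduce. Comparing simulated count distributions with observed ones is one way such a background model could be assessed.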
