Modeling ChIP Sequencing In Silico with Applications
Open Access
- 22 August 2008
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 4 (8) , e1000158
- https://doi.org/10.1371/journal.pcbi.1000158
Abstract
ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.
ChIP-seq is an apt combination of chromatin immunoprecipitation and next-generation sequencing to identify transcription factor binding sites in vivo on the whole-genome scale. Since its advent, this new method has generated much excitement in the field of functional genomics. Proper computational modeling of the ChIP-seq process is needed for both data scoring and determination of adequate sequencing depth, as it provides the computational foundation for analyzing ChIP-seq data.
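The contrast between a uniform and a nonuniform background can be illustrated with a small simulation. The sketch below is not the authors' in silico ChIP-seq procedure: it assumes fixed-size genomic windows, draws a per-window tag rate either as a constant (uniform background) or from a gamma distribution, and then samples tag counts with a Poisson step. The function names, the window count, the mean of 2 tags per window, and the gamma shape of 0.5 are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_background(n_windows, mean_tags, shape=None, seed=0):
    """Simulate background tag counts per window.

    shape=None  -> uniform background: every window has the same tag rate.
    shape=float -> nonuniform background: each window's rate is gamma-distributed
                   with the given shape and with mean `mean_tags` (an assumption
                   of this sketch, not a parameter taken from the paper).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_windows):
        if shape is None:
            rate = mean_tags
        else:
            # gammavariate(alpha, beta) has mean alpha * beta
            rate = rng.gammavariate(shape, mean_tags / shape)
        counts.append(sample_poisson(rate, rng))
    return counts


def dispersion(counts):
    """Variance-to-mean ratio; close to 1 for a purely Poisson background."""
    return pvariance(counts) / mean(counts)


uniform_counts = simulate_background(20000, mean_tags=2.0)
gamma_counts = simulate_background(20000, mean_tags=2.0, shape=0.5, seed=1)

print("uniform background dispersion:", round(dispersion(uniform_counts), 2))
print("gamma background dispersion:  ", round(dispersion(gamma_counts), 2))
print("largest window count (uniform, gamma):", max(uniform_counts), max(gamma_counts))
```

Under these assumptions both backgrounds have the same average tag count, but the gamma-driven background is overdispersed and produces a longer right tail of high-count windows, which is the qualitative feature the abstract attributes to real ChIP-seq data.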
In our study, we show the characteristics of ChIP-seq data and present in silico ChIP sequencing, a computational method to simulate the experimental outcome. On the basis of our data characterization, we observed transcription factor binding sites with excessive enrichment of sequence tags. Our simulation results reveal that both the genomic background and the binding sites are not uniform. On the basis of our simulation results, we propose a statistical procedure using the more realistic genomic background model to identify binding sites in ChIP-seq data.
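To see why the choice of background model matters for scoring, one can compare the tail probability of a given tag count under two null models. The sketch below is an assumption-laden illustration rather than the scoring procedure proposed in the paper: it takes a Poisson null for a uniform background and a negative binomial null, which is the count distribution that results when a Poisson rate is itself gamma-distributed. The mean of 2 tags per window and the shape of 0.5 are hypothetical values.

```python
import math


def poisson_tail(k, mean_tags):
    """P(X >= k) when X is Poisson with the given mean (uniform background)."""
    below = sum(
        math.exp(-mean_tags + i * math.log(mean_tags) - math.lgamma(i + 1))
        for i in range(k)
    )
    return max(0.0, 1.0 - below)


def negbin_tail(k, mean_tags, shape):
    """P(X >= k) when X is negative binomial with the given mean and gamma shape.

    This is the marginal count distribution of a Poisson whose rate follows a
    gamma distribution with that shape and mean (gamma-Poisson mixture).
    """
    p = shape / (shape + mean_tags)
    below = sum(
        math.exp(
            math.lgamma(i + shape) - math.lgamma(i + 1) - math.lgamma(shape)
            + shape * math.log(p) + i * math.log(1.0 - p)
        )
        for i in range(k)
    )
    return max(0.0, 1.0 - below)


# Hypothetical example: a window holding 10 tags against a mean background of 2.
observed = 10
print("Poisson null p-value:          ", poisson_tail(observed, 2.0))
print("negative binomial null p-value:", negbin_tail(observed, 2.0, shape=0.5))
```

With these illustrative parameters the same observed count is far less surprising under the overdispersed null than under the Poisson null, so a threshold calibrated on a uniform background would tend to call more background windows as binding sites.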