A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge
Open Access
- 12 October 2006
- Research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 22 (24), 3016–3024
- https://doi.org/10.1093/bioinformatics/btl515
Abstract
Motivation: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into ‘active regions’ (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion, based mainly on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing.

Methodology: In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively).

Results: For the practical sampling and training strategies, we show how the size and noise in the validated training data affect the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the maximally performing gold-standard. This latter result has strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of the genome-scale tiling array experiments.

Supplementary information: The supplementary data are available at

Contact: mark.gerstein@yale.edu
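To make the segmentation idea concrete, the sketch below shows a minimal two-state HMM (inactive/active) decoded with the Viterbi algorithm over a vector of probe intensities. This is not the authors' implementation: the Gaussian emission means, standard deviations and self-transition probability are illustrative placeholders, and in the supervised framework described in the paper they would instead be estimated from a validated gold-standard set.

```python
# Minimal sketch (assumed parameters, not the paper's): two-state HMM
# segmentation of a tiling-array signal track via Viterbi decoding.
import numpy as np

def viterbi_segment(signal,
                    means=(0.0, 2.0),      # assumed emission means (inactive, active)
                    sds=(1.0, 1.0),        # assumed emission std devs
                    stay_prob=0.99):       # assumed self-transition probability
    """Return a 0/1 state call for every probe in `signal` (1 = active)."""
    signal = np.asarray(signal, dtype=float)
    n, k = len(signal), 2

    # Log-domain model parameters.
    log_pi = np.log(np.array([0.5, 0.5]))                    # initial state probs
    log_A = np.log(np.array([[stay_prob, 1 - stay_prob],
                             [1 - stay_prob, stay_prob]]))   # transition matrix
    # Gaussian log-likelihood of each probe intensity under each state.
    log_B = np.stack([
        -0.5 * ((signal - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
        for m, s in zip(means, sds)
    ], axis=1)                                               # shape (n, k)

    # Viterbi dynamic programming.
    delta = np.zeros((n, k))
    backptr = np.zeros((n, k), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A   # rows: previous state, cols: current state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]

    # Trace back the most probable state path.
    states = np.zeros(n, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        states[t] = backptr[t + 1, states[t + 1]]
    return states
```

Contiguous runs of state 1 in the returned path correspond to candidate ‘active regions’ (transfrags or putative binding sites). A generalized HMM, as mentioned in the abstract, would additionally model the length distribution of those runs explicitly, which this plain-HMM sketch does not.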
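The maximum-entropy sampling result can likewise be illustrated with a small sketch: candidate genomic windows are ranked for experimental validation by the Shannon entropy of their binned probe-intensity distribution, so that windows containing very different signal levels are preferred. The window size, bin count and function names here are assumptions for illustration, not the paper's procedure.

```python
# Minimal sketch of entropy-based ranking of candidate validation windows.
import numpy as np

def window_entropy(window, n_bins=10):
    """Shannon entropy (bits) of the intensity histogram of one window."""
    counts, _ = np.histogram(window, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # empty bins contribute 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())

def rank_windows_by_entropy(signal, window_size=200, step=100):
    """Return (start index, entropy) pairs sorted from highest to lowest entropy."""
    signal = np.asarray(signal, dtype=float)
    scored = [(start, window_entropy(signal[start:start + window_size]))
              for start in range(0, len(signal) - window_size + 1, step)]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

Under this scheme the top-ranked windows, which mix high and low intensities, would be the ones sent for medium-scale experimental validation to build the gold-standard training set.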