Summarizing and correcting the GC content bias in high-throughput sequencing
Open Access
- 9 February 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 40 (10) , e72
- https://doi.org/10.1093/nar/gks001
Abstract
GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation (DNA-seq). The bias is not consistent between samples, and there is no consensus as to the best methods to remove it in a single sample. We analyze regularities in the GC bias patterns, and find a compact description for this unimodal curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias. We propose a model that produces predictions at the base pair level, allowing strand-specific GC-effect correction regardless of the downstream smoothing or binning. These GC modeling considerations can inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq.
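The dependence the abstract describes, fragment count as a function of the GC content of the whole fragment, can be illustrated with a toy calculation. The sketch below is not the authors' model. It assumes a single fixed fragment length, forward-strand fragment start positions only, and no handling of unmappable regions. The function names (gc_counts, gc_rates, predicted_counts) are hypothetical and chosen for illustration.

```python
from collections import Counter


def gc_counts(genome, frag_len):
    """Return the number of G/C bases in the window of length frag_len
    starting at each genome position (sliding-window update)."""
    is_gc = [1 if base in "GCgc" else 0 for base in genome]
    n_windows = len(genome) - frag_len + 1
    if n_windows <= 0:
        return []
    current = sum(is_gc[:frag_len])
    counts = [current]
    for start in range(1, n_windows):
        # Add the base entering the window, drop the base leaving it.
        current += is_gc[start + frag_len - 1] - is_gc[start - 1]
        counts.append(current)
    return counts


def gc_rates(window_gc, fragment_starts):
    """Estimate, for each GC count, the mean number of fragments
    starting per genome position with that GC count."""
    positions_per_gc = Counter(window_gc)
    fragments_per_gc = Counter(window_gc[s] for s in fragment_starts)
    # A Counter returns 0 for GC counts where no fragment was observed.
    return {gc: fragments_per_gc[gc] / n for gc, n in positions_per_gc.items()}


def predicted_counts(window_gc, rates):
    """Predicted fragment count at each position, given only its GC count."""
    return [rates[gc] for gc in window_gc]


if __name__ == "__main__":
    genome = "AAGGCC"           # toy reference sequence (assumption)
    starts = [1, 2, 2, 3]       # toy observed fragment start positions
    window_gc = gc_counts(genome, frag_len=2)
    rates = gc_rates(window_gc, starts)
    print(window_gc)
    print(rates)
    print(predicted_counts(window_gc, rates))
```

Because the prediction is made per position rather than per bin, it can be summed or smoothed over any window afterwards. Dividing observed counts by the predicted counts, after normalizing by the genome-wide mean rate, is one simple way to turn such predictions into a correction.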