Detecting and correcting systematic variation in large-scale RNA sequencing data
Open Access
- 24 August 2014
- journal article
- research article
- Published by Springer Nature in Nature Biotechnology
- Vol. 32 (9), 888-895
- https://doi.org/10.1038/nbt.3000
Abstract
Li et al. identify the top-performing methods to improve cross-site differential gene expression analysis with RNA-seq. High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.