Length bias correction for RNA-seq data in gene set analyses
Open Access
- 19 January 2011
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 27 (5) , 662-669
- https://doi.org/10.1093/bioinformatics/btr005
Abstract
Motivation: Next-generation sequencing technologies are being rapidly applied to quantifying transcripts (RNA-seq). However, due to the unique properties of RNA-seq data, the differential expression of longer transcripts is more likely to be identified than that of shorter transcripts with the same effect size. This bias complicates downstream gene set analysis (GSA), because the GSA methods previously developed for microarray data assume that genes with the same effect size have equal probability (power) of being identified as significantly differentially expressed. Since transcript length is not related to gene expression, adjusting for this length dependency in GSA becomes necessary.

Results: In this article, we proposed two approaches for transcript-length adjustment for analyses based on Poisson models: (i) At the individual gene level, we adjusted each gene's test statistic using the square root of transcript length, then tested each gene set using the Wilcoxon rank-sum test. (ii) At the gene set level, we adjusted the null distribution for Fisher's exact test by weighting the identification probability of each gene using the square root of its transcript length. We evaluated these two approaches using simulations and a real dataset, and showed that these methods can effectively reduce the transcript-length biases. The top-ranked GO terms obtained from the proposed adjustments show more overlap with the microarray results.

Availability: R scripts are at http://www.soph.uab.edu/Statgenetics/People/XCui/r-codes/.

Contact: xcui@uab.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
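The two adjustments described in the Results can be illustrated with a small Python sketch. It is not the authors' R code, and several details are assumptions: "adjusting" a statistic is taken to mean dividing it by the square root of transcript length, the rank-sum test uses a normal approximation without tie correction, and the weighted null for Fisher's exact test is replaced by a Monte Carlo draw of differentially expressed genes with probability proportional to the square root of length. All function names and the toy data are hypothetical.

```python
import math
import random


def length_adjusted(stats, lengths):
    """Approach (i), assumed form: divide each statistic by sqrt(length)."""
    return [s / math.sqrt(l) for s, l in zip(stats, lengths)]


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(values, in_set):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    ranks = average_ranks(values)
    n1 = sum(in_set)
    n2 = len(values) - n1
    w = sum(r for r, member in zip(ranks, in_set) if member)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def weighted_enrichment_p(is_de, in_set, lengths, n_draws=2000, seed=0):
    """Approach (ii), Monte Carlo stand-in: one-sided enrichment p-value
    under a null where genes are called DE with weight sqrt(length)."""
    rng = random.Random(seed)
    weights = [math.sqrt(l) for l in lengths]
    n_de = sum(is_de)
    observed = sum(1 for d, m in zip(is_de, in_set) if d and m)
    hits = 0
    for _ in range(n_draws):
        # Weighted sampling without replacement: keep the n_de genes
        # with the largest keys u ** (1 / weight).
        keys = sorted(((rng.random() ** (1.0 / w), i)
                       for i, w in enumerate(weights)), reverse=True)
        overlap = sum(1 for _, i in keys[:n_de] if in_set[i])
        if overlap >= observed:
            hits += 1
    return (hits + 1) / (n_draws + 1)


# Toy data: statistics that differ only because of transcript length.
stats = [4.0, 3.0, 6.0, 2.0, 1.0, 0.5]
lengths = [1600, 900, 3600, 400, 100, 25]
in_set = [True, True, True, False, False, False]

z_raw, p_raw = rank_sum_test(stats, in_set)
z_adj, p_adj = rank_sum_test(length_adjusted(stats, lengths), in_set)

# Treat the three largest raw statistics as the DE calls.
is_de = [True, True, True, False, False, False]
p_weighted = weighted_enrichment_p(is_de, in_set, lengths)
p_uniform = weighted_enrichment_p(is_de, in_set, [1] * len(lengths))
```

In this toy case the gene set consists of the three longest transcripts. The raw statistics rank them highest, but after dividing by the square root of length all six statistics are identical, so the adjusted rank-sum statistic is zero. Likewise, the length-weighted null makes the observed overlap look much less surprising than a null that treats every gene as equally likely to be called.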