The effects of normalization on the correlation structure of microarray data

Open Access

16 May 2005

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 6 (1), 120
https://doi.org/10.1186/1471-2105-6-120

Abstract

Background: Stochastic dependence between gene expression levels in microarray data is of critical importance for methods of statistical inference that pool test statistics across genes. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify the proposed methods of testing for differentially expressed genes. The potential impact of between-gene correlations on the performance of such methods has yet to be explored.

Results: The paper presents a systematic study of the correlation between the t-statistics associated with different genes. We report the effects of four different normalization methods, using a large set of microarray data on childhood leukemia in addition to several sets of simulated data. Our findings help decipher the correlation structure of microarray data before and after the application of normalization procedures.

Conclusion: A long-range correlation in microarray data manifests itself in thousands of genes that are heavily correlated with a given gene in terms of the associated t-statistics. Normalization methods can significantly reduce the correlation between the t-statistics computed for different genes. Normalization procedures affect both the true correlation, stemming from gene interactions, and the spurious correlation induced by random noise. In real-world biological data sets, normalization procedures are unable to completely remove the correlation between the test statistics. The long-range correlation structure also persists in normalized data.
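The idea of spurious correlation between t-statistics can be illustrated with a small simulation. The sketch below is not the authors' procedure: the data model (a shared additive effect per array on the log scale), the per-array median centering used as the normalization step, and all function names and parameter values are assumptions chosen for illustration. The abstract does not name the four normalization methods that were studied.

```python
import random
import statistics


def t_statistic(x, y):
    """Welch two-sample t-statistic for two lists of log-expression values."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.fmean(x) - statistics.fmean(y)) / se


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_arrays(n_arrays, n_genes, rng, array_sd=1.0, noise_sd=0.5):
    """Assumed model: each array adds one shared random shift to every gene,
    plus independent gene-level noise. No gene is differentially expressed."""
    rows = []
    for _ in range(n_arrays):
        shift = rng.gauss(0.0, array_sd)
        rows.append([shift + rng.gauss(0.0, noise_sd) for _ in range(n_genes)])
    return rows


def median_center(rows):
    """Illustrative normalization: subtract each array's median from its values."""
    centered = []
    for row in rows:
        m = statistics.median(row)
        centered.append([v - m for v in row])
    return centered


def t_stat_correlation(normalize, n_experiments=300, seed=1):
    """Correlation between the t-statistics of gene 0 and gene 1,
    estimated over repeated simulated two-group experiments."""
    rng = random.Random(seed)
    t_gene0, t_gene1 = [], []
    for _ in range(n_experiments):
        rows = simulate_arrays(20, 50, rng)
        if normalize:
            rows = median_center(rows)
        group_a, group_b = rows[:10], rows[10:]
        t_gene0.append(t_statistic([r[0] for r in group_a], [r[0] for r in group_b]))
        t_gene1.append(t_statistic([r[1] for r in group_a], [r[1] for r in group_b]))
    return pearson(t_gene0, t_gene1)


raw_corr = t_stat_correlation(normalize=False)
norm_corr = t_stat_correlation(normalize=True)
print(f"correlation of t-statistics, raw data:        {raw_corr:.2f}")
print(f"correlation of t-statistics, median-centered: {norm_corr:.2f}")
```

In this toy model all between-gene correlation comes from array-level noise, so centering removes nearly all of it. That is the idealized case only; as the abstract states, in real biological data normalization reduces the correlation between test statistics but does not eliminate it.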
