Models for microarray gene expression data

Abstract
This paper describes a general methodology for the analysis of differential gene expression based on microarray data. We first characterize the data with a linear statistical model that accounts for the relevant sources of variation, and then consider estimation of the model parameters. Because microarray studies typically involve thousands of genes, we propose a two-stage method for parameter estimation. The interaction terms for genes and experimental conditions in this model capture all relevant information about differential gene expression in the microarray data. We propose a mixture distribution model for a summary statistic of differential expression that consists of null and alternative component distributions. The mixture model suggests two methods for identifying genes exhibiting differential expression: a frequentist method that identifies distinguished genes, and an empirical Bayes procedure that yields estimated posterior probabilities of differential expression, conditional on the observed microarray readings. An extensive case application involving juvenile cystic kidney disease in mice illustrates the methodology. The application controls for variation arising from array, color channel, experimental condition (tissue type), and gene; the analysis of variance (ANOVA) model includes main effects, which normalize the expression data, and all interaction terms involving genes. The gene expression profile is found to vary by tissue type, as expected, but also by color channel, which was less expected. A concluding section discusses some outstanding research questions related to the analysis of microarray data.
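
As a rough illustration of the structure described above, the model and the empirical Bayes calculation might be written as follows. The notation is assumed here for illustration and is not necessarily that of the paper: $y_{ijkg}$ is the (assumed log-scale) reading for array $i$, color channel $j$, tissue type $k$, and gene $g$; $z_g$ is the summary statistic of differential expression for gene $g$; $p_0$ and $p_1$ are the mixing proportions; and $f_0$ and $f_1$ are the null and alternative component densities. Only two-way interactions with gene are shown, and the split of terms between the two estimation stages is one plausible reading rather than a confirmed one.

```latex
% Assumed ANOVA-type model: main effects plus interactions involving gene
y_{ijkg} = \mu + A_i + C_j + T_k + G_g
           + (AG)_{ig} + (CG)_{jg} + (TG)_{kg} + \varepsilon_{ijkg}

% Two-stage estimation (assumed split):
%   Stage 1: fit the terms that do not involve gene, to normalize the data
r_{ijkg} = y_{ijkg} - \hat{\mu} - \hat{A}_i - \hat{C}_j - \hat{T}_k
%   Stage 2: fit the gene terms separately for each gene g on the residuals
r_{ijkg} = G_g + (AG)_{ig} + (CG)_{jg} + (TG)_{kg} + \varepsilon_{ijkg}

% Mixture model for the summary statistic
f(z) = p_0 \, f_0(z) + p_1 \, f_1(z), \qquad p_0 + p_1 = 1

% Posterior probability of differential expression, by Bayes' rule
\Pr(\text{differentially expressed} \mid z_g)
  = \frac{p_1 \, f_1(z_g)}{p_0 \, f_0(z_g) + p_1 \, f_1(z_g)}
```

In this sketch, the tissue-by-gene terms $(TG)_{kg}$ carry the differential expression signal of primary interest, while a nonzero $(CG)_{jg}$ would correspond to the color channel dependence reported in the application. In the empirical Bayes procedure, $p_0$, $p_1$, $f_0$, and $f_1$ are replaced by estimates obtained from the data.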