Mixture models with multiple levels, with application to the analysis of multifactor gene expression data
Open Access
- 5 February 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Biostatistics
- Vol. 9 (3) , 540-554
- https://doi.org/10.1093/biostatistics/kxm051
Abstract
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both “within” and “between” cluster profiles. We explore various flexible “within-cluster” parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse “between-cluster” representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.