Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees
Open Access
- 1 April 2002
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 18 (4) , 536-545
- https://doi.org/10.1093/bioinformatics/18.4.536
Abstract
Motivation: Gene expression data clustering provides a powerful tool for studying functional relationships of genes in a biological process. Identifying correlated expression patterns of genes represents the basic challenge in this clustering problem.
Results: This paper describes a new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from graph theory. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have demonstrated that, although the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages of representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which are otherwise highly computationally challenging; and (2) because MST-based clustering does not depend on the detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms in a software package, EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). To demonstrate its effectiveness, we have tested it on three data sets: expression data from the yeast Saccharomyces cerevisiae, expression data from the response of human fibroblasts to serum, and Arabidopsis expression data from the response to chitin elicitation. The test results are highly encouraging.
Availability: EXCAVATOR is available on request from the authors.
Contact: xyn@ornl.gov
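The idea that each cluster corresponds to a subtree of the MST can be illustrated with a minimal sketch. It assumes Euclidean distance between expression profiles and uses the simplest partitioning rule, cutting the k-1 longest tree edges. The abstract does not spell out the authors' algorithms, so this should not be read as the EXCAVATOR implementation; the function names `minimum_spanning_tree` and `mst_clusters` are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph over `points`.

    Assumption: edge weight is the Euclidean distance between profiles.
    Returns a list of (weight, i, j) tree edges.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each point
    best_from = [-1] * n         # tree vertex providing that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        # Relax the links from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def mst_clusters(points, k):
    """Partition the MST into k subtrees by removing its k-1 longest edges.

    Returns one integer cluster label per point.
    """
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[:max(len(edges) - (k - 1), 0)]

    # Union-find over the kept edges recovers the connected subtrees.
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


# Toy data (made up for illustration): two well-separated groups of profiles.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(profiles, 2))
```

Because the partition depends only on tree edges, not on cluster centroids, a rule like this can follow elongated or irregular clusters. Other distance measures, such as a correlation-based one, could be substituted for `math.dist` in the sketch.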