Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees

Open Access

1 April 2002

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 18 (4), 536-545
https://doi.org/10.1093/bioinformatics/18.4.536

Abstract

Motivation: Gene expression data clustering provides a powerful tool for studying functional relationships of genes in a biological process. Identifying correlated expression patterns of genes represents the basic challenge in this clustering problem.

Results: This paper describes a new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from graph theory. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have demonstrated that, although the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages of representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) because MST-based clustering does not depend on the detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms in a software package, EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). To demonstrate its effectiveness, we have tested it on three data sets: expression data from the yeast Saccharomyces cerevisiae, expression data on the response of human fibroblasts to serum, and Arabidopsis expression data in response to chitin elicitation. The test results are highly encouraging.

Availability: EXCAVATOR is available on request from the authors.

Contact: xyn@ornl.gov
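The idea of turning clustering into tree partitioning can be illustrated with a minimal Python sketch. It is not EXCAVATOR's implementation, and the abstract does not specify the objective functions its algorithms optimize. The sketch assumes Euclidean distance between expression profiles and uses the simplest partitioning rule, deleting the k-1 longest MST edges so that each remaining subtree is one cluster. The function names (`prim_mst`, `mst_clusters`) and the toy profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prim_mst(points, dist=euclidean):
    """Return the MST edges (weight, i, j) of the complete graph over points,
    built with the O(n^2) form of Prim's algorithm."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Update attachment costs of the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k, dist=euclidean):
    """Partition points into k clusters by deleting the k-1 longest MST edges.
    Returns one integer label per point, numbered by first appearance."""
    n = len(points)
    edges = sorted(prim_mst(points, dist))
    kept = edges[: max(n - k, 0)]   # an MST has n-1 edges; keep the n-k shortest

    # Union-find over the kept edges: each connected subtree is one cluster.
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)

    labels, seen = [], {}
    for i in range(n):
        labels.append(seen.setdefault(find(i), len(seen)))
    return labels


# Toy data: five hypothetical genes measured under three conditions.
profiles = [
    [0.0, 0.1, 0.2],
    [0.1, 0.1, 0.3],
    [5.0, 5.1, 4.9],
    [5.2, 5.0, 5.1],
    [9.9, 0.0, 9.8],
]
print(mst_clusters(profiles, k=3))
```

On the toy data this prints `[0, 0, 1, 1, 2]`: the two long edges that bridge the groups are cut, leaving three subtrees. Cutting the longest edges is only one possible partitioning criterion. Other objectives can be optimized over the same tree, which is what makes the tree representation convenient.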
