Tight Clustering: A Resampling‐Based Approach for Identifying Stable and Tight Patterns in Data

28 February 2005

journal article
research article
Published by Oxford University Press (OUP) in Biometrics

Vol. 61 (1), 10-16
https://doi.org/10.1111/j.0006-341x.2005.031032.x

Abstract

In this article, we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated by cluster analysis of microarray experiments. Most current algorithms aim to assign all genes to clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight, and stable clusters of, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes by other genes whose expressions are only loosely compatible with these patterns. "Tight clustering" has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of a hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.
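
The resampling idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified illustration that assumes random K-means initialization (rather than the truncated hierarchical tree the abstract mentions), a single value of k, and a plain threshold on pairwise co-membership (rather than sequential extraction of the tightest clusters). The function names `kmeans`, `comembership`, and `stable_groups`, and the parameters `frac`, `threshold`, and `min_size`, are hypothetical and chosen only for this example.

```python
import random
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's K-means with random initialization; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

def comembership(points, k, n_resamples=20, frac=0.7, seed=0):
    """Fraction of resamples in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(n_resamples):
        # Cluster a random subsample, then classify ALL points by nearest centroid.
        sub = rng.sample(points, int(frac * n))
        centroids = kmeans(sub, k, rng)
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for i, j in counts:
            counts[(i, j)] += labels[i] == labels[j]
    return {pair: c / n_resamples for pair, c in counts.items()}

def stable_groups(co, n, threshold=0.95, min_size=3):
    """Link points whose co-membership >= threshold (union-find).

    Groups smaller than min_size are dropped, so their points stay unassigned.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), v in co.items():
        if v >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_size]

if __name__ == "__main__":
    # Toy data (made up for illustration): two tight blobs plus scattered points.
    rng = random.Random(1)
    blob_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(10)]
    blob_b = [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(10)]
    noise = [(rng.uniform(-3, 8), rng.uniform(-3, 8)) for _ in range(10)]
    data = blob_a + blob_b + noise
    co = comembership(data, k=4)
    print(stable_groups(co, len(data)))
```

The point of the sketch is the design choice the abstract describes: points that are grouped together in nearly every resample form candidate tight clusters, while points whose grouping changes from resample to resample are left out instead of being forced into a cluster.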
