A Cautionary Note on using Internal Cross Validation to Select the Number of Clusters
- 1 September 1999
- journal article
- Published by Cambridge University Press (CUP) in Psychometrika
- Vol. 64 (3), 341-353
- https://doi.org/10.1007/bf02294300
Abstract
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
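The split-half procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' exact protocol: it assumes k-means as the clustering method and uses scikit-learn's adjusted Rand index in place of the raw Rand measure; the paper itself does not fix either choice here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def split_half_stability(X, k, seed=0):
    """Split-half cross-validation of a k-cluster solution.

    Cluster half A, assign half B to the nearest Part-A centroid,
    independently cluster half B, and compare the two partitions
    of B with the adjusted Rand index (1.0 = identical partitions).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    A, B = X[idx[:half]], X[idx[half:]]

    # Cluster Part A, then label Part B by nearest A-centroid.
    km_a = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(A)
    labels_b_transfer = km_a.predict(B)

    # Independently cluster Part B.
    labels_b_direct = KMeans(n_clusters=k, n_init=10,
                             random_state=seed).fit_predict(B)

    return adjusted_rand_score(labels_b_transfer, labels_b_direct)


if __name__ == "__main__":
    # Two well-separated synthetic clusters: stability should be near 1
    # regardless of whether k = 2 is actually the "right" choice --
    # which is exactly the kind of behavior the paper cautions about.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.5, (150, 2)),
                   rng.normal(10.0, 0.5, (150, 2))])
    print(split_half_stability(X, k=2))
```

The paper's point is that a high stability score from this kind of procedure is not reliable evidence for the chosen number of clusters, especially with large samples and correlated variables.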