A Cautionary Note on using Internal Cross Validation to Select the Number of Clusters
- 1 September 1999
- journal article
- Published by Cambridge University Press (CUP) in Psychometrika
- Vol. 64 (3), 341-353
- https://doi.org/10.1007/bf02294300
Abstract
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
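The split-half procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' exact protocol: it assumes k-means as the clustering method and uses scikit-learn's adjusted Rand index in place of the raw Rand measure; the paper itself does not fix either choice here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def split_half_stability(X, k, seed=0):
    """Split-half cross-validation of a k-cluster solution.

    Cluster half A, assign half B to the nearest Part-A centroid,
    independently cluster half B, and compare the two partitions
    of B with the adjusted Rand index (1.0 = identical partitions).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    A, B = X[idx[:half]], X[idx[half:]]

    # Cluster Part A, then label Part B by nearest A-centroid.
    km_a = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(A)
    labels_b_transfer = km_a.predict(B)

    # Independently cluster Part B.
    labels_b_direct = KMeans(n_clusters=k, n_init=10,
                             random_state=seed).fit_predict(B)

    return adjusted_rand_score(labels_b_transfer, labels_b_direct)


if __name__ == "__main__":
    # Two well-separated synthetic clusters: stability should be near 1
    # regardless of whether k = 2 is actually the "right" choice --
    # which is exactly the kind of behavior the paper cautions about.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.5, (150, 2)),
                   rng.normal(10.0, 0.5, (150, 2))])
    print(split_half_stability(X, k=2))
```

The paper's point is that a high stability score from this kind of procedure is not reliable evidence for the chosen number of clusters, especially with large samples and correlated variables.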