Computational cluster validation in post-genomic data analysis
Open Access
- 1 August 2005
- journal article
- review article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (15) , 3201-3212
- https://doi.org/10.1093/bioinformatics/bti517
Abstract
The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
Availability: The software used in the experiments is available at http://dbkgroup.org/handl/clustervalidation/
Contact: J.Handl@postgrad.manchester.ac.uk
Supplementary information: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkgroup.org/handl/clustervalidation/
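As an illustration of what computational cluster validation can look like in practice, the sketch below computes the silhouette width (Rousseeuw, 1987), one widely used internal validation measure. The abstract does not single out any particular index, so this choice is an assumption made for illustration, and the code is not taken from the authors' software. The function name `silhouette_widths`, the toy points and the two labelings are all hypothetical.

```python
import math


def silhouette_widths(points, labels):
    """Return the silhouette width s(i) of every point (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    n = len(points)
    widths = []
    for i in range(n):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)

        own = labels[i]
        if own not in mean_dist:
            # Point i is alone in its cluster: s(i) = 0 by convention.
            widths.append(0.0)
            continue

        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]   # partition matching the visible structure
bad = [0, 1, 0, 1, 0, 1]    # partition that mixes the two groups

good_score = sum(silhouette_widths(points, good)) / len(points)
bad_score = sum(silhouette_widths(points, bad)) / len(points)
print(f"mean silhouette, matching partition: {good_score:.3f}")
print(f"mean silhouette, mixed partition:    {bad_score:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 suggest that points sit closer to another cluster than to their own. A single index of this kind reflects one notion of cluster quality only, which is in keeping with the abstract's caution about the perils of analytical validation.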