A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis
Open Access
- 16 September 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (5) , 631-643
- https://doi.org/10.1093/bioinformatics/bti033
Abstract
Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types.
Results: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use.
Contact: alexander.statnikov@vanderbilt.edu
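The one-versus-rest scheme named in the Results reduces a multicategory problem to one binary problem per category and assigns a sample to the category whose binary model scores it highest. The following is a minimal sketch of that reduction only, not the GEMS implementation or the authors' code. It rests on several assumptions: the binary learner is a simple centroid-difference linear scorer standing in for an SVM, the function names are hypothetical, and the six two-gene "expression profiles" and the category labels A, B and C are made up for illustration.

```python
def fit_centroid_scorer(X, y_binary):
    """Stand-in binary learner (NOT an SVM): a linear scorer whose weight
    vector is the difference between the +1 and -1 class centroids."""
    pos = [x for x, t in zip(X, y_binary) if t == 1]
    neg = [x for x, t in zip(X, y_binary) if t == -1]
    n_genes = len(X[0])
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(n_genes)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(n_genes)]
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    # The bias puts the decision boundary midway between the two centroids.
    b = -sum(wj * (p + q) / 2 for wj, p, q in zip(w, mu_pos, mu_neg))
    return w, b


def decision_value(model, x):
    """Signed score of sample x under one binary model."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_one_versus_rest(X, y, fit_binary=fit_centroid_scorer):
    """Train one binary model per category: that category (+1) vs. all others (-1)."""
    models = {}
    for label in sorted(set(y)):
        y_binary = [1 if t == label else -1 for t in y]
        models[label] = fit_binary(X, y_binary)
    return models


def predict_one_versus_rest(models, x):
    """Assign the category whose binary model gives the largest decision value."""
    return max(models, key=lambda label: decision_value(models[label], x))


# Made-up toy data: 3 hypothetical diagnostic categories, 2 genes per sample.
X = [[5.0, 0.0], [6.0, 1.0], [0.0, 5.0], [1.0, 6.0], [-5.0, -5.0], [-6.0, -4.0]]
y = ["A", "A", "B", "B", "C", "C"]

models = fit_one_versus_rest(X, y)
predictions = [predict_one_versus_rest(models, x) for x in X]
```

Passing a different `fit_binary` function (for example, one that trains a binary SVM) leaves the reduction itself unchanged, which is why the scheme is usually described separately from the underlying binary classifier.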