A Classification-Based Machine Learning Approach for the Analysis of Genome-Wide Expression Data
Open Access
- 1 March 2003
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 13 (3) , 503-512
- https://doi.org/10.1101/gr.104003
Abstract
Three important areas of data analysis for global gene expression analysis are class discovery, class prediction, and finding dysregulated genes (biomarkers). The clinical application of microarray data will require marker genes whose expression patterns are sufficiently well understood to allow accurate predictions on disease subclass membership. Commonly used methods of analysis include hierarchical clustering algorithms, t-, F-, and Z-tests, and machine learning approaches. We describe an approach called the maximum difference subset (MDSS) algorithm that combines classification algorithms, classical statistics, and elements of machine learning and provides a coherent framework. By integrating prediction accuracy, the MDSS algorithm learns the critical threshold of statistical significance (the α or P-value), eliminating the arbitrariness of setting a threshold of statistical significance and minimizing the effect of the normality assumptions. To reduce the false positive rate and to increase external validity of the predictive gene set, a jackknife step is used. This step identifies and removes genes in the initial MDSS with low combined predictive utility. The overall MDSS provides a prediction that is less dependent on an arbitrary study design (sample inclusion or exclusion) and should thus have high external validity. We demonstrate that this approach, unlike other published methods, identifies biomarkers capable of predicting the outcome of anthracycline-cytarabine chemotherapy in cases of acute myeloid leukemia. By incorporating two criteria, statistical significance and predictive utility, the approach learns the significance level relevant for a given data set. The MDSS approach can be used with any test and classifier operator pair.
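The idea of learning the significance threshold from prediction accuracy and then applying a jackknife step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every function name below is hypothetical. A two-sample Z-test (normal approximation) stands in for the test operator and a nearest-centroid rule for the classifier. The jackknife is read here as "keep only genes that stay significant when any one sample is left out", which is an assumption. Genes are selected on all samples before leave-one-out scoring, so the reported accuracy is optimistic.

```python
import math
import random
from statistics import mean, variance


def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 1.0 if mean(a) == mean(b) else 0.0
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(X, y, alpha):
    """Indices of genes whose two-class p-value is at most alpha."""
    first = sorted(set(y))[0]
    a = [row for row, lab in zip(X, y) if lab == first]
    b = [row for row, lab in zip(X, y) if lab != first]
    return [g for g in range(len(X[0]))
            if z_test_pvalue([r[g] for r in a], [r[g] for r in b]) <= alpha]


def predict(train_X, train_y, x, genes):
    """Nearest-centroid prediction restricted to the given genes."""
    best_label, best_dist = None, math.inf
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        dist = sum((x[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of the classifier on a fixed gene subset."""
    if not genes:
        return 0.0
    hits = 0
    for i in range(len(X)):
        pred = predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], genes)
        hits += pred == y[i]
    return hits / len(X)


def learn_alpha(X, y, alphas):
    """Return (accuracy, alpha, genes) for the best-scoring threshold."""
    best = (-1.0, None, [])
    for alpha in alphas:
        genes = select_genes(X, y, alpha)
        acc = loo_accuracy(X, y, genes)
        if acc > best[0]:  # ties keep the earlier (stricter) alpha
            best = (acc, alpha, genes)
    return best


def jackknife_filter(X, y, genes, alpha):
    """Keep genes that remain significant with each sample left out."""
    kept = set(genes)
    for i in range(len(X)):
        kept &= set(select_genes(X[:i] + X[i + 1:], y[:i] + y[i + 1:], alpha))
    return sorted(kept)


def make_toy_data(seed=0, n_per_class=10, n_genes=30, n_informative=3,
                  shift=4.0):
    """Synthetic two-class data; only the first genes differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += shift * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    acc, alpha, genes = learn_alpha(X, y, [0.001, 0.01, 0.05, 0.1])
    print("learned alpha:", alpha, "accuracy:", acc)
    print("genes after jackknife:", jackknife_filter(X, y, genes, alpha))
```

Because the abstract states that any test and classifier pair can be used, `z_test_pvalue` and `predict` are deliberately isolated so that either could be swapped for a different operator without touching the threshold search or the jackknife step.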