Identifying marker genes in transcription profiling data using a mixture of feature relevance experts

8 March 2001

journal article
research article
Published by American Physiological Society in Physiological Genomics

Vol. 5 (2) , 99-111
https://doi.org/10.1152/physiolgenomics.2001.5.2.99

Abstract

Transcription profiling experiments permit the expression levels of many genes to be measured simultaneously. Given profiling data from two types of samples, genes that most distinguish the samples (marker genes) are good candidates for subsequent in-depth experimental studies and for developing decision support systems for diagnosis, prognosis, and monitoring. This work proposes a mixture of feature relevance experts as a method for identifying marker genes and illustrates the idea using published data from samples labeled as acute lymphoblastic and myeloid leukemia (ALL, AML). A feature relevance expert implements an algorithm that calculates how well a gene distinguishes samples, reorders genes according to this relevance measure, and uses a supervised learning method [here, support vector machines (SVMs)] to determine the generalization performances of different nested gene subsets. The three feature relevance experts in the mixture examined implement two existing feature relevance measures and one novel measure. For each expert, a gene subset consisting of the top 50 genes distinguished ALL from AML samples as completely as all 7,070 genes. The 125 genes at the union of the top 50s are plausible markers for a prototype decision support system. Chromosomal aberration and other data support the prediction that the three genes at the intersection of the top 50s, cystatin C, azurocidin, and adipsin, are good targets for investigating the basic biology of ALL/AML. The same data were employed to identify markers that distinguish samples based on their labels of T cell/B cell, peripheral blood/bone marrow, and male/female. Selenoprotein W may discriminate T cells from B cells. Results from analysis of transcription profiling data from tumor/nontumor colon adenocarcinoma samples support the general utility of the aforementioned approach.
Theoretical issues such as choosing SVM kernels and their parameters, training and evaluating feature relevance experts, and the impact of potentially mislabeled samples on marker identification (feature selection) are discussed.
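
The procedure outlined in the abstract (score each gene, reorder genes by score, evaluate nested top-ranked subsets, then take the union and intersection of each expert's top genes) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the two relevance scores, the synthetic data, and every function name are hypothetical, and a simple nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library.

```python
import random
import statistics


def signal_to_noise(values_a, values_b):
    """Assumed relevance score: |mean difference| / (sum of std devs)."""
    spread = statistics.pstdev(values_a) + statistics.pstdev(values_b)
    gap = abs(statistics.mean(values_a) - statistics.mean(values_b))
    return gap / (spread + 1e-12)


def mean_difference(values_a, values_b):
    """Assumed relevance score: absolute difference of class means."""
    return abs(statistics.mean(values_a) - statistics.mean(values_b))


def rank_genes(samples, labels, score):
    """Return gene indices ordered from most to least relevant."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append(score(a, b))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    The nearest-centroid rule is a stand-in for the SVM named in the abstract.
    """
    correct = 0
    for i, (x_full, y_true) in enumerate(zip(samples, labels)):
        distances = {}
        for cls in (0, 1):
            rows = [s for j, (s, y) in enumerate(zip(samples, labels))
                    if j != i and y == cls]
            centroid = [statistics.mean(r[g] for r in rows) for g in genes]
            distances[cls] = sum((x_full[g] - c) ** 2
                                 for g, c in zip(genes, centroid))
        predicted = min(distances, key=distances.get)
        correct += predicted == y_true
    return correct / len(samples)


def make_toy_data(n_per_class=10, n_genes=30, n_informative=3, seed=0):
    """Synthetic two-class data; only the first genes carry class signal."""
    rng = random.Random(seed)
    samples, labels = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 6.0 * cls  # shift informative genes in class 1
            samples.append(row)
            labels.append(cls)
    return samples, labels


samples, labels = make_toy_data()
experts = {"signal_to_noise": signal_to_noise,
           "mean_difference": mean_difference}
top_k = 5
top_sets = {}
for name, score in experts.items():
    order = rank_genes(samples, labels, score)
    top_sets[name] = set(order[:top_k])
    for size in (1, 2, 5, 10, 30):  # nested subsets of top-ranked genes
        accuracy = loo_accuracy(samples, labels, order[:size])
        print(f"{name}: top {size} genes, leave-one-out accuracy {accuracy:.2f}")

union = set.union(*top_sets.values())
intersection = set.intersection(*top_sets.values())
print("union of top sets:", sorted(union))
print("intersection of top sets:", sorted(intersection))
```

For brevity the genes are ranked once on all samples before the leave-one-out loop, which lets label information leak into the accuracy estimate; a more careful evaluation would repeat the ranking inside each held-out fold.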
