Systematic Learning of Gene Functional Classes From DNA Array Expression Data by Using Multilayer Perceptrons
Open Access
- 1 November 2002
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 12 (11), 1703-1715
- https://doi.org/10.1101/gr.192502
Abstract
Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ∼100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ∼10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily “false” in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the “Borges effect” and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suited for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.
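The iterative procedure mentioned in the abstract (folding false positives back into the class until the set stabilises) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the network architecture, the training protocol, or the stopping rule. The sketch below swaps the multilayer perceptron for a plain nearest-centroid classifier, evaluates on the training data rather than by cross-validation, and stops when no new false positives appear. The names `predict_members` and `expand_class`, and the gene identifiers, are hypothetical.

```python
from typing import Dict, List, Set

Profile = List[float]  # one expression value per array experiment


def centroid(rows: List[Profile]) -> Profile:
    """Column-wise mean of a non-empty list of expression profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Profile, b: Profile) -> float:
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_members(profiles: Dict[str, Profile], members: Set[str]) -> Set[str]:
    """Stand-in classifier (ASSUMPTION: nearest centroid, not an MLP).

    A gene is predicted to belong to the class if its profile is closer to
    the centroid of current members than to the centroid of all other genes.
    """
    pos = centroid([profiles[g] for g in members])
    rest = [p for g, p in profiles.items() if g not in members]
    if not rest:  # every gene is already a member
        return set(profiles)
    neg = centroid(rest)
    return {g for g, p in profiles.items() if sq_dist(p, pos) < sq_dist(p, neg)}


def expand_class(profiles: Dict[str, Profile], seed: Set[str],
                 max_iter: int = 10) -> Set[str]:
    """Repeatedly merge false positives into the class until none remain."""
    if not seed:
        raise ValueError("seed class must contain at least one gene")
    members = set(seed)
    for _ in range(max_iter):
        predicted = predict_members(profiles, members)
        false_positives = predicted - members
        if not false_positives:  # ASSUMPTION: this is the convergence test
            break
        members |= false_positives
    return members


if __name__ == "__main__":
    # Toy data: two well-separated groups of hypothetical genes.
    toy = {
        "g1": [1.0, 1.0], "g2": [1.2, 0.9], "g3": [0.9, 1.1],
        "g4": [-1.0, -1.0], "g5": [-1.1, -0.9], "g6": [-0.9, -1.2],
    }
    print(sorted(expand_class(toy, {"g1", "g2"})))
```

In this toy run the seed class `{g1, g2}` absorbs `g3`, whose profile resembles the seed genes, and then stops growing. With real expression data one would presumably replace `predict_members` with a trained network and held-out evaluation.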