Extreme re-balancing for SVMs
- 1 June 2004
- journal article
- Published by Association for Computing Machinery (ACM) in ACM SIGKDD Explorations Newsletter
- Vol. 6 (1) , 60-69
- https://doi.org/10.1145/1007730.1007739
Abstract
There are many practical applications where learning from single-class examples is either the only possible solution or has a distinct performance advantage. The first case occurs when obtaining examples of a second class is difficult, e.g., classifying sites of "interest" based on web accesses. The second situation is exemplified by the gene knock-out experiments for understanding the Aryl Hydrocarbon Receptor signalling pathway that provided the data for the second task of the KDD 2002 Cup, where minority one-class SVMs significantly outperform models learnt using examples from both classes.

This paper explores the limits of supervised learning of a two-class discrimination from data with heavily unbalanced class proportions. We focus on the case of supervised learning with support vector machines. We consider the impact of both sampling and weighting imbalance compensation techniques and then extend the balancing to extreme situations when one of the classes is ignored completely and the learning is accomplished using examples from a single class.

Our investigation with the data for KDD 2002 Cup as well as text benchmarks such as Reuters Newswire shows that there is a consistent pattern of performance differences between one-class and two-class learning for all SVMs investigated, and these patterns persist even with aggressive dimensionality reduction through automated feature selection. Using insight gained from the above analysis, we generate synthetic data showing a similar pattern of performance.
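The three compensation regimes named in the abstract (weighting, sampling, and ignoring one class entirely) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the inverse-frequency weighting formula, and the toy 5-versus-95 data are all assumptions chosen for illustration, and no SVM is actually trained here.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Hypothetical helper: weight each class inversely to its frequency,
    using n / (k * n_c), where n is the sample count, k the number of
    classes, and n_c the size of class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def oversample_minority(examples, labels, seed=0):
    """Hypothetical helper: randomly duplicate examples of the smaller
    classes until every class matches the size of the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for c, m in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == c]
        for _ in range(target - m):
            out_x.append(rng.choice(pool))
            out_y.append(c)
    return out_x, out_y

def single_class_subset(examples, labels, keep):
    """Extreme case: drop every example that is not in class `keep`,
    leaving data suitable for a one-class learner."""
    return [x for x, y in zip(examples, labels) if y == keep]

# Toy data (an assumption): 5 minority examples (+1), 95 majority (-1).
labels = [1] * 5 + [-1] * 95
examples = list(range(100))

weights = balanced_class_weights(labels)          # weighting
xs, ys = oversample_minority(examples, labels)    # sampling
minority_only = single_class_subset(examples, labels, keep=1)  # one class
```

With this toy split, the minority class receives weight 10.0 and the majority class roughly 0.53, oversampling yields 95 examples per class, and the single-class subset keeps only the 5 minority examples. In practice the weights or resampled data would be passed to a two-class SVM, and the single-class subset to a one-class SVM.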