Feature selection, perceptron learning, and a usability case study for text categorization

1 July 1997

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGIR Forum

Vol. 31 (SI), 67-73
https://doi.org/10.1145/278459.258537

Abstract

In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called the correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection. In particular, our new feature selection method yields considerable improvement. We also investigate the usability of our automated learning approach by building a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to that of a rule-based, expert system approach that uses a text categorization shell built by Carnegie Group. Although our automated learning approach alone still gives lower accuracy, incorporating a set of manually chosen words as features yields a combined, semi-automated approach whose accuracy is close to that of the rule-based approach.
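
The abstract names two components, feature selection by correlation coefficient and perceptron learning, without giving their details. The following is a minimal sketch of how the two might fit together for a single binary category. It rests on assumptions not stated in the abstract: the correlation coefficient is taken to be the signed square root of the chi-square statistic over a 2x2 word/category contingency table, documents are represented as sets of words, and the perceptron uses a plain unit-step update rule. All function names and the toy data are hypothetical.

```python
import math


def correlation_coefficient(rel_with, rel_without, non_with, non_without):
    """Signed square root of chi-square for one (word, category) pair.

    Assumed definition, not taken from the abstract. Arguments are counts of
    relevant / non-relevant training documents with / without the word.
    """
    n = rel_with + rel_without + non_with + non_without
    denom = math.sqrt(
        (rel_with + rel_without) * (non_with + non_without)
        * (rel_with + non_with) * (rel_without + non_without)
    )
    if denom == 0:
        return 0.0  # word or category is constant; no signal either way
    return (rel_with * non_without - rel_without * non_with) * math.sqrt(n) / denom


def select_features(docs, labels, k):
    """Return the k words with the highest correlation coefficient."""
    vocab = set().union(*docs)
    n_rel = sum(labels)
    n_non = len(labels) - n_rel
    scores = {}
    for word in vocab:
        rel_with = sum(1 for d, y in zip(docs, labels) if y and word in d)
        non_with = sum(1 for d, y in zip(docs, labels) if not y and word in d)
        scores[word] = correlation_coefficient(
            rel_with, n_rel - rel_with, non_with, n_non - non_with
        )
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-scores[w], w))[:k]


def train_perceptron(docs, labels, features, epochs=10):
    """Train a binary perceptron over the presence of the selected words."""
    weights = {f: 0.0 for f in features}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for doc, label in zip(docs, labels):
            active = [f for f in features if f in doc]
            predicted = bias + sum(weights[f] for f in active) > 0
            if predicted != label:
                step = 1.0 if label else -1.0  # move toward the true label
                for f in active:
                    weights[f] += step
                bias += step
                mistakes += 1
        if mistakes == 0:
            break  # every training document is classified correctly
    return weights, bias


def predict(doc, weights, bias):
    """Classify a document (a set of words) with the trained perceptron."""
    return bias + sum(w for f, w in weights.items() if f in doc) > 0


# Toy data (hypothetical): two documents in the category, two outside it.
docs = [{"wheat", "export"}, {"wheat", "farm"}, {"oil", "export"}, {"oil", "price"}]
labels = [True, True, False, False]
features = select_features(docs, labels, k=2)
weights, bias = train_perceptron(docs, labels, features)
```

Unlike chi-square itself, the signed statistic in this sketch ranks words that indicate membership in the category above words that indicate non-membership, so only the former are selected as features.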
