A study of thresholding strategies for text categorization
- 1 September 2001
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 137-145
- https://doi.org/10.1145/383952.383975
Abstract
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
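The three baseline strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the score-matrix layout (`scores[d][c]` is the classifier score of document `d` for category `c`), the PCut quota `round(x * prior * n_docs)`, and the use of F1 as the SCut tuning criterion are all assumptions made for illustration. RTCut and the other variants proposed in the paper are not covered.

```python
from typing import List

Matrix = List[List[float]]   # scores[d][c]: score of document d for category c
Labels = List[List[bool]]    # labels[d][c]: True if document d belongs to category c


def rcut(scores: Matrix, t: int) -> Labels:
    """Rank-based: each document is assigned to its t top-scoring categories."""
    decisions = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        top = set(ranked[:t])
        decisions.append([c in top for c in range(len(row))])
    return decisions


def pcut(scores: Matrix, priors: List[float], x: float) -> Labels:
    """Proportion-based: category c accepts its k top-scoring documents,
    with k = round(x * prior_c * n_docs) (an assumed form of the quota).
    The whole document batch is needed, so this cannot decide online."""
    n_docs = len(scores)
    decisions = [[False] * len(priors) for _ in range(n_docs)]
    for c, prior in enumerate(priors):
        k = min(n_docs, int(round(x * prior * n_docs)))
        ranked = sorted(range(n_docs), key=lambda d: scores[d][c], reverse=True)
        for d in ranked[:k]:
            decisions[d][c] = True
    return decisions


def f1(pred: List[bool], gold: List[bool]) -> float:
    """F1 of one category's binary decisions; 0.0 when there are no true positives."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if g and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut_fit(val_scores: Matrix, val_labels: Labels) -> List[float]:
    """Score-based: per category, pick the score threshold that maximizes
    F1 on a validation set (assumed criterion). Tuning each category
    separately is what makes overfitting a risk for rare categories."""
    thresholds = []
    for c in range(len(val_scores[0])):
        col = [row[c] for row in val_scores]
        gold = [row[c] for row in val_labels]
        best_t, best_f = float("inf"), 0.0   # start by rejecting everything
        for t in sorted(set(col), reverse=True):
            f = f1([s >= t for s in col], gold)
            if f > best_f:
                best_t, best_f = t, f
        thresholds.append(best_t)
    return thresholds


def scut_apply(scores: Matrix, thresholds: List[float]) -> Labels:
    """Assign a category whenever its score reaches the fitted threshold."""
    return [[s >= t for s, t in zip(row, thresholds)] for row in scores]
```

In this sketch RCut needs only one document's scores at a time, which matches the abstract's remark that it suits online response, while PCut ranks all documents per category and therefore needs the full batch. SCut is the only one of the three that requires labeled validation data to fit its thresholds.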