An analysis of the relative hardness of Reuters‐21578 subsets

4 February 2005

journal article
research article
Published by Wiley in Journal of the American Society for Information Science and Technology

Vol. 56 (6), 584-596
https://doi.org/10.1002/asi.20147

Abstract

The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters‐21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have been somewhat limited by the fact that different researchers have “carved” different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters‐21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters‐21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.
