The effect of topic set size on retrieval experiment error

11 August 2002

proceedings article
Published by Association for Computing Machinery (ACM)

p. 316-323
https://doi.org/10.1145/564376.564432

Abstract

Retrieval mechanisms are frequently compared by computing the respective average scores for some effectiveness metric across a common set of information needs or topics, with researchers concluding one method is superior based on those averages. Since comparative retrieval system behavior is known to be highly variable across topics, good experimental design requires that a "sufficient" number of topics be used in the test. This paper uses TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores. The error rates quantify the likelihood that a different set of topics of the same size would lead to a different conclusion. We directly compute error rates for topic sets up to size 25, and extrapolate those rates for larger topic set sizes. The error rates found are larger than anticipated, indicating researchers need to take care when concluding one method is better than another, especially if few topics are used.
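
The abstract does not spell out the computation, so the following is only a minimal sketch of one way such an error rate could be estimated, not the authors' procedure. It assumes a table of per-topic scores for each system, repeatedly draws two disjoint topic sets of the same size, and counts how often the two sets disagree about which of two systems has the higher average, grouped by the size of the difference observed on the first set. The function name `swap_rates` and the parameters `bin_width`, `n_bins`, and `trials` are hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def swap_rates(scores, set_size, bin_width=0.01, n_bins=5, trials=200, seed=0):
    """Estimate how often two disjoint topic sets disagree on system order.

    scores: dict mapping system name -> list of per-topic scores, with all
            lists in the same topic order (an assumed input layout).
    Returns one rate per difference bin; None where a bin saw no comparisons.
    """
    rng = random.Random(seed)
    n_topics = len(next(iter(scores.values())))
    if 2 * set_size > n_topics:
        raise ValueError("need at least 2 * set_size topics")

    swaps = [0] * n_bins
    totals = [0] * n_bins
    for _ in range(trials):
        # Draw two disjoint topic sets of the requested size.
        picked = rng.sample(range(n_topics), 2 * set_size)
        set_a, set_b = picked[:set_size], picked[set_size:]
        for x, y in combinations(sorted(scores), 2):
            # Difference in average score between the two systems on each set.
            diff_a = sum(scores[x][t] - scores[y][t] for t in set_a) / set_size
            diff_b = sum(scores[x][t] - scores[y][t] for t in set_b) / set_size
            # Bin by the absolute difference seen on the first set;
            # the last bin collects everything larger.
            b = min(int(abs(diff_a) / bin_width), n_bins - 1)
            totals[b] += 1
            # A swap: the second set reverses the conclusion of the first.
            if diff_a * diff_b < 0:
                swaps[b] += 1
    return [s / t if t else None for s, t in zip(swaps, totals)]
```

Calling `swap_rates(scores, set_size=25)` on a real score table would give, for each bin of observed difference, the fraction of comparisons in which a second topic set of the same size reversed the conclusion. Larger topic sets and larger observed differences would be expected to produce lower rates.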
