The effect of topic set size on retrieval experiment error
- 11 August 2002
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 316-323
- https://doi.org/10.1145/564376.564432
Abstract
Retrieval mechanisms are frequently compared by computing the respective average scores for some effectiveness metric across a common set of information needs, or topics, with researchers concluding that one method is superior based on those averages. Since comparative retrieval system behavior is known to be highly variable across topics, good experimental design requires that a "sufficient" number of topics be used in the test. This paper uses TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores. The error rates quantify the likelihood that a different set of topics of the same size would lead to a different conclusion. We directly compute error rates for topic sets up to size 25, and extrapolate those rates for larger topic set sizes. The error rates found are larger than anticipated, indicating that researchers need to take care when concluding one method is better than another, especially if few topics are used.
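The abstract's core idea, that two disjoint topic sets of the same size may disagree about which system is better, can be sketched with a small resampling experiment. This is a minimal illustrative sketch, not the paper's exact procedure; the function name, the synthetic per-topic scores, and the choice of counting strict sign disagreements between the two splits are all assumptions for illustration.

```python
import random

def swap_rate(scores_a, scores_b, set_size, trials=1000, seed=0):
    """Estimate how often two disjoint topic sets of size `set_size`
    reach opposite conclusions about which of two systems is better.

    scores_a, scores_b: per-topic effectiveness scores (e.g. average
    precision) for systems A and B on the same topics. Hypothetical
    helper in the spirit of the paper's error-rate computation.
    """
    assert 2 * set_size <= len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    topics = list(range(len(scores_a)))
    disagreements = 0
    for _ in range(trials):
        rng.shuffle(topics)
        # Two disjoint topic sets of the same size.
        s1, s2 = topics[:set_size], topics[set_size:2 * set_size]
        diff1 = sum(scores_a[t] - scores_b[t] for t in s1) / set_size
        diff2 = sum(scores_a[t] - scores_b[t] for t in s2) / set_size
        # The splits disagree when the mean differences have opposite signs.
        if diff1 * diff2 < 0:
            disagreements += 1
    return disagreements / trials
```

With noisy synthetic scores, the estimated rate typically shrinks as `set_size` grows, mirroring the paper's observation that small topic sets make the comparison unreliable.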