Understanding Scoring Reliability: Experiments in Calibrating Essay Readers
- 1 March 1988
- journal article
- Published by American Educational Research Association (AERA) in Journal of Educational Statistics
- Vol. 13 (1), 1-18
- https://doi.org/10.3102/10769986013001001
Abstract
Scoring reliability of essays and other free-response questions is of considerable concern, especially in large, national administrations. This report describes a statistically designed experiment that was carried out in an operational setting to determine the contributions of different sources of variation to the unreliability of scoring. The experiment made novel use of partially balanced incomplete block designs that facilitated the unbiased estimation of certain main effects without requiring readers to assess the same paper several times. In addition, estimates were obtained of the improvement in reliability that results from removing variability from systematic sources of variation by an appropriate adjustment of the raw scores. This statistical calibration appears to be a cost-effective approach to enhancing scoring reliability when compared to simply increasing the number of readings per paper. The results of the experiment also provide a framework for examining other, simpler calibration strategies. One such strategy is briefly considered.
This publication has 11 references indexed in Scilit:
- A PRELIMINARY STUDY OF RATERS FOR THE TEST OF SPOKEN ENGLISH. ETS Research Report Series, 1985
- ESTIMATING THE RELIABILITY, VALIDITY, AND INVALIDITY OF ESSAY RATINGS. Journal of Educational Measurement, 1985
- Two Simple Models for Rater Effects. Applied Psychological Measurement, 1984
- Bayesian methods for calibration of examiners. British Journal of Mathematical and Statistical Psychology, 1981
- Balanced Incomplete Block Designs for Inter-Rater Reliability Studies. Applied Psychological Measurement, 1981
- Analysis-of-Variance Principles Applied to the Grading of Essay Tests. The Journal of Experimental Education, 1962
- Analysis of unreplicated three-way classifications, with applications to rater bias and trait independence. Psychometrika, 1961
- Estimation of the Reliability of Ratings. Psychometrika, 1951
- THE RELIABILITY OF THE MARKING OF ESSAYS. British Journal of Educational Psychology, 1951
- Theory of mental tests. Published by American Psychological Association (APA), 1950