Sources of unreliability and bias in standardized‐patient rating

Abstract
In tests of clinical competence, standardized patients (SPs) can be used both to present the clinical problem and to rate the actions taken by the examinee in the patient encounter. Both aspects of the “test” have the potential to contribute to unreliability and bias in measurement. In 1987, two universities collaborated to develop and administer the same SP test to clinical clerks in their respective institutions. This provided the opportunity to evaluate rating bias attributable to test site and three sources of rating unreliability within the same population of raters: inconsistencies within the same rater (within-rater reliability), inconsistencies between two raters trained at the same test site (between-raters reliability, same site), and inconsistencies between two raters trained at different test sites (between-raters reliability, different sites). A stratified random sample of 537 of the 2,560 examinee-patient encounters in the inter-university examination was videotaped, providing equal representation of the 16 test cases and the two universities. Videotaped encounters from both universities were rated by 44 SPs who had presented and rated the cases during the examination. The videotape and examination ratings were used to estimate systematic rating bias and the three types of rater reliability. Overall, rater reliability for individual items and for the overall encounter score was fair to good (.37 to .52). Consistent with these results, raters within cases accounted for 20% of the observed variance in student scores. Within-rater reliability was better than either type of between-raters reliability. Rater agreement was not influenced by test site, but systematic differences in score were present between test sites: Site 1 raters scored the same students, on average, 6.7% lower than Site 2 raters. These differences affected the proportion of students who would have failed the checklist portion of the test. In Site 1, 50% of the students rated had data-collection scores below 60%, whereas in Site 2 only 33% fell below the 60% cutoff. The implications of these findings for single-site and multi-site SP-based tests of competence are explored, and additional areas for research are identified.
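As an illustration only (the abstract does not specify the authors' exact analysis), reliability coefficients of this kind are commonly expressed as intraclass correlations computed from paired ratings of the same encounter, e.g., a live-exam rating versus a videotape rating by the same SP (within-rater) or ratings by two different SPs (between-raters). The sketch below, with hypothetical data, shows one way such a coefficient could be obtained; it is not the paper's method.

```python
# Minimal sketch (assumed, not from the paper): one-way random-effects
# intraclass correlation, ICC(1,1), for two ratings per encounter.
import numpy as np

def icc_one_way(pairs):
    """Estimate rater agreement from an (n, 2) array of paired ratings."""
    pairs = np.asarray(pairs, dtype=float)
    n, k = pairs.shape                                   # n encounters, k = 2 ratings each
    grand = pairs.mean()
    # Between-encounter and within-encounter mean squares from one-way ANOVA.
    ms_between = k * ((pairs.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((pairs - pairs.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical checklist scores (percent correct) for six videotaped encounters,
# each rated twice (e.g., during the exam and again from videotape).
example = [(62, 58), (71, 75), (55, 49), (80, 77), (66, 70), (59, 61)]
print(round(icc_one_way(example), 2))
```

Coefficients in the .37 to .52 range reported above would correspond, under this kind of model, to raters contributing a substantial share of the score variance, consistent with the 20% figure attributed to raters within cases.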