Sources of unreliability and bias in standardized‐patient rating

Abstract
In tests of clinical competence, standardized patients (SPs) can be used both to present the clinical problem and to rate the actions taken by the examinee in the patient encounter. Both aspects of the “test” have the potential to contribute to unreliability and bias in measurement. In 1987, two universities collaborated to develop and administer the same SP test to clinical clerks in their respective institutions. This provided the opportunity to evaluate rating bias attributable to test site and three sources of rating unreliability within the same population of raters: inconsistencies within the same rater (within-rater reliability), inconsistencies between two raters trained at the same test site (between-raters reliability, same site), and inconsistencies between two raters trained at different test sites (between-raters reliability, different sites). A stratified random sample of 537 of the 2,560 examinee-patient encounters in the inter-university examination was videotaped, providing equal representation of the 16 test cases and the two universities. Videotaped encounters from both universities were rated by 44 SPs who had presented and rated the cases during the examination. The videotape and examination ratings were used to estimate systematic rating bias and the three types of rater reliability. Overall, rater reliability for individual items and for the overall encounter score was fair to good (.37 to .52). Consistent with these results, raters within cases accounted for 20% of the observed variance in student scores. Within-rater reliability was better than either type of between-raters reliability. Rater agreement was not influenced by test site, but systematic differences in score were present between test sites: Site 1 raters scored the same students, on average, 6.7% lower than Site 2 raters. These differences affected the proportion of students who would have failed the checklist portion of the test. In Site 1, 50% of the students rated had data-collection scores below 60%, whereas in Site 2 only 33% fell below the 60% cutoff. The implications of these findings for single-site and multi-site SP-based tests of competence are explored, and additional areas for research are identified.
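As an illustration only (the abstract does not specify the authors' exact analysis), reliability coefficients of this kind are commonly expressed as intraclass correlations computed from paired ratings of the same encounter, e.g., a live-exam rating versus a videotape rating by the same SP (within-rater) or ratings by two different SPs (between-raters). The sketch below, with hypothetical data, shows one way such a coefficient could be obtained; it is not the paper's method.

```python
# Minimal sketch (assumed, not from the paper): one-way random-effects
# intraclass correlation, ICC(1,1), for two ratings per encounter.
import numpy as np

def icc_one_way(pairs):
    """Estimate rater agreement from an (n, 2) array of paired ratings."""
    pairs = np.asarray(pairs, dtype=float)
    n, k = pairs.shape                                   # n encounters, k = 2 ratings each
    grand = pairs.mean()
    # Between-encounter and within-encounter mean squares from one-way ANOVA.
    ms_between = k * ((pairs.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((pairs - pairs.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical checklist scores (percent correct) for six videotaped encounters,
# each rated twice (e.g., during the exam and again from videotape).
example = [(62, 58), (71, 75), (55, 49), (80, 77), (66, 70), (59, 61)]
print(round(icc_one_way(example), 2))
```

Coefficients in the .37 to .52 range reported above would correspond, under this kind of model, to raters contributing a substantial share of the score variance, consistent with the 20% figure attributed to raters within cases.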