A Reliability Study for Evaluating Information Extraction from Radiology Reports
Open Access
- 1 March 1999
- journal article
- Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association
- Vol. 6 (2) , 143-150
- https://doi.org/10.1136/jamia.1999.0060143
Abstract
Goal: To assess the reliability of a reference standard for an information extraction task.

Setting: Twenty-four physician raters from two sites and two specialties judged whether clinical conditions were present based on reading chest radiograph reports.

Methods: Variance components, generalizability (reliability) coefficients, and the number of expert raters needed to generate a reliable reference standard were estimated.

Results: Per-rater reliability averaged across conditions was 0.80 (95% CI, 0.79–0.81). Reliability for the nine individual conditions varied from 0.67 to 0.97, with central line presence and pneumothorax the most reliable, and pleural effusion (excluding CHF) and pneumonia the least reliable. One to two raters were needed to achieve a reliability of 0.70, and six raters, on average, were required to achieve a reliability of 0.95. This was far more reliable than a previously published per-rater reliability of 0.19 for a more complex task. Differences between sites were attributable to changes to the condition definitions.

Conclusion: In these evaluations, physician raters were able to judge very reliably the presence of clinical conditions based on text reports. Once the reliability of a specific rater is confirmed, it would be possible for that rater to create a reference standard reliable enough to assess aggregate measures on a system. Six raters would be needed to create a reference standard sufficient to assess a system on a case-by-case basis. These results should help evaluators design future information extraction studies for natural language processors and other knowledge-based systems.
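The rater-count figures in the Results are the kind of estimate produced by stepping up a per-rater generalizability coefficient to the reliability of a panel's mean judgment. As a minimal sketch, assuming the standard Spearman–Brown step-up formula (the function names below are illustrative, not from the paper):

```python
import math

def stepped_up_reliability(per_rater_rel, n_raters):
    """Spearman-Brown step-up: reliability of the mean judgment of n raters,
    given the reliability of a single rater."""
    return (n_raters * per_rater_rel) / (1 + (n_raters - 1) * per_rater_rel)

def raters_needed(per_rater_rel, target):
    """Smallest number of raters whose averaged judgment reaches the target
    reliability (inverse of the step-up formula, rounded up)."""
    k = (target * (1 - per_rater_rel)) / (per_rater_rel * (1 - target))
    return math.ceil(k)

# With the study's average per-rater reliability of 0.80, a single rater
# already exceeds the 0.70 target:
print(raters_needed(0.80, 0.70))  # -> 1
print(raters_needed(0.80, 0.95))  # -> 5
```

Note that the average per-rater reliability of 0.80 gives five raters for a 0.95 target under this formula; the paper's figure of six raters is an average over conditions whose individual reliabilities ranged from 0.67 to 0.97, and the least reliable conditions require more raters (a per-rater reliability of 0.67 needs ten by this calculation).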
This publication has 14 references indexed in Scilit:
- Respiratory Isolation of Tuberculosis Patients Using Clinical Guidelines and an Automated Clinical Decision Support System. Infection Control & Hospital Epidemiology, 1998
- Extracting Findings from Narrative Reports: Software Transferability and Sources of Physician Disagreement. Methods of Information in Medicine, 1998
- Knowledge discovery and data mining to assist natural language understanding. 1998
- An evaluation of natural language processing methodologies. 1998
- Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing. Annals of Internal Medicine, 1995
- Natural language processing in an operational clinical information system. Natural Language Engineering, 1995
- Performance of Four Computer-Based Diagnostic Systems. New England Journal of Medicine, 1994
- Validation of the medical expert system PNEUMON-IA. Computers and Biomedical Research, 1992
- Comparison of computer-aided and human review of general practitioners' management of hypertension. The Lancet, 1991
- Generalizability theory. American Psychologist, 1989