Classification of Cancer Stage from Free-text Histology Reports

Abstract
This article investigates the classification of a patient's lung cancer stage based on analysis of their free-text medical reports. The system uses natural language processing to transform the report text, including identification of UMLS terms and detection of negated findings. The transformed report is then classified using statistical machine learning techniques. A support vector machine is trained for each stage category based on word occurrences in a corpus of histology reports for pathologically staged patients. New reports can be classified according to the most likely stage, allowing the collection of population stage data for analysis of outcomes. While the system could in principle be applied to stage different cancer types, the current work focuses on lung cancer due to data availability. The article presents initial experiments quantifying system performance for T and N staging on a corpus of histology reports from more than 700 lung cancer patients.
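
The pipeline described in the abstract (mark negated findings, count word occurrences, train one classifier per stage category, pick the most likely stage) can be illustrated with a small sketch. This is not the authors' implementation. The negation cue list, the three-token negation window, the four toy reports, the hyperparameters, and all function names are assumptions made for illustration. UMLS term identification is omitted, and the linear SVM is trained with a plain subgradient method on the regularized hinge loss rather than with a dedicated SVM library.

```python
import re
from collections import Counter

# Illustrative negation cues (an assumption, not a complete lexicon).
NEGATION_CUES = {"no", "not", "without", "negative"}

def tokenize(report, window=3):
    """Lower-case the text and prefix up to `window` tokens after a
    negation cue with NEG_; the scope never crosses a sentence boundary."""
    tokens = []
    for sentence in re.split(r"[.;]", report.lower()):
        remaining = 0
        for word in re.findall(r"[a-z0-9]+", sentence):
            if word in NEGATION_CUES:
                remaining = window
                tokens.append(word)
            elif remaining > 0:
                tokens.append("NEG_" + word)
                remaining -= 1
            else:
                tokens.append(word)
    return tokens

def featurize(report):
    """Bag of word occurrences, with negated words kept distinct."""
    return Counter(tokenize(report))

def score(model, features):
    weights, bias = model
    return sum(weights[t] * n for t, n in features.items()) + bias

def train_binary_svm(samples, targets, epochs=20, lr=0.1, reg=0.01):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.
    `targets` holds +1 for the stage being learned and -1 otherwise."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for features, y in zip(samples, targets):
            margin = y * score((weights, bias), features)
            for t in list(weights):          # shrink step from the L2 term
                weights[t] *= 1.0 - lr * reg
            if margin < 1.0:                 # hinge loss is active
                for t, n in features.items():
                    weights[t] += lr * y * n
                bias += lr * y
    return weights, bias

def train_one_vs_rest(reports, stages):
    """Train one binary SVM per stage category."""
    samples = [featurize(r) for r in reports]
    return {
        s: train_binary_svm(samples, [1 if x == s else -1 for x in stages])
        for s in sorted(set(stages))
    }

def classify(models, report):
    """Return the stage whose classifier gives the highest score."""
    features = featurize(report)
    return max(models, key=lambda s: score(models[s], features))

# Invented toy reports with N-stage labels (not data from the article).
REPORTS = [
    "No metastasis in lymph nodes.",
    "Lymph nodes negative for tumour.",
    "Metastasis present in hilar lymph nodes.",
    "Hilar nodes contain metastatic tumour.",
]
STAGES = ["N0", "N0", "N1", "N1"]

models = train_one_vs_rest(REPORTS, STAGES)
print(classify(models, "No metastasis in hilar nodes."))
print(classify(models, "Metastasis present in hilar nodes."))
```

The two test sentences share most of their words. Because negated tokens are kept as separate features, the classifiers are expected to assign them to N0 and N1 respectively on this toy data, which is the reason for detecting negation before counting words.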
