Imitating Manual Curation of Text-Mined Facts in Biomedicine
Open Access
- 8 September 2006
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 2 (9), e118
- https://doi.org/10.1371/journal.pcbi.0020118
Abstract
Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts in order to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

Current automated approaches for extracting biologically important facts from scientific articles are imperfect: while being capable of efficient, fast, and inexpensive analysis of enormous quantities of scientific prose, they make errors. To emulate the human experts evaluating the quality of the automatically extracted facts, we have developed an artificial intelligence program (“a robotic curator”) that closely approaches human experts in the quality of distinguishing the correctly extracted facts from the incorrectly extracted ones.
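The ROC score quoted in the abstract is an area under the receiver operating characteristic curve. As a rough illustration only, the sketch below computes that area through its pairwise-ranking interpretation: the probability that a randomly chosen correctly extracted fact receives a higher classifier score than a randomly chosen incorrectly extracted one. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper; the authors' actual evaluation pipeline is not described in this abstract.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 if human evaluators judged the fact correctly extracted, else 0.
    scores: a classifier's predicted probability that the fact is correct.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption, for illustration only).
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 0.9375
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value near 0.95 means the classifier ranks a correct fact above an incorrect one in roughly 95% of such pairs. The quadratic loop is chosen for readability; a sort-based rank-sum would scale better on a set of about 100,000 evaluations.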