Information extraction as a basis for high-precision text classification

1 July 1994

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Information Systems

Vol. 12 (3), 296-333
https://doi.org/10.1145/183422.183428

Abstract

We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called “information extraction” as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The relevancy signatures algorithm uses linguistic phrases; the augmented relevancy signatures algorithm uses phrases and local context; and the case-based text classification algorithm uses larger pieces of context. Relevant phrases and contexts are acquired automatically using a training corpus. We evaluate the algorithms on the basis of two test sets from the MUC-4 corpus. All three algorithms achieved high precision on both test sets, with the augmented relevancy signatures algorithm and the case-based algorithm reaching 100% precision with over 60% recall on one set. Additionally, we compare the algorithms on a larger collection of 1700 texts and describe an automated method for empirically deriving appropriate threshold values. The results suggest that information extraction techniques can support high-precision text classification and, in general, that using more extracted information improves performance. As a practical matter, we also explain how the text classification system can be easily ported across domains.
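
As a rough illustration of the phrase-based idea, the sketch below learns a set of "relevant" phrases from labeled training texts and then flags a new text as relevant if it contains any of them. The abstract does not give the selection criterion, so everything specific here is an assumption made for illustration: the representation of a signature as a (word, pattern) pair, the two thresholds `min_rate` and `min_count`, counting each signature once per text, and the toy data. It should not be read as the authors' implementation.

```python
from collections import Counter


def learn_relevant_signatures(training, min_rate=0.9, min_count=2):
    """Pick signatures that occur mostly in relevant training texts.

    training: list of (signatures, is_relevant) pairs, one per text.
    min_rate and min_count are hypothetical thresholds (assumptions).
    """
    seen = Counter()              # texts containing the signature
    seen_in_relevant = Counter()  # relevant texts containing it
    for signatures, is_relevant in training:
        for sig in set(signatures):  # count once per text (assumption)
            seen[sig] += 1
            if is_relevant:
                seen_in_relevant[sig] += 1
    return {
        sig
        for sig, n in seen.items()
        if n >= min_count and seen_in_relevant[sig] / n >= min_rate
    }


def classify(signatures, relevant_signatures):
    """A text is labeled relevant if it contains any relevant signature."""
    return any(sig in relevant_signatures for sig in signatures)


def precision_recall(predictions, labels):
    """Precision and recall for boolean predictions against boolean labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for this sketch.
BOMBED = ("bombed", "passive-verb")
DEAD = ("dead", "noun")
VISITED = ("visited", "active-verb")

training = [
    ([BOMBED], True),
    ([BOMBED, DEAD], True),
    ([DEAD], False),
    ([DEAD], True),
    ([VISITED], False),
]
test_texts = [([BOMBED], True), ([DEAD], True), ([VISITED], False)]

relevant = learn_relevant_signatures(training)
predictions = [classify(sigs, relevant) for sigs, _ in test_texts]
labels = [label for _, label in test_texts]
precision, recall = precision_recall(predictions, labels)
print(relevant, precision, recall)
```

On this toy data only the "bombed" signature passes both thresholds, so the classifier makes no false positives but misses one relevant text. That is the trade-off the abstract reports: strict thresholds favor precision over recall.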
