Information extraction as a basis for high-precision text classification
- 1 July 1994
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Information Systems
- Vol. 12 (3), 296-333
- https://doi.org/10.1145/183422.183428
Abstract
We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called “information extraction” as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The relevancy signatures algorithm uses linguistic phrases; the augmented relevancy signatures algorithm uses phrases and local context; and the case-based text classification algorithm uses larger pieces of context. Relevant phrases and contexts are acquired automatically using a training corpus. We evaluate the algorithms on the basis of two test sets from the MUC-4 corpus. All three algorithms achieved high precision on both test sets, with the augmented relevancy signatures algorithm and the case-based algorithm reaching 100% precision with over 60% recall on one set. Additionally, we compare the algorithms on a larger collection of 1700 texts and describe an automated method for empirically deriving appropriate threshold values. The results suggest that information extraction techniques can support high-precision text classification and, in general, that using more extracted information improves performance. As a practical matter, we also explain how the text classification system can be easily ported across domains.
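The abstract describes acquiring relevant phrases from a training corpus and then classifying texts with high precision. The sketch below is one loose way to illustrate that idea and is not the paper's implementation. It assumes each text has already been reduced to a set of extracted phrase identifiers, keeps only phrases that occur often enough and almost always in relevant training texts, and labels a new text relevant if it contains any such phrase. The function names, the two thresholds (`min_prob`, `min_count`), and the toy data are all hypothetical.

```python
from collections import Counter


def train_signatures(corpus, min_prob=0.9, min_count=2):
    """Return the set of phrases judged reliable indicators of relevance.

    corpus: iterable of (phrases, is_relevant) pairs, one per training text.
    min_prob, min_count: illustrative thresholds (assumed, not from the paper).
    """
    total = Counter()      # number of training texts containing each phrase
    relevant = Counter()   # number of *relevant* texts containing each phrase
    for phrases, is_relevant in corpus:
        for phrase in set(phrases):
            total[phrase] += 1
            if is_relevant:
                relevant[phrase] += 1
    return {
        phrase
        for phrase, n in total.items()
        if n >= min_count and relevant[phrase] / n >= min_prob
    }


def classify(phrases, reliable):
    """A text is labeled relevant if it contains any reliable phrase."""
    return any(phrase in reliable for phrase in phrases)


def precision_recall(predictions, labels):
    """Compute (precision, recall) for boolean predictions against labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training corpus (hypothetical phrase identifiers and labels).
training = [
    ({"was_bombed", "found_dead"}, True),
    ({"was_bombed"}, True),
    ({"found_dead"}, False),
]
reliable = train_signatures(training)

# Toy test set: phrase sets with their true relevance labels.
test_texts = [{"was_bombed"}, {"found_dead"}, {"was_bombed", "found_dead"}, set()]
test_labels = [True, True, True, False]
predictions = [classify(t, reliable) for t in test_texts]
precision, recall = precision_recall(predictions, test_labels)
```

Requiring a high `min_prob` trades recall for precision: a phrase that also appears in irrelevant training texts is discarded, so texts containing only such phrases are missed. Threshold values like these would need to be chosen empirically on held-out data.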