Customized information extraction as a basis for resource discovery
- 1 May 1996
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems
- Vol. 14 (2), 171-199
- https://doi.org/10.1145/227695.227697
Abstract
Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed “tar” files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
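The idea of associating type-specific extraction methods with ordinary files, including files with nested structure, can be illustrated with a minimal sketch. This is not the Essence implementation: the registry `SUMMARIZERS`, the decorator `summarizer`, the function `identify_type`, the individual extraction rules, and the use of file name suffixes to guess a type are all assumptions made for illustration only.

```python
import io
import tarfile
from typing import Callable, Dict, List

# Hypothetical registry: type name -> extraction method (bytes -> summary lines).
SUMMARIZERS: Dict[str, Callable[[bytes], List[str]]] = {}


def summarizer(file_type: str):
    """Register an extraction method for one file type."""
    def register(fn):
        SUMMARIZERS[file_type] = fn
        return fn
    return register


def identify_type(name: str) -> str:
    """Guess a type from the file name suffix (an assumption of this sketch)."""
    if name.endswith(".tar"):
        return "tar"
    if name.endswith(".c"):
        return "c_source"
    return "text"


@summarizer("text")
def summarize_text(data: bytes) -> List[str]:
    # Illustrative rule: keep only the first line of plain text.
    return data.decode("utf-8", errors="replace").splitlines()[:1]


@summarizer("c_source")
def summarize_c(data: bytes) -> List[str]:
    # Illustrative rule: keep only the #include lines of a C source file.
    text = data.decode("utf-8", errors="replace")
    return [ln.strip() for ln in text.splitlines()
            if ln.lstrip().startswith("#include")]


@summarizer("tar")
def summarize_tar(data: bytes) -> List[str]:
    # Nested structure: unpack in memory and summarize each member by its own type.
    summary: List[str] = []
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                content = archive.extractfile(member).read()
                summary.extend(summarize(member.name, content))
    return summary


def summarize(name: str, data: bytes) -> List[str]:
    """Dispatch to the extraction method registered for the file's type."""
    return SUMMARIZERS[identify_type(name)](data)
```

In this sketch, supporting a new file type means registering one more function under a new type name, and an archive is summarized by recursively applying the same dispatch to each of its members.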
This publication has 11 references indexed in Scilit:
- The Harvest information discovery and access system. Computer Networks and ISDN Systems, 1995
- Scalable Internet resource discovery. Communications of the ACM, 1994
- MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. Published by RFC Editor, 1993
- The MD5 Message-Digest Algorithm. Published by RFC Editor, 1992
- World‐Wide Web: The Information Universe. Internet Research, 1992
- Another look at automatic text-retrieval systems. Communications of the ACM, 1986
- A trace-driven analysis of the UNIX 4.2 BSD file system. Published by Association for Computing Machinery (ACM), 1985
- File Transfer Protocol. Published by RFC Editor, 1985
- An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 1985
- A Technique for High-Performance Data Compression. Computer, 1984