Optimal signature extraction and information loss
- 1 September 1987
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Database Systems
- Vol. 12 (3) , 395-428
- https://doi.org/10.1145/27629.214285
Abstract
Signature files seem to be a promising access method for text and attributes. According to this method, the documents (or records) are stored sequentially in one file ("text file"), while abstractions of the documents ("signatures") are stored sequentially in another file ("signature file"). In order to resolve a query, the signature file is scanned first, and many nonqualifying documents are immediately rejected. We develop a framework that includes primary key hashing, multiattribute hashing, and signature files. Our effort is to find the optimal signature extraction method. The main contribution of this paper is that we present optimal and efficient suboptimal algorithms for assigning words to signatures in several environments. Another contribution is that we use information theory, and study the relationship of the false drop probability F d and the information that is lost during signature extraction. We give tight lower bounds on the achievable F d and show that a simple relationship holds between the two quantities in the case of optimal signature extraction with uniform occurrence and query frequencies. We examine hashing as a method to map words to signatures (instead of the optimal way), and show that the same relationship holds between F d and loss , indicating that an invariant may exist between these two quantities for every signature extraction method.Keywords
This publication has 23 references indexed in Scilit:
- Signature filesACM Transactions on Information Systems, 1984
- The study of an ordered minimal perfect hashing schemeCommunications of the ACM, 1984
- Design Considerations for a Message File ServerIEEE Transactions on Software Engineering, 1984
- Partial-match retrieval for dynamic filesBIT Numerical Mathematics, 1982
- Reciprocal hashingCommunications of the ACM, 1981
- Optimal partial-match retrievalBIT Numerical Mathematics, 1980
- Optimal partial-match retrieval when fields are independently specifiedACM Transactions on Database Systems, 1979
- Universal classes of hash functionsJournal of Computer and System Sciences, 1979
- Implementation of the substring test by hashingCommunications of the ACM, 1971
- A Method for the Construction of Minimum-Redundancy CodesProceedings of the IRE, 1952