Efficient and tumble similar set retrieval
- 1 May 2001
- journal article
- Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record
- Vol. 30 (2) , 247-258
- https://doi.org/10.1145/376284.375689
Abstract
Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle we create structures that make use of well known and widely used data structuring techniques, as a means to ease integration with existing infrastructure. We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing suitably encoded (in a way that preserves similarity) binary vectors in Hamming space thus, reducing the problem to one of similarity query processing in Hamming space. Then, we introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote for indexing. Finally we present experimental results from a prototype implementation of our techniques using real life datasets exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.Keywords
This publication has 8 references indexed in Scilit:
- Selectively estimation for Boolean queriesPublished by Association for Computing Machinery (ACM) ,2000
- Multidimensional access methodsACM Computing Surveys, 1998
- Approximate nearest neighborsPublished by Association for Computing Machinery (ACM) ,1998
- Min-wise independent permutations (extended abstract)Published by Association for Computing Machinery (ACM) ,1998
- Size-Estimation Framework with Applications to Transitive Closure and ReachabilityJournal of Computer and System Sciences, 1997
- Fast subsequence matching in time-series databasesPublished by Association for Computing Machinery (ACM) ,1994
- Evaluation of signature files as set access facilities in OODBsPublished by Association for Computing Machinery (ACM) ,1993
- Access methods for textACM Computing Surveys, 1985