Efficient set joins on similarity predicates
Top Cited Papers
- 13 June 2004
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 743-754
- https://doi.org/10.1145/1007568.1007652
Abstract
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.Keywords
This publication has 12 references indexed in Scilit:
- Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree ApproachData Mining and Knowledge Discovery, 2004
- Robust and efficient fuzzy match for online data cleaningPublished by Association for Computing Machinery (ACM) ,2003
- Efficient processing of joins on set-valued attributesPublished by Association for Computing Machinery (ACM) ,2003
- Efficient single‐pass index construction for text databasesJournal of the American Society for Information Science and Technology, 2003
- Interactive deduplication using active learningPublished by Association for Computing Machinery (ACM) ,2002
- Finding interesting associations without support pruningIEEE Transactions on Knowledge and Data Engineering, 2001
- Data integration using similarity joins and a word-based information representation languageACM Transactions on Information Systems, 2000
- Selectively estimation for Boolean queriesPublished by Association for Computing Machinery (ACM) ,2000
- Approximating Matrix Multiplication for Pattern Recognition TasksJournal of Algorithms, 1999
- Text compression for dynamic document databasesIEEE Transactions on Knowledge and Data Engineering, 1997