GESS
- 26 August 2001
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance &egr;.One of the most efficient algorithms for processing similarity-joins is the Multidimensional-Spatial Join (MSJ) by Koudas and Sevcik. In our previous work --- pursued for the two-dimensional case --- we found however that MSJ has several performance shortcomings in terms of CPU and I/O cost as well as memory-requirements. Therefore, MSJ is not generally applicable to high-dimensional data.In this paper, we propose a new algorithm named Generic External Space Sweep (GESS). GESS introduces a modest rate of data replication to reduce the number of expensive distance computations. We present a new cost-model for replication, an I/O model, and an inexpensive method for duplicate removal. The principal component of our algorithm is a highly flexible replication engine.Our analytical model predicts a tremendous reduction of the number of expensive distance computations by several orders of magnitude in comparison to MSJ (factor 107). In addition, the memory requirements of GESS are shown to be lower by several orders of magnitude. Furthermore, the I/O cost of our algorithm is by factor 2 better (independent from the fact whether replication occurs or not). Our analytical results are confirmed by a large series of simulations and experiments with synthetic and real high-dimensional data sets.Keywords
This publication has 12 references indexed in Scilit:
- High performance clustering based on the similarity joinPublished by Association for Computing Machinery (ACM) ,2000
- Data integration using similarity joins and a word-based information representation languageACM Transactions on Information Systems, 2000
- javax.XXLPublished by Association for Computing Machinery (ACM) ,2000
- Spatial join selectivity using power lawsPublished by Association for Computing Machinery (ACM) ,2000
- High dimensional similarity joins: algorithms and performance evaluationIEEE Transactions on Knowledge and Data Engineering, 2000
- Self-spacial join selectivity estimation using fractal conceptsACM Transactions on Information Systems, 1998
- Size separation spatial joinPublished by Association for Computing Machinery (ACM) ,1997
- Partition based spatial-merge joinPublished by Association for Computing Machinery (ACM) ,1996
- Spatial hash-joinsPublished by Association for Computing Machinery (ACM) ,1996
- The query by image content (QBIC) systemPublished by Association for Computing Machinery (ACM) ,1995