High-dimensional similarity joins
- 7 August 2002
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 14 (1) , 156-171
- https://doi.org/10.1109/69.979979
Abstract
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the /spl epsi/ tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the /spl epsi/ tree and the R-tree family, and show that the /spl epsi/ tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the /spl epsi/ tree is twice to an order of magnitude faster than the R/sup +/ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the /spl epsi/ tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the /spl epsi/ tree.Keywords
This publication has 10 references indexed in Scilit:
- Independent quantization: an index compression technique for high-dimensional data spacesPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- FastMapPublished by Association for Computing Machinery (ACM) ,1995
- The TV-tree: An index structure for high-dimensional dataThe VLDB Journal, 1994
- Spatial joins using seeded treesPublished by Association for Computing Machinery (ACM) ,1994
- Efficient similarity search in sequence databasesPublished by Springer Nature ,1993
- The R*-tree: an efficient and robust access method for points and rectanglesPublished by Association for Computing Machinery (ACM) ,1990
- The Grid FileACM Transactions on Database Systems, 1984
- R-treesPublished by Association for Computing Machinery (ACM) ,1984
- The K-D-B-treePublished by Association for Computing Machinery (ACM) ,1981
- Multidimensional binary search trees used for associative searchingCommunications of the ACM, 1975