On the "dimensionality curse" and the "self-similarity blessing"
- 1 January 2001
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 13 (1) , 96-111
- https://doi.org/10.1109/69.908983
Abstract
Spatial queries in high-dimensional spaces have been studied extensively. Among them, nearest neighbor queries are important in many settings, including spatial databases (Find the k closest cities) and multimedia databases (Find the k most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality". We show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g. due to subtle dependencies between attributes. We show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.Keywords
This publication has 24 references indexed in Scilit:
- Fast nearest neighbor search in high-dimensional spacePublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- STR: a simple and efficient algorithm for R-tree packingPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Independent quantization: an index compression technique for high-dimensional data spacesPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- A cost model for nearest neighbor search in high-dimensional data spacePublished by Association for Computing Machinery (ACM) ,1997
- On the analysis of indexing schemesPublished by Association for Computing Machinery (ACM) ,1997
- Fast parallel similarity search in multimedia databasesPublished by Association for Computing Machinery (ACM) ,1997
- The SR-treePublished by Association for Computing Machinery (ACM) ,1997
- Intelligent access to digital video: Informedia projectComputer, 1996
- Eigenfaces for RecognitionJournal of Cognitive Neuroscience, 1991
- R-treesPublished by Association for Computing Machinery (ACM) ,1984