A parallel algorithm for record clustering
- 1 December 1990
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Database Systems
- Vol. 15 (4) , 599-624
- https://doi.org/10.1145/99935.99947
Abstract
We present an efficient heuristic algorithm for record clustering that can run on a SIMD machine. We introduce the P-tree, and its associated numbering scheme, which in the split phase allows each processor independently to compute the unique cluster number of a record satisfying an arbitrary query. We show that by restricting ourselves in the merge phase to combining only sibling clusters, we obtain a parallel algorithm whose speedup ratio is optimal in the number of processors used. Finally, we report on experiments showing that our method produces substantial savings in an enviornment with relatively little overlap among the queries.Keywords
This publication has 16 references indexed in Scilit:
- A taxonomy of parallel sortingACM Computing Surveys, 1984
- Parallel graph algorithmsACM Computing Surveys, 1984
- The Grid FileACM Transactions on Database Systems, 1984
- Parallel algorithms for the execution of relational database operationsACM Transactions on Database Systems, 1983
- Linear sorting withO(logn) processorsBIT Numerical Mathematics, 1983
- Disk allocation for Cartesian product files on multiple-disk systemsACM Transactions on Database Systems, 1982
- Reducing block accesses in inverted files by partial clusteringInformation Systems, 1980
- Concepts and capabilities of a database computer\ACM Transactions on Database Systems, 1978
- Fast parallel sorting algorithmsCommunications of the ACM, 1978
- Optimal Sorting Algorithms for Parallel ComputersIEEE Transactions on Computers, 1978