A parallel sort merge join algorithm for managing data skew
- 1 January 1993
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 4 (1) , 70-86
- https://doi.org/10.1109/71.205654
Abstract
A parallel sort-merge-join algorithm which uses a divide-and-conquer approach to address the data skew problem is proposed. The proposed algorithm adds an extra, low-cost scheduling phase to the usual sort, transfer, and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution of data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase, and is shown to be very robust relative, among other things, to the degree of data skew and the total number of processors.<>Keywords
This publication has 25 references indexed in Scilit:
- Optimal buffer partitioning for the nested block join algorithmPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Comparative performance of parallel join algorithmsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environmentPublished by Association for Computing Machinery (ACM) ,1989
- Optimal allocation of multiple class resources in computer systemsPublished by Association for Computing Machinery (ACM) ,1988
- On multisystem coupling through function request shippingIEEE Transactions on Software Engineering, 1986
- Performance Evaluation of a Database System in Multiple Backend ConfigurationsPublished by Springer Nature ,1985
- The Equi-Join Operation on a Multiprocessor Database Machine: Algorithms and the Evaluation of their PerformancePublished by Springer Nature ,1985
- The Genesis of a Database ComputerComputer, 1984
- Join and Semijoin Algorithms for a Multiprocessor Database MachineACM Transactions on Database Systems, 1984
- Performance Modeling of the DBMAC ArchitecturePublished by Springer Nature ,1983