A parallel hash join algorithm for managing data skew

1 December 1993

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems

Vol. 4 (12) , 1355-1371
https://doi.org/10.1109/71.250117

Abstract

Presents a parallel hash join algorithm that is based on the concept of hierarchical hashing, to address the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between these two. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a join phase which is CPU-bound, and a shared nothing environment, the algorithm is shown to achieve good join phase load balancing, and to be robust relative to the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared to some existing parallel hash join methods. The proposed method does considerably better in high skew situations.<>

Keywords

This publication has 25 references indexed in Scilit:

A parallel sort merge join algorithm for managing data skew
IEEE Transactions on Parallel and Distributed Systems, 1993
Effectiveness of parallel joins
IEEE Transactions on Knowledge and Data Engineering, 1990
The Gamma database machine project
IEEE Transactions on Knowledge and Data Engineering, 1990
Prototyping Bubba, a highly parallel database system
IEEE Transactions on Knowledge and Data Engineering, 1990
Join and Semijoin Algorithms for a Multiprocessor Database Machine
ACM Transactions on Database Systems, 1984
Estimating record selectivities
Information Systems, 1983
Application of hash to data base machine and its architecture
New Generation Computing, 1983
Performance Modeling of the DBMAC Architecture
Published by Springer Nature ,1983
An Application of Bin-Packing to Multiprocessor Scheduling
SIAM Journal on Computing, 1978
Bounds on Multiprocessing Timing Anomalies
SIAM Journal on Applied Mathematics, 1969