Supporting the hypercube programming model on mesh architectures

1 June 1992

conference paper
Published by Association for Computing Machinery (ACM)

p. 148-157
https://doi.org/10.1145/140901.140917

Abstract

Many combinatorial problems have simple solutions for parallel processing on highly-connected networks such as the butterfly or the hypercube, whereas the fastest processor-to-processor intercon- nections are realized in parallel machines with low dimensional mesh or torus topology. This paper presents a method for map- ping binary hypercube- algorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evalu- ate graph embeddings for mapping supercomputer communication networks. Our work was motivated by the need for fast library routines to do parallel sorting, fast Fouriertransformation and processor syn- chronization. During the design effort of these building blocks, we developed and analyzed a new technique to support a hypercube network embedded onto a two dimensional torus. A direct imple- mentation of the embedding is made possible by logical channels and pathways. A fast merge sorter based on the bitonic network serves as an example to show how a simple hypercube algorithm can outperform most of the asymptotically optimal mesh algorithms for practical machine sizes. In the conventional mesh computation model, processors are allowed to exchange one unit of data with a neighbor in each step. This model needs to be refined since modern mesh computers, such as the iWarp system, have hardware support for fast non-neighbor communication. The bitonic merge sort, a simple hypercube algorithm, contains a fair amount of fine grain parallelism not found in standard mesh algorithms. This form of parallelism includes pipelined communi- cation, computation overlapped with communication, use of wide instruction words and operands directly read from the communica- tion system through systolic gates. The measured sorting rate of more than 2 106 keys/sec on an well with much larger parallel computers. In our analysis of the relative performancewe compare our approach to different sorting methods on meshes. The mapped hypercube algorithm is shown to be best for a wide range of machine and problem sizes. For the readers mainly interested in complexity results, our ap- proach may seem somewhat surprising, but the analysis of the algo- rithm in an accurate model for the iWarp machine shows how good speed and good parallel efficiency is obtained from both forms of parallelism, large and fine grain.

Keywords

This publication has 12 references indexed in Scilit:

A comparison of sorting algorithms for the connection machine CM-2
Published by Association for Computing Machinery (ACM) ,1991
Improved sorting networks withO(logN) depth
Algorithmica, 1990
Supporting systolic and memory communication in iWarp
Published by Association for Computing Machinery (ACM) ,1990
Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms
SIAM Journal on Computing, 1989
Communication in iWarp systems
Published by Association for Computing Machinery (ACM) ,1989
The Connection Machine model CM-1 architecture
IEEE Transactions on Systems, Man, and Cybernetics, 1989
An optimal sorting algorithm for mesh connected computers
Published by Association for Computing Machinery (ACM) ,1986
An 0(n log n) sorting network
Published by Association for Computing Machinery (ACM) ,1983
Sorting on a mesh-connected parallel computer
Communications of the ACM, 1977
Sorting networks and their applications
Published by Association for Computing Machinery (ACM) ,1968