Supporting the hypercube programming model on mesh architectures
- 1 June 1992
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 148-157
- https://doi.org/10.1145/140901.140917
Abstract
Many combinatorial problems have simple solutions for parallel processing on highly-connected networks such as the butterfly or the hypercube, whereas the fastest processor-to-processor intercon- nections are realized in parallel machines with low dimensional mesh or torus topology. This paper presents a method for map- ping binary hypercube- algorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evalu- ate graph embeddings for mapping supercomputer communication networks. Our work was motivated by the need for fast library routines to do parallel sorting, fast Fouriertransformation and processor syn- chronization. During the design effort of these building blocks, we developed and analyzed a new technique to support a hypercube network embedded onto a two dimensional torus. A direct imple- mentation of the embedding is made possible by logical channels and pathways. A fast merge sorter based on the bitonic network serves as an example to show how a simple hypercube algorithm can outperform most of the asymptotically optimal mesh algorithms for practical machine sizes. In the conventional mesh computation model, processors are allowed to exchange one unit of data with a neighbor in each step. This model needs to be refined since modern mesh computers, such as the iWarp system, have hardware support for fast non-neighbor communication. The bitonic merge sort, a simple hypercube algorithm, contains a fair amount of fine grain parallelism not found in standard mesh algorithms. This form of parallelism includes pipelined communi- cation, computation overlapped with communication, use of wide instruction words and operands directly read from the communica- tion system through systolic gates. The measured sorting rate of more than 2 106 keys/sec on an well with much larger parallel computers. In our analysis of the relative performancewe compare our approach to different sorting methods on meshes. The mapped hypercube algorithm is shown to be best for a wide range of machine and problem sizes. For the readers mainly interested in complexity results, our ap- proach may seem somewhat surprising, but the analysis of the algo- rithm in an accurate model for the iWarp machine shows how good speed and good parallel efficiency is obtained from both forms of parallelism, large and fine grain.Keywords
This publication has 12 references indexed in Scilit:
- A comparison of sorting algorithms for the connection machine CM-2Published by Association for Computing Machinery (ACM) ,1991
- Improved sorting networks withO(logN) depthAlgorithmica, 1990
- Supporting systolic and memory communication in iWarpPublished by Association for Computing Machinery (ACM) ,1990
- Optimal and Sublogarithmic Time Randomized Parallel Sorting AlgorithmsSIAM Journal on Computing, 1989
- Communication in iWarp systemsPublished by Association for Computing Machinery (ACM) ,1989
- The Connection Machine model CM-1 architectureIEEE Transactions on Systems, Man, and Cybernetics, 1989
- An optimal sorting algorithm for mesh connected computersPublished by Association for Computing Machinery (ACM) ,1986
- An 0(n log n) sorting networkPublished by Association for Computing Machinery (ACM) ,1983
- Sorting on a mesh-connected parallel computerCommunications of the ACM, 1977
- Sorting networks and their applicationsPublished by Association for Computing Machinery (ACM) ,1968