Design and performance of multipath MIN architectures

1 June 1992

proceedings article
Published by Association for Computing Machinery (ACM)

p. 286-295
https://doi.org/10.1145/140901.141890

Abstract

In this paper, we discuss the use of multipath multistage intercon- nection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with ei- ther the property of expansion or maximal-fanout. We present an lower time bound for a worst-case permutation on deter- ministic maximal-fanout networks. We further show how a ran- domized approach to maximal-fanout avoids the regularity from which this worst case arises. Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. In reconfiguring a faulty system, a naive approach is to preserve processing power by maximizing the number of processing nodes left in operation. However, our results show that the synchronization requirements of applications make it critical to eliminate nodes with poor net- work connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from work by Leighton and Maggs (LM92), performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture (DeH90). Simulation results for 1024 node systems demonstrate that multipath networks, reconfigured with our fault- propagation algorithm, perform well not only in theory, but also in practice. In fact, our systems suffer only a small decrease in performance from network faults; the degradation is linear in the percentage of network failure.

Keywords

This publication has 9 references indexed in Scilit:

Fast algorithms for routing around faults in multibutterflies and randomly-wired splitter networks
IEEE Transactions on Computers, 1992
The message-driven processor: a multicomputer processing node with efficient mechanisms
IEEE Micro, 1992
The impact of communication locality on large-scale multiprocessor performance
Published by Association for Computing Machinery (ACM) ,1992
Directory-based cache coherence in large-scale multiprocessors
Computer, 1990
On-line algorithms for path selection in a nonblocking network
Published by Association for Computing Machinery (ACM) ,1990
An O(logN) deterministic packet routing scheme
Published by Association for Computing Machinery (ACM) ,1989
Efficient distributed recovery using message logging
Published by Association for Computing Machinery (ACM) ,1989
Expanders might be practical: fast algorithms for routing around faults on multibutterflies
Published by Institute of Electrical and Electronics Engineers (IEEE) ,1989
Optimistic recovery in distributed systems
ACM Transactions on Computer Systems, 1985