Design and performance of multipath MIN architectures

Abstract
In this paper, we discuss the use of multipath multistage intercon- nection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with ei- ther the property of expansion or maximal-fanout. We present an lower time bound for a worst-case permutation on deter- ministic maximal-fanout networks. We further show how a ran- domized approach to maximal-fanout avoids the regularity from which this worst case arises. Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. In reconfiguring a faulty system, a naive approach is to preserve processing power by maximizing the number of processing nodes left in operation. However, our results show that the synchronization requirements of applications make it critical to eliminate nodes with poor net- work connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from work by Leighton and Maggs (LM92), performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture (DeH90). Simulation results for 1024 node systems demonstrate that multipath networks, reconfigured with our fault- propagation algorithm, perform well not only in theory, but also in practice. In fact, our systems suffer only a small decrease in performance from network faults; the degradation is linear in the percentage of network failure.

This publication has 9 references indexed in Scilit: