Design and performance of multipath MIN architectures
- 1 June 1992
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 286-295
- https://doi.org/10.1145/140901.141890
Abstract
In this paper, we discuss the use of multipath multistage intercon- nection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with ei- ther the property of expansion or maximal-fanout. We present an lower time bound for a worst-case permutation on deter- ministic maximal-fanout networks. We further show how a ran- domized approach to maximal-fanout avoids the regularity from which this worst case arises. Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. In reconfiguring a faulty system, a naive approach is to preserve processing power by maximizing the number of processing nodes left in operation. However, our results show that the synchronization requirements of applications make it critical to eliminate nodes with poor net- work connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from work by Leighton and Maggs (LM92), performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture (DeH90). Simulation results for 1024 node systems demonstrate that multipath networks, reconfigured with our fault- propagation algorithm, perform well not only in theory, but also in practice. In fact, our systems suffer only a small decrease in performance from network faults; the degradation is linear in the percentage of network failure.Keywords
This publication has 9 references indexed in Scilit:
- Fast algorithms for routing around faults in multibutterflies and randomly-wired splitter networksIEEE Transactions on Computers, 1992
- The message-driven processor: a multicomputer processing node with efficient mechanismsIEEE Micro, 1992
- The impact of communication locality on large-scale multiprocessor performancePublished by Association for Computing Machinery (ACM) ,1992
- Directory-based cache coherence in large-scale multiprocessorsComputer, 1990
- On-line algorithms for path selection in a nonblocking networkPublished by Association for Computing Machinery (ACM) ,1990
- An O(logN) deterministic packet routing schemePublished by Association for Computing Machinery (ACM) ,1989
- Efficient distributed recovery using message loggingPublished by Association for Computing Machinery (ACM) ,1989
- Expanders might be practical: fast algorithms for routing around faults on multibutterfliesPublished by Institute of Electrical and Electronics Engineers (IEEE) ,1989
- Optimistic recovery in distributed systemsACM Transactions on Computer Systems, 1985