Application-bypass broadcast in MPICH over GM

Abstract
Processes of a parallel program can become unsynchronized, or skewed, during the course of running an application. Processes can become skewed as a result of unbalanced or asymmetric rode, or through the use of heterogeneous systems, where nodes in the system have different performance characteristics, as well as random, unpredictable effects such as the processes not being started at exactly the same time, or processors receiving interrupts during computation. Geographically distributed systems may have more severe skew because of variable communication times. Such skew can have a significant impact on the performance of collective communication operations which impose an implicit synchronization. The broadcast operation in MPICH is one such operation. An application-bypass broadcast operation is one which does not depend on the application running at a process to make progress. Such an operation would not be as sensitive to process skew. This paper describes the design and implementation of an application-bypass broadcast operation. We evaluated the implementation and find a factor of improvement of up to 16 for application-bypass broadcast compared to non-application-bypass broadcast when processes are skewed. Furthermore we see that as the system size increases, the effects of skew on non-application-bypass broadcast also increase. The application-bypass broadcast is much less sensitive to process skew which makes it more scalable than the non-application-bypass broadcast operation.

This publication has 8 references indexed in Scilit: