On the use and implementation of message logging
- 17 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 298-307
- https://doi.org/10.1109/ftcs.1994.315630
Abstract
Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. On the contrary, we present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise.

We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection, and reduced complexity. Meanwhile, the new protocols retain the advantages of the conventional message logging protocols with respect to output commit.

Finally, we discuss three "lessons learned" from an implementation of various message logging protocols. First, during output commit, only the dependency information for the messages in the log needs to be written to the stable storage. It is not necessary to write the message data to stable storage, leading to faster output commit. Second, the use of copy-on-write in the implementation of message logging substantially reduces the logging overhead for communication-intensive programs. Finally, we provide quantitative evidence supporting previous qualitative claims about the superiority of sender-based message logging over receiver-based logging.
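
The first lesson above (flushing only dependency information at output commit) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not describe its data structures, so `LogEntry`, `Process`, the sequence-number fields, and the in-memory list standing in for stable storage are all hypothetical. The sketch also assumes that message data can be recovered some other way (for example, from a copy the sender keeps in volatile memory), which is why the payload never needs to reach stable storage.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    sender: int   # id of the sending process (hypothetical field)
    ssn: int      # send sequence number assigned by the sender
    rsn: int      # receive sequence number: delivery order at the receiver
    data: bytes   # message payload, held only in the volatile log


@dataclass
class Process:
    volatile_log: list = field(default_factory=list)
    stable_storage: list = field(default_factory=list)  # stand-in for disk
    flushed: int = 0  # entries whose dependency info is already stable

    def deliver(self, sender: int, ssn: int, data: bytes) -> None:
        # Record the delivery order together with the payload in memory.
        rsn = len(self.volatile_log)
        self.volatile_log.append(LogEntry(sender, ssn, rsn, data))

    def output_commit(self) -> int:
        # Before output is released, make the dependency information
        # (sender, ssn, rsn) of all unflushed entries stable.
        # The payloads themselves are deliberately not written.
        pending = self.volatile_log[self.flushed:]
        self.stable_storage.extend((e.sender, e.ssn, e.rsn) for e in pending)
        self.flushed = len(self.volatile_log)
        # Return the number of payload bytes that were NOT written.
        return sum(len(e.data) for e in pending)
```

In this sketch the cost of an output commit grows with the number of logged messages rather than with their total size, which is the intuition behind the faster output commit the abstract reports.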