Low-Overhead Fault Tolerance for High-Throughput Data Processing Systems

1 June 2011

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 689-699
https://doi.org/10.1109/icdcs.2011.29

Abstract

The MapReduce programming paradigm has proved to be a useful approach for building highly scalable data processing systems. One important reason for its success is simplicity, including the simplicity of its fault tolerance mechanisms. However, this simplicity comes at a price: efficiency. MapReduce's fault tolerance scheme stores too much intermediate information on disk. This inefficiency negatively affects job completion time. In particular, it rules out the application of MapReduce in near real-time scenarios where jobs need to produce results quickly. In this paper, we discuss an alternative fault tolerance scheme that is inspired by virtual synchrony. The key feature of our approach is a low-overhead deterministic execution. Deterministic execution reduces the amount of persistently stored information. In addition, because persisting intermediate results is no longer required for fault tolerance, we use more efficient communication techniques that considerably improve job completion time and throughput. Our contribution is twofold: (i) we enable the use of MapReduce for jobs ranging from seconds to a few tens of seconds, satisfying these deadlines even in the case of failures, and (ii) we considerably reduce the fault tolerance overhead and as such the overhead of MapReduce in general. Our modifications are transparent to the application.

This publication has 12 references indexed in Scilit:

Multithreading-Enabled Active Replication for Event Stream Processing Operators
Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
Minimizing Latency in Fault-Tolerant Distributed Stream Processing Systems
Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
Deterministic Replay for Transparent Recovery in Component-Oriented Middleware
Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
MapReduce
Communications of the ACM, 2008
Active replication of multithreaded applications
IEEE Transactions on Parallel and Distributed Systems, 2006
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys, 2002
Efficient atomic broadcast using deterministic merge
Published by Association for Computing Machinery (ACM), 2000
Deterministic scheduling for transactional multithreaded replicas
Published by Institute of Electrical and Electronics Engineers (IEEE), 2000
Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys, 1990
Exploiting virtual synchrony in distributed systems
Published by Association for Computing Machinery (ACM), 1987