Failure diagnosis using decision trees

10 June 2004

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 36-43
https://doi.org/10.1109/icac.2004.1301345

Abstract

We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay, and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We discuss some results in applying simplified decision trees on eBay's production site for several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our contributions include the statistical learning approach, the adaptation of decision trees to the context of failure diagnosis, and the deployment and evaluation of our tools on a high-volume production service.
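As a rough illustration of the idea in the abstract (train a tree on request traces, then rank root-to-leaf paths by how strongly they are associated with failure), the following is a minimal sketch and not the authors' implementation. It assumes a simple entropy-based tree over categorical features; the feature names (`host`, `version`), the synthetic requests, and the ranking rule (failed-request count at majority-failure leaves) are all hypothetical choices made for this example. The node-merging step based on the partial order of system components is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of boolean failure labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_feature(rows, features):
    """Return the feature with the largest information gain, or None if no gain."""
    base = entropy([r["failed"] for r in rows])

    def gain(feature):
        groups = {}
        for r in rows:
            groups.setdefault(r[feature], []).append(r["failed"])
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return base - remainder

    best = max(features, key=gain)
    return best if gain(best) > 0 else None

def failure_paths(rows, features, path=()):
    """Grow the tree recursively; return (path, failures, total) for every leaf."""
    feature = best_feature(rows, features) if features else None
    if feature is None:
        return [(path, sum(r["failed"] for r in rows), len(rows))]
    leaves = []
    remaining = [f for f in features if f != feature]
    for value in sorted({r[feature] for r in rows}):
        subset = [r for r in rows if r[feature] == value]
        leaves += failure_paths(subset, remaining, path + ((feature, value),))
    return leaves

def rank_paths(rows, features):
    """Keep leaves where most requests failed, ordered by failed-request count."""
    leaves = failure_paths(rows, features)
    suspicious = [leaf for leaf in leaves if leaf[1] * 2 > leaf[2]]
    return sorted(suspicious, key=lambda leaf: leaf[1], reverse=True)

def make_rows(host, version, failed, count):
    """Build `count` identical synthetic request records (hypothetical schema)."""
    return [{"host": host, "version": version, "failed": failed}] * count

# Synthetic trace: failures on (h1, v2) and on every request served by h3.
rows = (
    make_rows("h1", "v2", True, 8)
    + make_rows("h1", "v1", False, 10)
    + make_rows("h2", "v1", False, 10)
    + make_rows("h2", "v2", False, 5)
    + make_rows("h3", "v1", True, 3)
)

ranked = rank_paths(rows, ["host", "version"])
for path, failures, total in ranked:
    conditions = " AND ".join(f"{name}={value}" for name, value in path)
    print(f"{conditions}: {failures}/{total} requests failed")
```

Each ranked path reads as a conjunction of request properties, which is what makes a tree convenient for diagnosis: an operator can inspect the top few conjunctions instead of hundreds of individual candidate causes.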
