Pinpoint: problem determination in large, dynamic Internet services
- 25 June 2003
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 595-604
- https://doi.org/10.1109/dsn.2002.1029005
Abstract
Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as e-commerce systems. We present a dynamic analysis methodology that automates problem determination in these environments by 1) coarse-grained tagging of numerous real client requests as they travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false positives.
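The two-step idea in the abstract (tag each request with the components it touches, then correlate request outcomes with components) can be illustrated with a minimal Python sketch. This is not Pinpoint's implementation: the abstract does not name the data mining technique, so the Jaccard-style overlap score below is an assumption chosen for illustration, and the function name `rank_components`, the trace format, and the component names are all hypothetical.

```python
def rank_components(traces):
    """Rank components by how closely their usage overlaps with failed requests.

    `traces` is a sequence of (components, failed) pairs, one per request:
    `components` is the set of component names the request touched and
    `failed` is the believed outcome. (Hypothetical format, for illustration.)
    """
    traces = list(traces)
    failed_ids = {i for i, (_, failed) in enumerate(traces) if failed}

    # Map each component to the set of request ids that touched it.
    touched = {}
    for i, (components, _) in enumerate(traces):
        for name in components:
            touched.setdefault(name, set()).add(i)

    # Assumed scoring rule: Jaccard overlap between "requests that used this
    # component" and "requests that failed". 1.0 means a perfect match.
    scores = {
        name: len(ids & failed_ids) / len(ids | failed_ids)
        for name, ids in touched.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical traces: every request that used "cart" failed.
traces = [
    ({"frontend", "cart", "db"}, True),
    ({"frontend", "cart"}, True),
    ({"frontend", "catalog", "db"}, False),
    ({"frontend", "catalog"}, False),
]
ranking = rank_components(traces)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the hypothetical `cart` component ranks first because it appears in every failed request and in no successful one, while `frontend`, which every request touches, scores lower. Note that the sketch needs only observed traces and outcomes, with no static dependency model.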