Adaptive Diagnosis in Distributed Systems

Copyright © (2005) by IEEE. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state via active selection of only a small number of the most informative tests. We demonstrate empirically that the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed for localizing the problem when compared with non-adaptive (pre-planned) probing schemes. We also provide some theoretical results on the complexity of probe selection and on the effect of “noisy” probes on the accuracy of diagnosis. Finally, we discuss how to model the system’s dynamics using dynamic Bayesian networks, and an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over “static” techniques that do not handle changes in the system.
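To make the idea of actively selecting the most informative probe concrete, the following is a minimal Python sketch and not the algorithm from the paper. It rests on simplifying assumptions that the abstract does not state: at most one faulty node, noise-free probes, a uniform prior, and a made-up four-node topology. All names (`PROBES`, `active_probing`, `run_probe`, and so on) are hypothetical. At each step the loop picks the probe with the lowest expected posterior entropy, which is the same as the highest expected information gain, runs it, and updates the belief over system states.

```python
import math

def entropy(belief):
    """Shannon entropy in bits of a distribution given as {state: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def condition(belief, probe_nodes, failed):
    """Bayes update for a noise-free probe; returns (posterior, P(outcome))."""
    # Assumption: a probe fails exactly when the faulty node lies on its path.
    consistent = {s: p for s, p in belief.items() if (s in probe_nodes) == failed}
    mass = sum(consistent.values())
    if mass == 0:
        return {}, 0.0
    return {s: p / mass for s, p in consistent.items()}, mass

def expected_entropy(belief, probe_nodes):
    """Expected entropy of the belief after observing this probe's outcome."""
    total = 0.0
    for failed in (True, False):
        posterior, mass = condition(belief, probe_nodes, failed)
        if mass > 0:
            total += mass * entropy(posterior)
    return total

def active_probing(probes, run_probe, states):
    """Greedily run the most informative probe until the state is identified."""
    belief = {s: 1.0 / len(states) for s in states}  # assumed uniform prior
    used = []
    while entropy(belief) > 1e-12:
        best = min(probes, key=lambda name: expected_entropy(belief, probes[name]))
        gain = entropy(belief) - expected_entropy(belief, probes[best])
        if gain < 1e-12:
            break  # no remaining probe can distinguish the candidate states
        belief, _ = condition(belief, probes[best], run_probe(best))
        used.append(best)
    return belief, used

# Hypothetical topology: each probe maps to the set of nodes it traverses.
PROBES = {
    "p1": {"A", "B"},
    "p2": {"B", "C"},
    "p3": {"C", "D"},
    "p4": {"A"},
}
STATES = ["ok", "A", "B", "C", "D"]  # "ok" means no node is faulty
TRUE_FAULT = "C"  # hidden from the diagnoser; used only to simulate outcomes

def run_probe(name):
    """Simulated probe: returns True if the probe fails."""
    return TRUE_FAULT in PROBES[name]

belief, used = active_probing(PROBES, run_probe, STATES)
print("probes run:", used)
print("diagnosis:", max(belief, key=belief.get))
```

In this toy setting the loop localizes the fault without running every available probe, whereas a pre-planned scheme would typically run its whole fixed probe set regardless of earlier outcomes. Handling noisy probes or multiple simultaneous faults would require a richer probabilistic model than this sketch uses.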

By: Irina Rish; Mark Brodie; Sheng Ma; Natalia Odintsova; Alina Beygelzimer; Genady Grabarnik; Karina Hernandez

Published in: IEEE Transactions on Neural Networks, volume 16, no. 5, pages 1088-1109, 2005


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

