Efficient fault diagnosis using probing

In this paper, we address the problem of efficient diagnosis in real-time systems capable of on-line information gathering, such as sending ”probes” (i.e., test transactions, such as "traceroute" or "ping") in order to identify network faults and evaluate performance of distributed computer systems. We use a Bayesian network to model probabilistic relations between the problems (faults, performance degradation) and symptoms (probe outcomes). Due to intractability of exact probabilistic inference in large systems, we investigated approximation techniques, such as a local-inference scheme called mini-buckets(Dechter & Rish 1997). Our empirical study demonstrates advantages of local approximations for large diagnostic problems: the approximation is very efficient and "degrades gracefully" with noise; also, the approximation error gets smaller on networks with higher confidence (probability) of the exact diagnosis. Since the accuracy of diagnosis depends on how much information the probes can provide about the system states, the second part of our work is focused on the probe selection task. Small probe sets are desirable in order to minimize the costs imposed by probing, such as additional network load and data management requirements. Our results show that, although finding the optimal collection of probes is expensive for large networks, efficient approximation algorithms can be used to find a nearly-optimal set.

By: Irina Rish, Mark A. Brodie, Sheng Ma

Published in: RC22369 in 2001

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RC22369.pdf

Questions about this service can be mailed to reports@us.ibm.com .