Problem determination is one of the most important tasks in managing distributed systems. Probing, at both the transaction and network levels, has been widely used to assess compliance with Service Level Agreements and to locate problems in distributed systems. However, a probing scheme that uses a fixed set of regularly scheduled probes can be expensive in terms of the number of synthetic transactions needed, especially for problem determination. This paper introduces an active probing scheme that reduces the number of probes needed. Our key idea is to divide the problem determination task into two phases. In the first phase, we use a relatively small number of fixed, regularly scheduled probes to detect that a problem has occurred. In the second phase, once a problem is detected, additional probes are issued on the fly to acquire further information until the problem is localized. We develop algorithms for selecting an optimal set of probes for problem detection and for choosing which probes to send next based on what is currently known. We demonstrate through both analysis and simulation that, compared with a non-active probing scheme, the active probing scheme can greatly reduce both the number of probes and the time needed to localize the problem.
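
The two-phase idea can be illustrated with a minimal Python sketch. It is not the report's algorithm, which the abstract does not spell out. It assumes a single faulty node, models each probe as the set of nodes it traverses, substitutes a greedy set cover (an approximation, not an optimal selection) for the detection phase, and picks each follow-up probe by how evenly it splits the remaining suspects. The names `PROBES`, `NODES`, `select_detection_probes`, `localize`, and `diagnose` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical topology: probe name -> nodes the probe passes through.
PROBES: Dict[str, FrozenSet[str]] = {
    "p1": frozenset({"A", "B"}),
    "p2": frozenset({"C", "D"}),
    "p3": frozenset({"A", "C"}),
    "p4": frozenset({"B", "D"}),
}
NODES: Set[str] = {"A", "B", "C", "D"}


def select_detection_probes(probes: Dict[str, FrozenSet[str]],
                            nodes: Set[str]) -> List[str]:
    """Phase 1 (assumed heuristic): greedy set cover over the nodes."""
    uncovered, chosen = set(nodes), []
    while uncovered:
        # Probe covering the most still-uncovered nodes; ties go to the
        # alphabetically first name because max() keeps the first maximum.
        best = max(sorted(probes), key=lambda p: len(probes[p] & uncovered))
        if not probes[best] & uncovered:
            raise ValueError("some nodes are not covered by any probe")
        chosen.append(best)
        uncovered -= probes[best]
    return chosen


def localize(probes: Dict[str, FrozenSet[str]], candidates: Set[str],
             send: Callable[[str], bool]) -> Tuple[Set[str], List[str]]:
    """Phase 2: send one probe at a time until the suspects stop shrinking."""
    candidates, sent = set(candidates), []
    while len(candidates) > 1:
        # Only probes that split the suspect set can tell us anything new.
        useful = [p for p in sorted(probes)
                  if 0 < len(probes[p] & candidates) < len(candidates)]
        if not useful:
            break  # remaining suspects are indistinguishable
        # Prefer the probe whose path holds closest to half of the suspects.
        best = min(useful,
                   key=lambda p: abs(2 * len(probes[p] & candidates)
                                     - len(candidates)))
        sent.append(best)
        if send(best):                   # success: every node on path is healthy
            candidates -= probes[best]
        else:                            # failure: the fault lies on this path
            candidates &= probes[best]
    return candidates, sent


def diagnose(probes: Dict[str, FrozenSet[str]], nodes: Set[str],
             send: Callable[[str], bool]) -> Tuple[Set[str], List[str]]:
    """Run the detection probes, then localize only if one of them failed."""
    candidates, failed = set(nodes), False
    for p in select_detection_probes(probes, nodes):
        if send(p):
            candidates -= probes[p]
        else:
            candidates &= probes[p]
            failed = True
    if not failed:
        return set(), []                 # no problem detected
    return localize(probes, candidates, send)


if __name__ == "__main__":
    # Simulated network with a single fault on node C (an assumption).
    print(diagnose(PROBES, NODES, lambda p: "C" not in PROBES[p]))
```

In this toy topology the detection phase needs only `p1` and `p2`. With a fault on node C, `p2` fails and a single additional probe, `p3`, isolates the fault, so three probes are sent where a fixed schedule would send all four.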

By: Mark A. Brodie, Irina Rish, Sheng Ma, Genady Grabarnik, Natalia V. Odintsova

Published in: RC22817 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

RC22817.pdf
