An Approach to Selecting Metrics for Detecting Performance Problems in Information Systems

Early detection of performance problems is essential to limit the scope and impact of performance degradations. Most commonly, performance problems are detected by applying threshold tests to a set of detection metrics. Unfortunately, the ad hoc manner in which these metrics are selected often results in false alarms, in problems going undetected until serious performance degradation has occurred, or both. To address this situation, we construct rules for metric selection based on analytic comparisons of power equations for five widely used metrics: departure counts ($D$), number in system ($L$), response times ($R$), service times ($S$), and utilizations ($U$). Examples of selection rules include: $L$ is preferred to $U$; $L$ is preferred to $R$ if the performance problem is dominated by an increase in expected arrival rates; and $R$ is preferred to $L$ if the performance problem is dominated by an increase in expected service times. These rules are assessed in the context of performance problems in the CPU and paging sub-systems of a production computer system.
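
The three example rules quoted above can be written down as a small lookup, which makes their conditional structure easier to see. The following is only an illustrative sketch: the names Problem and undominated are hypothetical, the abstract does not say how the paper formalizes the rules, and no rule is assumed beyond the three it lists.

```python
from enum import Enum


class Problem(Enum):
    """Hypothetical labels for what dominates the performance problem."""
    ARRIVAL_RATE = "increase in expected arrival rates"
    SERVICE_TIME = "increase in expected service times"


def undominated(candidates, problem=None):
    """Return the candidate metrics that no other candidate is preferred to.

    Only the three example rules from the abstract are encoded; any
    other pair of metrics is treated as not comparable (an assumption).
    """
    # Each pair reads (preferred metric, less preferred metric).
    preferences = [("L", "U")]  # L is preferred to U unconditionally
    if problem is Problem.ARRIVAL_RATE:
        preferences.append(("L", "R"))  # arrival-dominated: L over R
    elif problem is Problem.SERVICE_TIME:
        preferences.append(("R", "L"))  # service-dominated: R over L

    pool = set(candidates)
    dominated = {
        loser
        for winner, loser in preferences
        if winner in pool and loser in pool
    }
    return sorted(pool - dominated)


if __name__ == "__main__":
    for problem in (None, Problem.ARRIVAL_RATE, Problem.SERVICE_TIME):
        print(problem, undominated(["L", "R", "U"], problem))
```

When the problem type is unknown (problem=None), both L and R remain, because neither conditional rule can be applied.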

Application of these rules in practice requires additional considerations. For example, several rules depend on the type of performance problem, such as whether it is dominated by an increase in expected arrival rates. Unfortunately, such prior knowledge is rarely available. Thus, we consider hybrid tests in which multiple metrics are used so that many kinds of performance problems are covered. Additional considerations arise in queueing networks.
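
One simple way to picture a hybrid test is to apply a threshold test to each of several metrics and raise an alarm if any of them fires. The sketch below makes that assumption explicitly; the abstract does not specify how the paper combines metrics. The function names and the threshold values are hypothetical and chosen only for illustration.

```python
def exceeded(sample, thresholds):
    """Return the sorted names of metrics whose value is above its threshold.

    sample and thresholds map metric names (e.g. "L", "R") to numbers.
    Metrics missing from the sample are skipped rather than alarmed on.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in sample and sample[name] > limit
    )


def hybrid_test(sample, thresholds):
    """Alarm if any single-metric threshold test fires (assumed OR rule)."""
    return len(exceeded(sample, thresholds)) > 0


if __name__ == "__main__":
    # Made-up thresholds: number in system (L) and response time (R).
    thresholds = {"L": 10.0, "R": 2.0}
    sample = {"L": 12.0, "R": 1.5}
    print(hybrid_test(sample, thresholds), exceeded(sample, thresholds))
```

Combining L and R in this way reflects the rules quoted earlier: L is preferred for arrival-dominated problems and R for service-time-dominated ones, so testing both covers either case without knowing the problem type in advance. Testing more metrics also gives more opportunities for a false alarm, so the thresholds would need to be chosen with that in mind.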

By: Joseph L. Hellerstein

Published in: Proceedings of the 2nd International Workshop on Systems Management. Los Alamitos, CA, IEEE Computer Society Press, 1996. p. 30-9, IEEE in 1995

