Autonomic Computing Features for Large-Scale Server Management and Control

A computer system satisfies the requirements of “autonomic computing” if it can configure and reconfigure itself by knowing its operating environment, and protect and heal itself from various failures or malfunctions. To know its environment and detect failures, an autonomic system needs the capability to acquire information through self-monitoring. Once the sequence of events leading to a series of disasters has been figured out, the system must predict and control the system management process through a number of automated learning and proactive actions. In this paper, we address cluster system RAS (Reliability, Availability, and Serviceability) by analyzing a realistic system event log history collected from a 250-node large-scale cluster. Based on the analysis of these events through a number of machine-learning and artificial intelligence techniques, we have established self-management and control. While time-series methods can be used effectively for predicting system performance parameters, rule-based classification algorithms were effectively implemented to predict future critical events with up to 70% accuracy. Bayesian-network-based algorithms can be used for root-cause analysis through adaptive probing and establishing probe managers. We also cover some of the ongoing efforts to provide an online prediction and control mechanism through a hybrid model combining selected artificial intelligence and machine learning techniques, including active probing and triggers.

By: Ramendra K. Sahoo, Irina Rish, A.J. Oliner, Manish Gupta, Jose E. Moreira, Sheng Ma, Ricardo Vilalta, A. Sivasubramaniam

Published in: RC22830 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

RC22830.pdf

Questions about this service can be mailed to reports@us.ibm.com.