Autonomic Computing Features for Large-Scale Server Management and Control

A computer system satisfies the requirements of “autonomic computing” if it can configure and reconfigure itself based on knowledge of its operating environment, and can protect itself against and heal itself from failures or malfunctions. To know its environment and detect failures, an autonomic system needs the capability to acquire this information through self-monitoring. Once the sequence of events leading to a series of failures has been identified, the system management process must be predicted and controlled through automated learning and proactive actions. In this paper, we address cluster system RAS (Reliability, Availability, and Serviceability) by analyzing a real system event log history collected from a 250-node large-scale cluster. Based on an analysis of these events using a number of machine-learning and artificial intelligence techniques, we establish self-management and control for the cluster. While time-series methods can be used effectively to predict system performance parameters, rule-based classification algorithms can be used to predict future critical events with up to 70% accuracy. Bayesian network based algorithms can be used for root-cause analysis through adaptive probing and the establishment of probe managers. We also cover some of the ongoing efforts to provide an online prediction and control mechanism through a hybrid model that combines selected artificial intelligence and machine learning techniques, including active probing and triggers.

By: Ramendra K. Sahoo, Irina Rish, A.J. Oliner, Manish Gupta, Jose E. Moreira, Sheng Ma, Ricardo Vilalta, A. Sivasubramaniam

Published in: RC22830 in 2003


