Failure Management for Large Scaled Systems

Large-scale systems experience relatively frequent failures, which degrade their productivity. Each failure forces job restarts and creates work for administrators, such as problem determination. Job restarts reduce the substantial performance of the system, and the added administrative work raises maintenance cost. Further gains in the productivity of large-scale systems therefore depend on how well failures are managed. The goal of this study is to improve the productivity of large-scale systems from the system management point of view, that is, in terms of substantial performance and administration cost. Substantial performance is measured by average job slowdown and system utilization.
This document describes failure prevention as a practical technique for improving substantial performance, and an approach to problem determination in large-scale systems as a means of lowering administration cost. Failure prevention means taking proactive actions based on fault prediction and anomaly detection, which helps avoid the performance degradation caused by faults. The cost of problem determination is another consequence of failures. It should be lowered by advanced tools that give administrators useful information derived from the data collected from the system. In addition, this document describes a data analysis platform that serves as common infrastructure for the analyses required by the failure prevention and problem determination tools.
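
The two measures of substantial performance named above can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions: slowdown is taken as (wait time + run time) / run time, a common convention in job scheduling studies, and utilization as the fraction of available node-seconds spent running jobs. The Job fields and function names are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float  # time the job entered the queue (seconds)
    start: float   # time the job began running
    end: float     # time the job finished
    nodes: int     # number of nodes the job occupied


def average_slowdown(jobs):
    """Mean of (wait + run) / run over all jobs (assumed definition)."""
    ratios = [(j.end - j.submit) / (j.end - j.start) for j in jobs]
    return sum(ratios) / len(ratios)


def utilization(jobs, total_nodes, horizon):
    """Fraction of available node-seconds spent running jobs (assumed definition)."""
    busy = sum((j.end - j.start) * j.nodes for j in jobs)
    return busy / (total_nodes * horizon)


# Two jobs on a hypothetical 8-node system observed for 150 seconds.
jobs = [
    Job(submit=0, start=0, end=100, nodes=4),    # ran immediately
    Job(submit=0, start=100, end=150, nodes=8),  # waited 100 s, ran 50 s
]

print(average_slowdown(jobs))     # (1.0 + 3.0) / 2
print(utilization(jobs, 8, 150))  # 800 busy node-seconds out of 1200
```

Under these assumed definitions, a job that must be restarted after a failure finishes later, which raises its slowdown, while the node-seconds consumed by the failed attempt produce no completed work.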

By: Hideki Tai; Takayuki Kushida

Published in: RT0705 in 2007

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
