Failure Management for Large Scaled Systems

Large-scale systems experience relatively frequent failures, which degrade system productivity. Each failure forces job restarts and creates work for administrators, such as problem determination. Job restarts reduce the effective performance of the system, and the added administrative work raises maintenance cost. Further improvement in the productivity of large-scale systems therefore depends on how well failures are managed. The goal of this study is to improve the productivity of large-scale systems from the system management point of view, that is, in terms of effective performance and administration cost. Effective performance is measured by average job slowdown and system utilization.
This document describes failure prevention as a practical technique for improving effective performance, and an approach to problem determination in large-scale systems as a means of lowering administration cost. Failure prevention means taking proactive actions based on fault prediction and anomaly detection, which helps avoid the performance degradation caused by faults. The cost of problem determination is another impact of failures. It should be lowered by advanced tools that give administrators useful information based on data collected from the system. In addition, this document describes a data analysis platform that serves as a common infrastructure for the analyses required by the failure prevention and problem determination tools.
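
The two metrics named above, average job slowdown and system utilization, are not defined in this abstract. The sketch below assumes the definitions commonly used in job scheduling: slowdown is turnaround time divided by run time, and utilization is busy node-seconds divided by available node-seconds over an observation window. The `Job` record, its field names, and the function names are hypothetical and chosen only for illustration; the report itself may define the metrics differently.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; all times are in seconds.
    submit: float  # time the job entered the queue
    start: float   # time the job began running
    end: float     # time the job finished
    nodes: int     # number of nodes held while running


def slowdown(job: Job) -> float:
    """Assumed definition: turnaround time divided by run time."""
    run_time = job.end - job.start
    if run_time <= 0:
        raise ValueError("run time must be positive")
    return (job.end - job.submit) / run_time


def average_slowdown(jobs: list[Job]) -> float:
    """Mean slowdown over a non-empty list of jobs."""
    return sum(slowdown(j) for j in jobs) / len(jobs)


def utilization(jobs: list[Job], total_nodes: int,
                window_start: float, window_end: float) -> float:
    """Assumed definition: busy node-seconds over available node-seconds."""
    busy = 0.0
    for j in jobs:
        # Count only the part of the run that falls inside the window.
        overlap = min(j.end, window_end) - max(j.start, window_start)
        if overlap > 0:
            busy += j.nodes * overlap
    return busy / (total_nodes * (window_end - window_start))


# Two jobs on an 8-node system observed for 150 seconds.
jobs = [Job(submit=0, start=0, end=100, nodes=4),
        Job(submit=0, start=100, end=150, nodes=8)]
avg = average_slowdown(jobs)
util = utilization(jobs, total_nodes=8, window_start=0, window_end=150)
```

Under these assumed definitions, a job restarted after a failure finishes later than it otherwise would, so its slowdown rises. That is one way to express the effect of job restarts on performance that the abstract describes.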

By: Hideki Tai; Takayuki Kushida

Published in: RT0705 in 2007


