Quantitative Study of the Performance and Reliability of a Resilient 3-D Mesh-Based Server

The task of managing large servers is becoming ever more complex as systems grow. Human error in reconfiguring and maintaining systems is a dominant cause of outages and data loss. In this paper we present an analysis of a system that is architected around an alternative service model, called "Fail-in-Place" or "Deferred Maintenance".

Such a system is either over-provisioned with resources from the start or assumes that new resources are added in a timely manner. Delaying service actions, possibly for the entire lifetime of the system, simplifies management of the system.

In this paper we study the effects of progressive resource failures on the capacity, performance, and reliability of a 3-D, mesh-based cube. We assume that the cube is used to store data in a distributed manner across multiple bricks, using an arbitrary data redundancy algorithm. A prototype of such a system is currently being built by our team.
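To make the notion of progressive brick failures concrete, the following is a minimal sketch and not the model used in the report. It assumes an n x n x n mesh in which each brick links only to its six face neighbors, fails a random fraction of bricks, and reports what share of the surviving bricks still belongs to the largest connected group. The function name, the uniform random failure pattern, and the choice of metric are all illustrative assumptions.

```python
import random
from collections import deque

# Hypothetical helper, not taken from the report.
def largest_component_fraction(n, failed_fraction, seed=0):
    """Fail a random fraction of bricks in an n x n x n mesh and return the
    share of surviving bricks that lie in the largest connected group."""
    rng = random.Random(seed)
    bricks = [(x, y, z) for x in range(n) for y in range(n) for z in range(n)]
    num_failed = round(failed_fraction * len(bricks))
    alive = set(bricks) - set(rng.sample(bricks, num_failed))
    if not alive:
        return 0.0

    # Assumption: a brick talks only to its six face neighbors.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    unvisited = set(alive)
    best = 0
    while unvisited:
        # Breadth-first search over one connected group of surviving bricks.
        start = unvisited.pop()
        queue = deque([start])
        size = 1
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in steps:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    queue.append(neighbor)
                    size += 1
        best = max(best, size)
    return best / len(alive)


if __name__ == "__main__":
    for fraction in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(fraction, largest_component_fraction(8, fraction))
```

Sweeping the failed fraction in this way shows how connectivity among survivors changes as failures accumulate. It says nothing about bandwidth, data redundancy, or access from external hosts, which the report treats separately.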

We quantify what percentage of the original bricks may fail before the system becomes unusable. We also quantify the degradation of network performance both within the cube and to external hosts as a function of brick failures. The results show that a 3-D mesh-based cube can remain operational until about 40% of the bricks fail, assuming sufficient inter-brick network bandwidth and sufficient connectivity to a cube's surface by external hosts or clients. Finally, we quantify the reliability of the system as a function of brick failure rates. By building bricks based on commodity parts, one can design a 3-D mesh-based cube that should remain operational without maintenance for 3 to 7 years with 11% to 25% brick over-provisioning.
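The relationship between brick failure rate, service-free lifetime, and over-provisioning can be illustrated with a back-of-the-envelope calculation. This is a sketch under stated assumptions and not the reliability model of the report: bricks are assumed to fail independently at a constant rate, so the expected surviving fraction after t years is exp(-rate * t). Keeping the expected number of survivors at or above the required count then needs an over-provisioning of exp(rate * t) - 1. The function name and the example rate are hypothetical.

```python
import math

# Hypothetical helper; assumes independent bricks with a constant failure rate.
def overprovisioning_needed(annual_failure_rate, years):
    """Extra bricks, as a fraction of the required count, so that the
    expected number of surviving bricks after `years` still meets the need."""
    return math.exp(annual_failure_rate * years) - 1.0


if __name__ == "__main__":
    rate = 0.03  # illustrative failure rate per brick per year, not from the report
    for years in (3, 5, 7):
        print(years, overprovisioning_needed(rate, years))
```

The estimate covers expected survivors only. A full reliability figure would also have to account for the variance in failures and for where in the mesh the failed bricks sit.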

By: Claudio Fleiner; Deepak R. Kenchammana Hosekote; Robert B. Garner; Winfried Wilcke

Published in: RJ10308 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rj10308.pdf
