Notes on Reliability Models for Non-MDS Erasure Codes

We discuss two variations on the standard model for determining the reliability (or Mean Time To Data Loss, MTTDL) of storage arrays with erasure codes. The standard model assumes the erasure code is MDS and has a certain erasure fault tolerance t. The “Hamming fault tolerance” t is one less than the Hamming distance of the code. Such codes can tolerate all instances of k failures for k ≤ t but no instances of more than t failures. The first variation extends the model to non-MDS codes that have resilience to some (but not all) instances of failures that exceed the Hamming fault tolerance. We say such codes have “elastic” fault tolerance. We apply this model to LDPC and WEAVER codes, which have high average fault tolerance but very low Hamming fault tolerance. A second application of this model is to the case of multiple instances of arrays, each with an independent MDS code. The standard model also assumes that rebuild occurs incrementally, that is, one disk at a time. We vary the model to better reflect some actual systems where rebuild is done in parallel on all failed disks.
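To make the distinction between Hamming and elastic fault tolerance concrete, the following is a minimal sketch of a Markov-chain MTTDL calculation. It is not taken from the report: the function name `mttdl`, the parameters `n`, `lam`, `mu`, and the list `survive` are illustrative assumptions. The sketch assumes independent disk failures at a constant rate, incremental rebuild (one disk at a time) at a constant rate, and a list of conditional survival probabilities that encodes the fault tolerance. An MDS code with tolerance t corresponds to t entries equal to 1.0; an elastic code adds fractional entries beyond its Hamming fault tolerance.

```python
def mttdl(n, lam, mu, survive):
    """Expected time to data loss, starting with all n disks healthy.

    Assumed model (illustrative, not the report's exact formulation):
      n       -- number of disks in the array
      lam     -- failure rate of a single disk
      mu      -- rebuild rate; one failed disk is rebuilt at a time
      survive -- survive[k] is the assumed probability that data is still
                 recoverable after one more failure, given k disks are
                 already failed; any failure beyond len(survive) failed
                 disks is treated as fatal
    """
    m = len(survive)  # largest number of simultaneous failures survivable
    if m >= n:
        raise ValueError("cannot tolerate as many failures as there are disks")

    # Backward sweep over states k = m, ..., 0, writing the expected time
    # to loss as T_k = alpha_k + beta_k * T_{k-1}.
    alpha, beta = 0.0, 0.0
    for k in range(m, -1, -1):
        a = (n - k) * lam                  # rate of the next disk failure
        b = mu if k > 0 else 0.0           # rebuild rate (none in state 0)
        s = survive[k] if k < m else 0.0   # chance the next failure is survivable
        d = a + b - a * s * beta
        alpha = (1.0 + a * s * alpha) / d
        beta = b / d
    return alpha  # beta is 0 in state 0, so T_0 = alpha_0


if __name__ == "__main__":
    n, lam, mu = 8, 1e-5, 0.1  # hypothetical values, rates per hour
    print("MDS, t = 1:       ", mttdl(n, lam, mu, [1.0]))
    print("elastic beyond t=1:", mttdl(n, lam, mu, [1.0, 0.5]))
```

With `survive = [1.0]` the sketch reduces to the familiar single-fault-tolerant closed form ((2n - 1)λ + μ) / (n(n - 1)λ²). Modeling parallel rebuild would require changing the repair transitions (for example, returning directly to the all-healthy state), which this sketch does not attempt.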

By: James Lee Hafner; KK Rao

Published in: RJ10391 in 2006

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
