Reliability Modeling of RAID Storage Systems with Latent Errors

The reliability of disk storage systems is adversely affected by the presence of latent sector errors. Disk scrubbing and intradisk redundancy are two schemes proposed to cope with unrecoverable or latent media errors and enhance the reliability of RAID storage systems. Two recent studies have investigated the effectiveness of these schemes, but they have reached opposing conclusions. These studies were conducted using two different modeling approaches. We present a detailed investigation which reveals that this discrepancy originates from the difference in the accuracy offered by the two models. We find that one model provides quite accurate reliability results, whereas the other provides only coarse approximations which may differ by orders of magnitude from the actual values and lead to erroneous conclusions. In the process of investigating the details, merits, weaknesses, and applicability of each model, we derive enhanced models that provide reliability results that are in good agreement. We subsequently reassess the reliability results and conclusions presented in previous studies regarding the disk scrubbing and intradisk redundancy schemes.

An extended version of this report has appeared in: Proceedings of the 17th Annual Meeting of the IEEE/ACM Int'l Symp. on Modelling, Analysis and Simulation of Computer and Telecommunication Systems "MASCOTS 2009," London, UK (IEEE, September 2009), pp. 111-122.

By: Ilias Iliadis

Published in: 2009 IEEE Int'l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Piscataway, NJ, IEEE, 2009
