Failure Characterization of the NFS Using Fault Injection

        This paper studies the failure characteristics of the Network File System (NFS) using fault injection. The experimental setup consists of a number of clients making file system requests to a server, simulating an engineering/scientific workload. The goal of the experiment is to characterize NFS server failures under software faults typical of file systems. The fault model used for the fault injection is derived from programming errors observed in field failure data. The failure probabilities of the server and of client applications are estimated through a series of fault-injection experiments. The experimental setup is also instrumented to detect error propagation. The paper finds that error propagation in the system decreases with increasing workload. Intuitively, one might expect increased load to cause increased error propagation. On careful examination, however, it becomes evident that increased load shortens error latency, and consequently the change in error propagation with increased load, although counter-intuitive, is indeed consistent. To the best of our knowledge, these error propagation measurements represent the first such result. The result is significant because error propagation affects the overall ability of a system to recover from an error, and it has strong implications for recovery design, especially under low-load conditions. The paper also suggests areas for improvement in the server code.

By: M. Devarakonda, K. Goswami and R. Chillarege

Published in: RC16342 in 1990

