A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems

In this paper, we describe a new scheme for checkpointing parallel
applications on message-passing scalable distributed memory systems.
The novelty of our scheme is that a checkpointed application can be
restored, from its checkpointed state, in a reconfigured form. Thus, a
parallel application may be checkpointed while executing with $t_1$
tasks on $p_1$ processors, and then restarted from the checkpointed
state with $t_2$ tasks on $p_2$ processors. As a result, applications
can recover from partial failures in the underlying system. Also, the
reconfigurable checkpointed states can be migrated from one parallel
system to another even if they do not have the same number of
processors. We describe a new programming model for implementing a
reconfigurable checkpointing scheme for parallel programs. This new
model is derived from the DRMS programming model, developed in the
context of run-time reconfiguration of parallel applications. A key
component of our implementation is the distribution-independent
representation of application array data structures in persistent
storage. To further optimize the performance of checkpoint/restart
operations, we provide parallel array-section streaming operations for
such distributed arrays. We present performance data for the
reconfigurable checkpointing and restarting of parallel applications
and compare it with the performance of conventional forms of
checkpointing. Our results demonstrate the advantages of the new
scheme.
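
To make the idea of a distribution-independent checkpoint concrete, the
following is a minimal sketch, not the DRMS interface. It assumes a
one-dimensional array with a simple block distribution, and it uses an
in-memory list as a stand-in for persistent storage. All function names
(block_bounds, scatter, checkpoint, restart) are hypothetical and chosen
only for illustration. The checkpoint stores elements in global index
order, so the saved image carries no trace of how many tasks wrote it and
can be redistributed over a different number of tasks on restart.

```python
def block_bounds(n, tasks, rank):
    """Half-open global index range [lo, hi) owned by `rank` when n
    elements are block-distributed over `tasks` tasks (assumed layout)."""
    base, extra = divmod(n, tasks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def scatter(global_array, tasks):
    """Split a global array into per-task local pieces."""
    n = len(global_array)
    return [global_array[slice(*block_bounds(n, tasks, r))]
            for r in range(tasks)]


def checkpoint(local_pieces, n):
    """Write every task's piece at its global offset. The resulting image
    is ordered by global index, independent of the task count."""
    image = [None] * n  # stand-in for a file in persistent storage
    tasks = len(local_pieces)
    for rank, piece in enumerate(local_pieces):
        lo, hi = block_bounds(n, tasks, rank)
        image[lo:hi] = piece
    return image


def restart(image, tasks):
    """Redistribute the saved image over a possibly different task count."""
    return scatter(image, tasks)


# Checkpoint with t1 = 4 tasks, restart with t2 = 3 tasks.
data = list(range(10))
pieces_t1 = scatter(data, 4)
image = checkpoint(pieces_t1, len(data))
pieces_t2 = restart(image, 3)
```

In this sketch each task's write touches a contiguous, non-overlapping
range of the image, which is the property that would allow the writes (and
the reads on restart) to proceed in parallel in a real implementation.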

By: Vijay K. Naik, Samuel P. Midkiff, José E. Moreira

Published in: RC20964 in 1997

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
