Reliable Hardware Barrier Synchronization Schemes

        Barrier synchronization is a crucial operation for parallel systems. Many schemes have been proposed in the literature to achieve fast barrier synchronization through software, hardware, or a combination of the two. However, none of these schemes emphasizes fault-tolerant barrier operation. In this paper, we describe inexpensive support that can be added to network switches to achieve reliable hardware-based barrier synchronization that recovers from lost or corrupted messages. The necessary modifications to the switch architecture and the associated fault-tolerant message-passing protocols are presented. The protocols are optimized for the no-fault case while providing means to detect the failure of any step of the operation and to recover from it. The proposed scheme is evaluated with and without specialized support at the network interface and compared with similar approaches using software-based schemes. The scheme shows significant potential for switch-based parallel systems, especially emerging networks of workstations.

By: Rajeev Sivaram (The Ohio State Univ.), Craig B. Stunkel, Dhabaleswar K. Panda (The Ohio State Univ.)

Published in: RC20653 in 1996

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.

Questions about this service can be mailed to reports@us.ibm.com.