A New Approach to Fault-Tolerant Wormhole Routing for Mesh-Connected Parallel Computers

Copyright © (2004) by IEEE. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distrubuted for profit. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

A new method for fault-tolerant routing in arbitrary dimensional meshes is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer in IBM Research. The machine is organized a s 3-dimensional mesh containing many thousands of nodes. Among the requirements were to provide deterministic deadlock-free wormhole routing in a 3-dimensional mesh, in the presence of many faults (up to a few percent of the number of nodes in the machine), while using two virtual channels. It was also desired to minimize the number of "turns" in each route, i.e., the number of times that the route changes direction. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method is to declare certain nonfaulty nodes to be "lambs"; a lamb is used to routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node", a node that is neither faulty nor a lamb, can reach every survivor node by a t most tow rounds of dimension-ordered (such a e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32 x 32 x 32 3D mesh with 3% random faults, and using at most two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7% of the number 983 of faults and less than 0.21% of the number 32768 of nodes. The computational complexity of finding the minimum number of lambs for a given fault set is also explored, and this problem is shown to be NP-hard for 3-dimensional meshes with two rounds of 3-cube routing.

By: Ching-Tien Ho, Larry Stockmeyer

Published in: IEEE Transactions on Computers, volume 53, (no 4), pages 427-38 in 2004

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rj10265.pdf

Questions about this service can be mailed to reports@us.ibm.com .