Optimizing MPI Collectives Using Efficient Intra-node Communication Techniques over the Blue Gene/P Supercomputer

The Blue Gene/P (BG/P) system is the second generation in IBM's line of massively parallel supercomputers, designed to scale to petaflop performance with a strong emphasis on efficiency in power, cooling, and space consumption. The system consists of thousands of compute nodes interconnected by multiple networks, of which a 3D torus equipped with a direct memory access (DMA) engine is the primary one. BG/P also features a collective network that supports hardware-accelerated collective operations such as broadcast and allreduce. Each BG/P node consists of four cache-coherent symmetric multi-processor (SMP) cores. The Message Passing Interface (MPI) is the most popular method of programming parallel applications on these large supercomputers. One of BG/P's operating modes is quad mode, in which all four cores run active MPI tasks performing both inter-node and intra-node communication.

In this paper, we propose software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in BG/P quad mode by using the cache-coherent memory subsystem as the communication medium within the node. Specifically, we propose techniques utilizing a shared address space wherein a process can access a peer's memory through specialized system calls. Apart from cutting down copy costs, such techniques allow lightweight synchronization structures, such as message counters, to be designed in software. These counters are used to effectively pipeline data across the network and intra-node interfaces. Further, the shared address capability provides an easy means of core specialization, where the different tasks in a collective operation can be delegated to specific cores. This is critical for efficiently using the hardware collective network on BG/P, as separate cores are needed to inject data into the network, to receive data from it, and to perform copy operations within the node. We also propose a concurrent data structure, the Broadcast (Bcast) FIFO, which is designed using atomic operations such as fetch-and-increment. We demonstrate the utility and benefits of these mechanisms using benchmarks that measure the performance of MPI_Bcast and MPI_Allreduce. Our optimizations provide up to a 2.9-fold improvement for MPI_Bcast over the current approaches on the 3D torus. Further, we see improvements of up to 44% for MPI_Bcast using the collective tree and up to 33% for MPI_Allreduce over the 3D torus.
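To make the idea of a counter-synchronized, shared-memory broadcast FIFO concrete, the C sketch below shows one minimal way such a structure could look: a ring of shared slots whose publish counter is advanced with an atomic fetch-and-increment, with each consumer core copying out chunks using its own private read counter. This is only an illustrative sketch under assumed names, slot sizes, and a single-producer simplification; it is not the authors' actual BG/P implementation, which additionally handles network injection/reception, back-pressure, and DMA interaction.

```c
/* Illustrative sketch (not the authors' BG/P code): a single-producer,
 * multi-consumer broadcast FIFO placed in memory visible to all cores of
 * a node.  The producer copies a data chunk into the next slot and then
 * publishes it with an atomic fetch-and-increment; each consumer keeps a
 * private read counter and copies every published chunk exactly once. */
#include <stdatomic.h>
#include <string.h>

#define BCAST_SLOTS      16      /* number of pipeline slots (hypothetical) */
#define BCAST_SLOT_BYTES 4096    /* payload bytes per slot (hypothetical)   */

typedef struct {
    atomic_ulong published;      /* chunks made visible by the producer */
    char slots[BCAST_SLOTS][BCAST_SLOT_BYTES];
} bcast_fifo_t;

/* Producer (e.g., the core receiving data from the network): copy one
 * chunk in and publish it.  Back-pressure when the ring is full is
 * omitted for brevity. */
static void bcast_fifo_push(bcast_fifo_t *f, const void *chunk, size_t len)
{
    unsigned long seq = atomic_load_explicit(&f->published, memory_order_relaxed);
    memcpy(f->slots[seq % BCAST_SLOTS], chunk, len);
    /* The fetch-and-increment acts as the "message counter": it publishes
     * the slot and tells consumers how much data is ready to pipeline. */
    atomic_fetch_add_explicit(&f->published, 1, memory_order_release);
}

/* Consumer (one per peer core): spin until the next chunk is published,
 * then copy it out.  'my_next' is this consumer's private read counter. */
static void bcast_fifo_pop(bcast_fifo_t *f, unsigned long *my_next,
                           void *dst, size_t len)
{
    while (atomic_load_explicit(&f->published, memory_order_acquire) <= *my_next)
        ;   /* poll: lightweight synchronization, no locks or kernel calls */
    memcpy(dst, f->slots[*my_next % BCAST_SLOTS], len);
    (*my_next)++;
}
```

Because consumers only read the publish counter and the producer only advances it, the structure needs no locks; the same fetch-and-increment also generalizes to multiple injecting cores reserving distinct slots, which is where core specialization (one core injecting, another receiving, others copying) fits naturally.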

By: Amith Mamidala, Ahmad Faraj, Sameer Kumar, Douglas Miller, Michael Blocksome, Thomas Gooding, Philip Heidelberger, Gabor Dozsa

Published in: RC25088 in 2010

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

rc25088.pdf

Questions about this service can be mailed to reports@us.ibm.com.