Optimization of MPI Collective Communication on BlueGene/L Systems

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power, dual-processor compute nodes interconnected by high-speed torus and tree networks. Because the nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2.

In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is built on point-to-point communication primitives. This turns out to be suboptimal on BlueGene/L, not least because point-to-point messages cannot exploit the machine's dedicated tree network. Machine-optimized MPI collectives are necessary to harness the full performance of BlueGene/L. We discuss these optimized collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
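To make the point-to-point approach concrete, the sketch below shows a binomial-tree broadcast expressed entirely in MPI sends and receives, in the spirit of MPICH2's generic collectives. This is an illustrative sketch, not MPICH2's actual code: the root is fixed at rank 0 for brevity, and the real implementation also handles arbitrary roots and switches algorithms by message size.

```c
#include <mpi.h>

/* Sketch of a binomial-tree broadcast built purely from point-to-point
 * operations. Assumes root == 0; MPICH2's generic collectives follow
 * this general pattern but are more elaborate. */
static void bcast_binomial(void *buf, int count, MPI_Datatype type,
                           MPI_Comm comm)
{
    int rank, size, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Walk the mask upward until we find the bit that identifies our
     * parent; non-root ranks block here until the data arrives. */
    for (mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
    }

    /* Forward the data down the tree: send to the peers obtained by
     * setting each bit below the one we received on. */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
    }
}
```

A broadcast of this form takes about log2(P) point-to-point steps routed over the torus and makes no use of the tree network's hardware support for broadcasts and reductions, which is one reason machine-optimized collectives outperform the generic MPICH2 versions on BlueGene/L.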

By: Gheorghe Almasi; Charles J. Archer; C. Chris Erway; Philip Heidelberger; Xavier Martorell; Jose E. Moreira; B. Steinmacher-Burow; Yili Zheng

Published in: IBM Research Report RC23554 (2005)

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

