Performance Analysis of All-to-All Communication on the Blue Gene/L Supercomputer

All-to-all communication is a well-known performance bottleneck for many applications. For such applications to scale to a large number of processors, optimizing all-to-all communication is critical. In this paper, we analyze the performance of all-to-all communication on the Blue Gene/L torus interconnection network, which has limited bisection bandwidth. The torus interconnect topology has link contention even for all-to-all communication operations with short messages. We observed that the performance of all-to-all communication also depends on the shape of the processor partition. We present a performance analysis of all-to-all communication on mesh and torus partitions of various shapes and sizes. We then present optimization schemes to enhance the performance of all-to-all communication. The large message optimization substantially improves all-to-all performance on an asymmetric torus. In particular, performance improved from about 70% to over 99% of peak on a 20,480 (40 × 32 × 16) node configuration, which was the largest machine to which we had access. The short message optimization can double all-to-all performance for very short messages.
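
As an illustration of why bisection bandwidth and partition shape matter, the following minimal sketch computes the standard bisection lower bound on all-to-all time. It is not taken from the paper. It assumes an even longest dimension, one link per row crossing a mesh cut (two for a torus, because of the wrap-around link), and equal bandwidth on every link. The function name, its parameters, and the placeholder bandwidth value are all hypothetical.

```python
from math import prod


def alltoall_lower_bound(dims, msg_bytes, link_bytes_per_sec, torus=True):
    """Bisection lower bound (seconds) on all-to-all time for a partition.

    Hypothetical helper: dims is the partition shape, msg_bytes is the
    size of the message each node sends to every other node, and
    link_bytes_per_sec is the assumed one-way bandwidth of a single link.
    """
    nodes = prod(dims)
    longest = max(dims)  # assumed even, so the cut splits the nodes in half
    # Cut the partition in half across its longest dimension.
    rows = nodes // longest
    # A mesh has one link per row crossing the cut; a torus has two.
    links_per_direction = rows * (2 if torus else 1)
    # Every node in one half sends msg_bytes to every node in the other half.
    bytes_per_direction = (nodes // 2) * (nodes // 2) * msg_bytes
    return bytes_per_direction / (links_per_direction * link_bytes_per_sec)


if __name__ == "__main__":
    shape = (40, 32, 16)  # the 20,480-node configuration from the abstract
    bandwidth = 1.0       # placeholder, not a measured Blue Gene/L figure
    print(alltoall_lower_bound(shape, 1, bandwidth, torus=True))
    print(alltoall_lower_bound(shape, 1, bandwidth, torus=False))
```

Under these assumptions the bound grows with the longest dimension of the partition, and a mesh of a given shape has twice the bound of a torus of the same shape.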

By: Sameer Kumar; Philip Heidelberger

Published in: RC24327 in 2007


