CellSort: High Performance Sorting on the Cell Processor

In this paper we describe the design and implementation of CellSort − a high performance distributed sort algorithm for the Cell processor. We design CellSort as a distributed bitonic merge with a data-parallel bitonic sorting kernel. In order to best exploit the architecture of the Cell processor and make use of all available forms of parallelism to achieve good scalability, we structure CellSort as a three-tiered sort. The first tier is a SIMD (single-instruction multiple data) optimized bitonic sort, which sorts up to 128KB of items that cat fit into one SPE’s (a co-processor on Cell) local store. We design a comprehensive SIMDization scheme that employs data parallelism even for the most fine-grained steps of the bitonic sorting kernel. Our results show that, SIMDized bitonic sorting kernel is vastly superior to other alternatives on the SPE and performs up to 1.7 times faster compared to quick sort on 3.2GHz Intel Xeon. The second tier is an in-core bitonic merge optimized for cross-SPE data transfers via asynchronous DMAs, and sorts enough number of items that can fit into the cumulative space available on the local stores of the participating SPEs. We design data transfer and synchronization patters that minimize serial sections of the code by taking advantage of the high aggregate cross-SPE bandwidth available on Cell. Results show that, in-core bitonic sort scales well on the Cell processor with increasing number of SPEs, and performs up to 10 times faster with 16 SPEs compared to parallel quick sort on dual-3.2GHz Intel Xeon. The third tier is an out-of-core1 bitonic merge which sorts large number of items stored in the main memory. Results show that, when properly implemented, distributed out-of-core bitonic sort on Cell can significantly outperform the asymptotically (average case) superior quick sort for large number of memory resident items (up to 4 times faster when sorting 0.5GB of data with 16 SPEs, compared to dual-3.2GHz Intel Xeon).

By: Bugra Gedik; Rajesh R. Bordawekar; Philip S. Yu

Published in: RC24161 in 2007

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc24161.pdf

Questions about this service can be mailed to reports@us.ibm.com .