Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

In recent years, scaling of single-core superscalar processor performance has slowed due to complexity and power considerations. To improve program performance, designs are increasingly adopting chip multiprocessing in the form of homogeneous or heterogeneous chip multiprocessors (CMPs). By trading off features of a modern aggressive superscalar core, CMPs often offer better scaling of aggregate performance, complexity and power, but they often require additional software investment to rewrite, retune or recompile programs to take advantage of the new designs. The Cell Broadband Engine is a modern example of a heterogeneous CMP with coprocessors (accelerators); it can be found in supercomputers (Roadrunner), blade servers (IBM QS20/21), and video game consoles (SCEI PS3). A Cell BE processor has a host Power RISC processor (PPE) and eight Synergistic Processor Elements (SPEs), each consisting of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). In this work, we explore the idea of offloading Automatic Dynamic Garbage Collection (GC) from the host processor onto accelerator processors using the coprocessor paradigm. Offloading part or all of GC to a coprocessor offers potential performance benefits because, while the coprocessor is running GC, the host processor can continue running other independent, more general computations.

We implement BDW garbage collection on a Cell system and offload the mark phase to the SPE coprocessor. We show mark phase execution on the SPE accelerator to be competitive with execution on a full-fledged PPE processor. We also explore object-based and block-based caching strategies for explicitly managed memory hierarchies, and explore the effectiveness of several prefetching schemes in the context of garbage collection. Finally, we implement Capitulative Loads using the DMA by extending software caches and quantify their performance impact on the coprocessor.
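As a rough illustration of the ideas named above, the following sketch simulates a mark phase that reaches the heap only through a small block-based software cache, which is how a coprocessor with an explicitly managed local store would have to access main memory. It is not the report's implementation: the names `SoftwareCache`, `mark`, `BLOCK_SIZE` and `NUM_LINES` are hypothetical, the heap is modelled as a Python list of child-address lists, pointers are treated as exact rather than conservatively identified as in BDW, and a block fetch merely stands in for a DMA transfer.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4   # objects per cache block (hypothetical value)
NUM_LINES = 8    # lines in the simulated local-store cache (hypothetical value)


@dataclass
class SoftwareCache:
    """Direct-mapped, block-based cache in front of simulated main memory."""
    memory: list                               # memory[addr] -> list of child addresses
    lines: dict = field(default_factory=dict)  # line index -> (block tag, block data)
    transfers: int = 0                         # count of simulated DMA block fetches

    def load(self, addr):
        block = addr // BLOCK_SIZE
        index = block % NUM_LINES
        entry = self.lines.get(index)
        if entry is None or entry[0] != block:
            # Miss: fetch the whole block, standing in for one DMA transfer.
            start = block * BLOCK_SIZE
            entry = (block, self.memory[start:start + BLOCK_SIZE])
            self.lines[index] = entry
            self.transfers += 1
        return entry[1][addr % BLOCK_SIZE]


def mark(roots, cache):
    """Return the set of addresses reachable from roots, using a mark stack."""
    marked = set()
    stack = list(roots)
    while stack:
        addr = stack.pop()
        if addr in marked:
            continue
        marked.add(addr)
        # Every heap read goes through the software cache.
        for child in cache.load(addr):
            if child not in marked:
                stack.append(child)
    return marked


if __name__ == "__main__":
    # Objects 0..3 are reachable from root 0; objects 4 and 5 are garbage.
    heap = [[1, 2], [3], [3], [], [0], []]
    cache = SoftwareCache(memory=heap)
    print(sorted(mark([0], cache)), cache.transfers)
```

In this toy model an object-based strategy would correspond to `BLOCK_SIZE = 1`, so comparing `transfers` across block sizes gives a feel for the trade-off the abstract mentions, though real costs depend on DMA latency and local-store capacity.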

By: Chen-Yong Cher; Michael Gschwind

Published in: RC24520 in 2007


This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

