Runtime Address Disambiguation for Local Memory Management

In heterogeneous multi-core systems, such as the Cell/B.E. and certain embedded systems, the accelerator core has its own fast local memory without hardware-supported coherence between the local and global memories. It is the software's responsibility to dynamically transfer the working set into and out of the local memory when the total data set is too large to fit. Similar to a hardware cache, a software-controlled cache can be set up to automatically manage the local memory for data transfer and reuse. However, a software cache introduces extra overheads, such as cache lookup and cache directory maintenance. Regular references in the program can usually be optimized with direct buffering, which replaces the references with local buffers at compile time to minimize the runtime overhead. Such optimization, however, relies on precise alias or dependence information generated by the compiler, or on directives provided by users. In applications where this information is absent, the compiler may lose opportunities for direct buffering. In this work, we explore runtime address disambiguation for local memory management. We propose a framework in which references from the software cache and from direct buffers are checked at runtime with an overlap-detection method optimized for our purpose and hardware. If the addresses overlap, a single local copy is kept to solve the coherence problem. Two directories, one for direct buffers and one for the software cache, are used together, with their interaction carefully devised. As a result, our solution keeps the advantages of both the software cache and direct buffering, and disambiguates memory accesses efficiently at runtime. We have implemented this method in the XL compiler for acceleration on Cell and have conducted experiments with small kernels and the NAS OpenMP benchmarks. The results show that our method maintains correctness while increasing the opportunities for direct buffering. The performance of some benchmarks improves by up to a factor of 3, and no slowdown was observed on any benchmark.
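As a rough illustration of the mechanism described in the abstract, the following C fragment sketches one way a runtime overlap check between direct buffers and the software cache could look. All names (db_directory, db_register, db_lookup, and the entry layout) are hypothetical and simplified for exposition; they are not taken from the report's implementation.

    /* Hypothetical sketch: a small directory of direct buffers that the
     * software cache consults on a miss, so that at most one local copy
     * of any global region exists.  Names and layout are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t ea_start;   /* global (effective) start address of the buffered region */
        uint64_t ea_end;     /* one past the last byte covered by the buffer            */
        void    *ls_copy;    /* the single local-store copy of this region              */
    } db_entry_t;

    #define DB_DIR_SIZE 32
    static db_entry_t db_directory[DB_DIR_SIZE];  /* direct-buffer directory */
    static int        db_count;

    /* Register a direct buffer before the loop that uses it. */
    static void db_register(uint64_t ea, size_t len, void *ls_copy)
    {
        if (db_count >= DB_DIR_SIZE)   /* a real runtime would handle overflow */
            return;
        db_directory[db_count].ea_start = ea;
        db_directory[db_count].ea_end   = ea + len;
        db_directory[db_count].ls_copy  = ls_copy;
        db_count++;
    }

    /* Overlap test: a software-cache miss on [ea, ea+len) is first checked
     * against the direct-buffer directory.  If the ranges overlap, the access
     * is served from the existing local copy.  (Partial overlap at the buffer
     * boundary is ignored here for brevity.) */
    static void *db_lookup(uint64_t ea, size_t len)
    {
        for (int i = 0; i < db_count; i++) {
            db_entry_t *e = &db_directory[i];
            if (ea < e->ea_end && ea + len > e->ea_start)        /* ranges overlap */
                return (char *)e->ls_copy + (ea - e->ea_start);  /* reuse the copy */
        }
        return NULL;  /* no overlap: fall through to the normal software cache */
    }

In this sketch, a non-NULL result from db_lookup means the data is already resident in a direct buffer, so the software-cache access is redirected there and no second local copy is created, which is what keeps the two mechanisms coherent.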

By: Tong Chen, Tao Zhang, Haibo Lin, Tao Liu, Kevin O'Brien, Marc Gonzalez Tallada

Published in: RC24750 in 2009

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

