Parallelizing Loops in Parallel Programs

The advent of multi-core systems has taken away the luxury of automatic speedups for legacy applications that was available in the era of uni-core systems, when single-core performance doubled roughly every 18 months and application performance improved along with it. These applications must now be explicitly re-targeted to benefit from the increasing parallelism of multi-core systems. One key requirement for better utilization of the available cores is the (further) parallelization of code. This requirement is also pertinent in the context of new task-parallel languages (such as Cilk++, Intel Threading Building Blocks, Java Concurrency, OpenMP, Chapel, Fortress, X10, HJ, and so on), where the user may start with a partially parallelized version of a program and then proceed to parallelize it further in an incremental fashion. In this paper, we demonstrate the insufficiency of traditional loop parallelization techniques for programs that contain explicit parallelism, and then present an extension to the classical loop parallelization techniques for semantics-preserving parallelization of loops in parallel programs. We present our techniques in the context of a refactoring framework that can help an application developer incrementally parallelize loops in explicitly parallel programs. In the process, we extend the state of the art in two aspects: (a) Our parallelization framework depends on May-Happen-in-Parallel (MHP) information, which must be recomputed after every refactoring; the complexity of MHP computation therefore adds to the cost of our refactoring. Owing to the high computational complexity of the MHP algorithm of Agarwal et al. [2], which takes O(N³H) time to compute the set of all statements that may run in parallel with a given statement (where N is the number of statements in the program and H is the height of the program structure tree (PST)), the complexity of our proposed approach can become O(N³H).
To improve the complexity of our overall algorithm, we present a novel incremental MHP analysis of complexity O(N²), which in turn reduces the complexity of our overall algorithm to O(N²). (b) To improve the opportunities for parallelization, we present a scheme of partial privatization that offers a handle on the space overhead associated with traditional privatization techniques. The techniques presented in the paper are not fully automatic: they may use some manual intervention to fine-tune the scope and effectiveness of the transformation, thereby bypassing some of the pitfalls of fully automatic parallelization techniques. We have applied our techniques to a variety of parallel benchmark programs and have obtained encouraging results.
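The insufficiency of classical loop parallelization in explicitly parallel programs can be illustrated with a small Java sketch (our own hypothetical example, not taken from the paper): a loop with no loop-carried dependences that nevertheless cannot be safely parallelized without MHP information, because an explicitly parallel task mutates the same array while the loop runs.

```java
import java.util.Arrays;

public class MhpExample {
    // Illustrative sketch: classical dependence analysis, looking
    // at the loop alone, would mark it parallel; the hazard comes
    // from a statement elsewhere that may-happen-in-parallel (MHP)
    // with the loop's iterations.
    static int[] run() throws InterruptedException {
        int[] a = new int[4];

        // An explicitly parallel task that mutates shared state
        // while the loop below executes.
        Thread task = new Thread(() -> {
            for (int i = 0; i < a.length; i++) a[i] = 1;
        });
        task.start();

        // In isolation this loop carries no dependence, so a
        // classical parallelizer would mark it parallel. But each
        // iteration may run in parallel with the task's write to
        // a[i], so whether parallelizing it preserves semantics
        // depends on MHP information, not loop-local dependences.
        for (int i = 0; i < a.length; i++) a[i] = a[i] + 1;

        task.join();
        // Each element ends up 1 or 2, depending on interleaving.
        return a;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run()));
    }
}
```

A refactoring tool that consulted only the loop's own dependences would transform this loop; one that consults MHP information sees the conflicting concurrent write and refrains (or synchronizes).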
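The idea behind partial privatization can likewise be sketched (again a hypothetical illustration; the class name, method names, and the O(threads × K) space bound shown are our assumptions, not the paper's formulation): each worker privatizes only the small scratch window it actually writes, rather than a full copy of the shared scratch space, keeping the space overhead of privatization under control.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class PartialPrivatizationSketch {
    static final int N = 8, K = 4;

    // Hypothetical kernel: each iteration i fills a small scratch
    // window temp[0..K) and reduces it into out[i]. A shared `temp`
    // would create an output dependence across iterations.
    static int[] run(int threads) {
        int[] out = new int[N];
        IntStream.range(0, threads).parallel().forEach(t -> {
            // Partial privatization (sketch): each worker gets a
            // private copy of only the K-element window it writes,
            // not a full copy of all the shared scratch data, so
            // the extra space is O(threads * K) rather than a full
            // per-thread replica.
            int[] temp = new int[K];
            for (int i = t; i < N; i += threads) {
                Arrays.fill(temp, 0);
                for (int j = 0; j < K; j++) temp[j] = i + j;
                int sum = 0;
                for (int j = 0; j < K; j++) sum += temp[j];
                out[i] = sum;  // disjoint indices: no conflict
            }
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(2)));
    }
}
```

Privatizing only the written window removes the output dependence that blocked parallelization, while avoiding the full per-thread replication that traditional privatization would incur.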

By: Soham S. Chakraborty and V. Krishna Nandivada

Published as: IBM Research Report RI10007, 2010

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).


Questions about this service can be mailed to reports@us.ibm.com.