Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization. Computations on non-contiguous and especially interleaved data appear in important applications, which can greatly benefit from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an ambitious challenge for both programmers and vectorizing compilers. In this paper we demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with strides that are powers of 2, facilitating data reorganization. We demonstrate how our vectorization scheme applies to SIMD architectures that are dominant today, and present experimental results on a wide range of key kernels, showing speedups of up to 3.7 for interleaving levels (strides) as high as 8.
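To make the notion of interleaved data concrete, the following is a minimal sketch in plain C of a stride-2 access pattern (interleaved real and imaginary parts) and of the kind of reorganization a vectorizer has to perform. It is an illustration only, not the compilation scheme of the paper: the function names, the constant `VL`, and the use of small arrays to stand in for vector registers are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 4 };  /* hypothetical vector length in elements (assumption) */

/* Scalar reference: in[] holds interleaved (re, im) pairs, i.e. stride 2. */
void split_scalar(const float *in, float *re, float *im, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        re[i] = in[2 * i];
        im[i] = in[2 * i + 1];
    }
}

/* Vector-shaped version: per iteration, load two contiguous blocks of VL
   elements (standing in for two vector loads), then extract the even
   elements into re[] and the odd elements into im[]. */
void split_blocked(const float *in, float *re, float *im, size_t n)
{
    assert(n % VL == 0);  /* sketch only: no handling of leftover elements */
    for (size_t i = 0; i < n; i += VL) {
        float v0[VL], v1[VL];
        for (size_t k = 0; k < VL; k++) {
            v0[k] = in[2 * i + k];       /* first contiguous load  */
            v1[k] = in[2 * i + VL + k];  /* second contiguous load */
        }
        for (size_t k = 0; k < VL / 2; k++) {
            re[i + k]          = v0[2 * k];      /* even elements of v0 */
            re[i + VL / 2 + k] = v1[2 * k];      /* even elements of v1 */
            im[i + k]          = v0[2 * k + 1];  /* odd elements of v0  */
            im[i + VL / 2 + k] = v1[2 * k + 1];  /* odd elements of v1  */
        }
    }
}
```

Both functions produce the same `re[]` and `im[]` arrays. The second is shaped so that each memory access is a contiguous block of `VL` elements, with the even/odd extraction standing in for the data reorganization step that a SIMD target would perform with its own permute or shuffle operations.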
By: Dorit Nuzman; Ira Rosen; Ayal Zaks
Published in: H-0235 in 2005
LIMITED DISTRIBUTION NOTICE:
This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).