High Performance Distributed Co-clustering and Collaborative Filtering

Petascale Analytics is a hot research area both in academia and industry. It envisages processing massive amounts of data at extremely high rates and generating new scientific insights along with positive impact (for both users and providers) of industries such as E-commerce, Telecom, Finance, Life Sciences and so forth. We consider Collaborative filtering (CF) and Clustering algorithms that are key fundamental analytic kernels that help in achieving these aims. Real-time CF and co-clustering on highly sparse massive datasets, while achieving a high prediction accuracy, is a computationally challenging problem. In this report, we present the novel designs for soft real-time (less than 1 min.) distributed co-clustering based Collaborative Filtering algorithm. Our distributed algorithms have been optimized for multi-core cluster architectures. Theoretical analysis of time complexity of our algorithms proves the efficacy of our approach. Using Netflix (900M training ratings with replication) as well as the Yahoo KDD Cup 1 (4.6B training ratings with replication) datasets , we demonstrate the performance and scalability of our flat and hierarchical algorithm on a 1024 & 4096-node multi-core cluster architecture respectively. Our hybrid distributed flat algorithm performs 2x better than the previous flat algorithms on 1024 nodes of BlueGene/P while our batch mode hierarchical distributed algorithm (implemented using OpenMP with MPI) demonstrates around 4x better performance (on Blue Gene/P) as compared to the our hybrid distributed flat algorithm [best prior (flat MPI+OMP based algorithm) work, along with high accuracy (26± 4 RMSE for Yahoo KDD Cup data and 87± 0.02 for Netflix data). For online execution, we demonstrate around 3x gain vs the online baseline MPI algorithm. To the best of our knowledge, these are the best known performance results for collaborative filtering, at high prediction accuracy, for multi-core cluster architectures.

By: Ankur Narang, Abhinav Srivastava and Naga Praveen Kumar Katta

Published in: RI11019 in 2011

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

parallel.cf.bgp.pdf

Questions about this service can be mailed to reports@us.ibm.com .