Understanding Systems and Architecture for Big Data

The use of Big Data underpins critical activities in all sectors of our society. Achieving the full transformative potential of Big Data in this increasingly digital world requires both new data analysis algorithms and a new class of systems to handle the dramatic data growth, the demand to integrate structured and unstructured data analytics, and the increasing computing needs of massive-scale analytics. In this paper, we discuss several Big Data research activities at IBM Research: (1) Big Data benchmarking and methodology; (2) workload-optimized systems for Big Data; and (3) a case study of Big Data workloads on IBM Power systems. Our case study shows that preliminary infrastructure tuning results in sorting 1 TB of data in 8 minutes¹ on 10 PowerLinux™ 7R2 with POWER7+ systems [5] running IBM InfoSphere BigInsights™. This translates to sorting 12.8 GB/node/minute for the I/O-intensive sort benchmark. We also show that 4 PowerLinux 7R2 with POWER7+ nodes can sort 1 TB of input in around 21 minutes. Further improvements are expected as we continue full-stack optimizations on both IBM software and hardware.
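The per-node figure quoted above can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the report: the function name is hypothetical, and it assumes that 1 TB is counted as 1024 GB, since that is the convention that reproduces the stated 12.8 GB/node/minute.

```python
# Assumption (not stated in the abstract): 1 TB is counted as 1024 GB.
# This is the convention that reproduces the quoted 12.8 GB/node/minute.
GB_PER_TB = 1024


def per_node_throughput(data_tb: float, nodes: int, minutes: float) -> float:
    """Return sort throughput in GB per node per minute (hypothetical helper)."""
    return data_tb * GB_PER_TB / (nodes * minutes)


# 10-node run: 1 TB sorted in 8 minutes, as stated in the abstract.
ten_node = per_node_throughput(1, 10, 8)

# 4-node run: 1 TB sorted in "around 21 minutes", so this value is approximate.
four_node = per_node_throughput(1, 4, 21)

print(f"10 nodes: {ten_node:.1f} GB/node/minute")
print(f" 4 nodes: {four_node:.1f} GB/node/minute (approximate)")
```

Under this assumption, the 10-node run works out to 12.8 GB/node/minute and the 4-node run to roughly 12.2 GB/node/minute. The second value is only as precise as the "around 21 minutes" figure allows.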

By: William M. Buros, Guan Cheng Chen, Mei-Mei Fu, Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Thomas G. Lendacky, Jian Li, Yan Li, John S. Poelman, Steven Pratt, Ju Wei Shi, Evan Speight, Peter W. Wong

Published in: RC25281 Revised in 2013

