Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Spark is an in-memory-oriented computing framework that runs on top of a Java Virtual Machine (JVM), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance. For example, a Java heap that is too large often causes long garbage collection pauses, which can account for more than 10-20% of application execution time. Moreover, modern computing nodes have many cores and run multiple threads simultaneously with SMT technology, so full-stack optimization is also important: not only JVM parameters but also OS parameters, Spark configuration, and the application code itself must be tuned to the CPU characteristics to take full advantage of the underlying computing resources. In this paper, we use the TPC-H benchmark as our optimization case study and gather logs from many perspectives: application logs, JVM logs such as GC and JIT, system utilization, and hardware events from the PMU. We investigate the existing problems and then introduce several optimization approaches for accelerating Spark performance. As a result, our optimizations achieve a 30-40% speedup on average and run up to 5x faster than the naive configuration.
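To make the kind of tuning discussed above concrete, the following is a minimal sketch of how an explicit executor heap size, executor core count, and GC logging might be passed to spark-submit. The helper name build_spark_submit_args, the example values, and the application path are hypothetical and are not taken from the paper; only the Spark property names and the -verbose:gc JVM flag are standard.

```python
def build_spark_submit_args(executor_memory_gb, executor_cores, app_path):
    """Build a spark-submit argument list (hypothetical helper, not from the paper).

    The heap size is set through spark.executor.memory rather than a raw -Xmx
    flag, and GC logging is enabled through spark.executor.extraJavaOptions.
    """
    conf = {
        # Executor JVM heap size; the value is an assumption to be tuned.
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Number of task threads per executor, chosen with the core/SMT layout in mind.
        "spark.executor.cores": str(executor_cores),
        # Standard JVM flag that logs garbage collection activity.
        "spark.executor.extraJavaOptions": "-verbose:gc",
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # Example values only; suitable settings depend on the node and workload.
    print(" ".join(build_spark_submit_args(24, 8, "tpch_query.py")))
```

Building the command as a list keeps each setting visible and easy to vary between runs, which suits the measure-and-compare style of tuning the abstract describes.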

By: Tatsuhiro Chiba and Tamiya Onodera

Published in: Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Piscataway, NJ, IEEE, pp. 112-121, 2016. DOI: 10.1109/ISPASS.2016.7482079


