Adaptive Caching Algorithms for Big Data Systems

Today’s Big Data platforms have enabled the democratization of data by allowing data sharing among the various data processing frameworks and applications that run on the same platform. This data and resource sharing, combined with the fact that most applications tend to access a hot set of the data, has led to the development of external, in-memory, distributed caching frameworks. In this paper, we develop online, adaptive algorithms for external caches. Our caching algorithms take into account both the workload access pattern and the cost of insertions into the external caching framework when making cache insertion and replacement decisions. We provide both a detailed simulation study and cluster experiments on IBM Big SQL, and show that only our adaptive algorithms perform well across different workload characteristics, adapt to evolving workload access patterns, and approach the performance of optimized offline solutions.

By: Avrilia Floratou, Nimrod Megiddo, Navneet Potti, Fatma Özcan, Uday Kale, Jan Schmitz-Hermes

Published in: RJ10531 in 2015

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

RJ10531.pdf

Questions about this service can be mailed to reports@us.ibm.com.