A System Software Approach to Proactive Memory-Error Avoidance

Today’s HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and offlines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against traditional CR. We show that our approach increases application resiliency without introducing significant overhead and allows checkpointing requirements to be greatly relaxed.

By: Carlos H. A. Costa, Yoonho Park, Kyung Dong Ryu, Bryan Rosenburg, Chen-Yong Cher

Published in: Proceedings of SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, Los Alamitos, CA, IEEE, pp. 707-718, 2014


