The Case for Microarchitectural Awareness of Lifetime Reliability

Ensuring long processor lifetimes by limiting failures due to hard errors is a critical requirement for all microprocessor manufacturers. Current methodologies for qualifying long-term lifetime reliability are overly conservative since they seek to maintain reliability for peak usage of the processor. This paper makes the case that the continued use of such methodologies will significantly and unnecessarily constrain performance. Instead, lifetime reliability awareness at the microarchitectural design stage can mitigate this problem, by designing processors that dynamically adapt in response to the observed usage to meet a reliability target.

We make two specific contributions. First, we describe an architecture-level model and its implementation, called RAMP, that can dynamically track lifetime reliability, responding to changes in application behavior. We use state-of-the-art models for different wear-out mechanisms and apply them to calculate failure rates of individual architectural structures. These failure rates are a function of temperature, switching activity, and voltage. RAMP is coupled with a conventional performance and power simulator to track these parameters over an application run.
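
As a rough illustration of per-structure failure-rate tracking, the sketch below computes a failure rate for each structure from temperature, switching activity, and voltage, then averages the processor total over sampled intervals. It is not RAMP itself: the abstract does not give the wear-out models, so the Arrhenius-style temperature term, the linear activity and voltage scaling, the summation across structures, and every constant and name (`structure_fit`, `processor_fit`, `base_fit`, `ea_ev`) are assumptions made only for illustration.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def structure_fit(temp_k, activity, vdd, *, base_fit=100.0, ea_ev=0.9,
                  ref_temp_k=355.0, ref_activity=0.5, ref_vdd=1.0):
    """Hypothetical failure rate of one structure, in FIT.

    Assumed form: a reference rate scaled by an Arrhenius-style thermal
    factor and by linear activity and voltage ratios. All defaults are
    placeholder values, not calibrated data.
    """
    thermal = math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / ref_temp_k - 1.0 / temp_k))
    return base_fit * thermal * (activity / ref_activity) * (vdd / ref_vdd)


def processor_fit(intervals):
    """Time-averaged total failure rate over equal-length intervals.

    `intervals` is a list of dicts mapping a structure name to a
    (temp_k, activity, vdd) tuple sampled during that interval.
    Summing across structures assumes any single failure is fatal.
    """
    totals = [sum(structure_fit(*sample) for sample in interval.values())
              for interval in intervals]
    return sum(totals) / len(totals)


# Two sampled intervals for two hypothetical structures.
trace = [
    {"alu": (355.0, 0.5, 1.0), "regfile": (355.0, 0.5, 1.0)},
    {"alu": (365.0, 0.7, 1.0), "regfile": (350.0, 0.3, 1.0)},
]
average_fit = processor_fit(trace)
```

In a setup like the one the abstract describes, the per-interval temperature, activity, and voltage values would come from the coupled performance and power simulator rather than from a hand-written trace.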

Second, we propose dynamic reliability management (DRM) – a technique where the processor can respond to changing application behavior to maintain its lifetime reliability target. In contrast to current reliability qualification methodologies, which are based on worst-case behavior, DRM allows processors to be qualified for reliability at lower (but more likely) operating points than the worst case. Using RAMP, we show that this can save cost and/or improve performance, that dynamic voltage scaling is an effective response technique for DRM, and that dynamic thermal management neither subsumes nor is subsumed by DRM.
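
One way to picture a DRM response based on dynamic voltage scaling is as a selection among voltage and frequency operating points: pick the fastest point whose estimated failure rate stays within the reliability target. The sketch below is a minimal illustration under that assumption, not the policy evaluated in the report. The function name `pick_operating_point`, the candidate points, and the toy estimator are all hypothetical.

```python
def pick_operating_point(points, estimate_fit, fit_target):
    """Return the fastest (freq_ghz, vdd) point that meets the target.

    `estimate_fit` is a caller-supplied function mapping an operating
    point to an estimated failure rate for the current application
    behavior. If no point meets the target, fall back to the slowest.
    """
    ordered = sorted(points, key=lambda p: p[0], reverse=True)
    for point in ordered:
        if estimate_fit(point) <= fit_target:
            return point
    return ordered[-1]


# Hypothetical operating points and a toy estimator that grows with voltage.
candidates = [(1.0, 0.8), (2.0, 1.0), (1.5, 0.9)]


def toy_estimate(point):
    return 1000.0 * point[1] ** 2


chosen = pick_operating_point(candidates, toy_estimate, fit_target=900.0)
```

A less demanding application would yield lower estimates at every point, which lets the same loop choose a faster point; this is the sense in which qualifying for likely rather than worst-case behavior can recover performance.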

By: Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, Jude Rivers

Published in: RC23088 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
