MoHiDoC: Modular Hierarchical Diagnosis on Chip*

With ever-shrinking geometries, higher-density circuits and higher frequencies, soft errors in logic are expected to become a major concern for chip design, operation and maintenance in the coming years. Errors of this kind occur randomly and unpredictably. Furthermore, an error in a component of a System-on-Chip (SoC) could affect the entire component, or even the entire chip, if no adequate action is taken. Possible consequences include erroneous states, corrupted data that compromise data integrity, and system failure. Mechanisms exist for detecting single or multiple bit errors in logic circuits, such as predictive coding, parity checks or code replication. Other techniques, such as ECC or time redundancy, make recovery possible. However, we argue that those methods can deal with Single Event Upsets (SEUs) only locally, are expensive (particularly if implemented at each component), and ignore the global system state, consequently neglecting issues such as error propagation. We suggest that a new approach should make a chip aware of its state and capable of reacting autonomously when an error occurs.

To achieve this, we propose a hierarchical modular framework. First, we model the I/O functionality, state determination, and potential error detection, analysis and reaction of a building block (BB). Second, we map the system functionality onto logical BBs; i.e., a BB may correspond to a component or a part of a component. For communication between BBs we rely on a Petri Net (PN) structure. More precisely, we map the BBs to transitions in the PN and introduce token places as interfaces between BBs. As the hierarchical nature of the framework implies, there are additional higher layers in the PN that allow different levels of information, analysis and control. The higher-layer PN(s) can correspond to a centralized system-level control block and/or to off-chip higher-layer control. Again, the logical BBs of the higher-layer functionality are mapped to transitions, and token places are the interfaces between layers. Tokens carry the regular data with structured control data attached to them. The analysis mechanism of a BB in the different PNs uses the control data to estimate the state of the component (lower-layer PN) or the system (higher-layer PN) and to decide on the reaction mechanism(s), which it propagates in turn through the control data of the tokens. We believe that our approach offers the chip designer and tester considerable flexibility while enabling runtime system error recovery and unified interfaces to higher layers (e.g., OS or service processor) for system maintenance.
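
The mapping described above can be illustrated with a minimal Python sketch. It is not the report's implementation: the class names (Token, Place, BuildingBlock), the "errors" control field, the check callback standing in for a BB's error detector, and the system_state helper standing in for a higher-layer analysis are all assumptions made for illustration. The sketch only shows BBs acting as PN transitions, places acting as interfaces, and control data travelling with the regular data so that an error detected in one BB remains visible downstream.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    data: int                                   # regular payload
    control: Dict[str, list] = field(default_factory=dict)  # structured control data (assumed layout)


@dataclass
class Place:
    """Token place: the interface between two BBs."""
    name: str
    tokens: List[Token] = field(default_factory=list)


@dataclass
class BuildingBlock:
    """A BB modeled as a Petri-net transition (hypothetical sketch)."""
    name: str
    inputs: List[Place]
    outputs: List[Place]
    func: Callable[[List[int]], int]            # the BB's I/O functionality
    check: Callable[[List[int], int], bool]     # assumed local error detector

    def enabled(self) -> bool:
        # A transition may fire only if every input place holds a token.
        return all(p.tokens for p in self.inputs)

    def fire(self) -> bool:
        if not self.enabled():
            return False
        consumed = [p.tokens.pop(0) for p in self.inputs]
        values = [t.data for t in consumed]
        result = self.func(values)
        # Carry forward error flags reported by upstream BBs.
        errors = [e for t in consumed for e in t.control.get("errors", [])]
        if not self.check(values, result):
            errors.append(self.name)            # record a locally detected error
        for p in self.outputs:
            p.tokens.append(Token(result, {"errors": list(errors)}))
        return True


def system_state(token: Token) -> str:
    """Assumed higher-layer analysis: estimate state from control data."""
    return "error" if token.control.get("errors") else "ok"


def run_demo() -> Token:
    p_in, p_mid, p_out = Place("in"), Place("mid"), Place("out")
    # bb1 should double its input; the XOR injects a single-bit soft error.
    bb1 = BuildingBlock("bb1", [p_in], [p_mid],
                        func=lambda v: (2 * v[0]) ^ 1,
                        check=lambda v, r: r == 2 * v[0])
    bb2 = BuildingBlock("bb2", [p_mid], [p_out],
                        func=lambda v: v[0] + 1,
                        check=lambda v, r: r == v[0] + 1)
    p_in.tokens.append(Token(3))
    bb1.fire()
    bb2.fire()
    return p_out.tokens[0]
```

Calling run_demo() pushes one token through two chained BBs. The first BB's detector flags the injected bit flip, and the flag reaches the output place in the token's control data, where system_state reports "error" even though the second BB itself worked correctly. This is the error-propagation visibility that purely local detection lacks.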

In order to estimate the value of the framework, we develop a mathematical model of costs versus benefits, compare it to previous work and run simulations.

*Work performed by Dominique Tschopp for his EPFL Diploma Thesis while at the IBM Zurich Research Laboratory under the technical supervision of M. Gabrani.

By: Maria Gabrani; Dominique Tschopp

Published in: RZ3568 in 2004

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.
