Soft Errors: The Hardware Software Interface

Overview

Continuous technology scaling lets us fabricate complex functionality into smaller processor chips that consume little power at affordable cost. This has accelerated our dependence on computing devices for a wide range of applications, from embedded systems to manageable supercomputers. A consequence of rapid technology scaling is that transistors become more susceptible to soft errors, which are caused by energetic charge-carrying particles and lead to system failures through data corruption. A recent book by Shubu Mukherjee demonstrates the impact of soft errors on future computing systems and motivates, through examples, the integration of efficient soft-error mitigation techniques during architecture design. The industry, recognizing the urgency of ensuring application reliability, has accepted power and hardware overheads to protect the data stored (even temporarily) in the processor. For instance, the L1 and L2 caches and register files of NVIDIA’s Tesla Personal Supercomputing GPUs are ECC protected, as are the local memories in each of the 8 processors of the IBM CELL Broadband Engine. Although the soft error rate in embedded devices is about once per year today, the exponential pace of technology scaling is expected to push it to an alarming once per day in about a decade.

Over the years, researchers have developed several techniques at various layers of the design abstraction to protect systems from soft errors. With the advent of multi-core systems, designers need to reevaluate traditional design methodologies, and it is time to step back and reconsider soft-error mitigation from a system-level perspective. In this tutorial, we will discuss soft-error mitigation techniques at all design layers, with a particular focus on those at the compiler and microarchitecture layers. We elaborate on key works that have been instrumental in opening orthogonal approaches to the problem of computation reliability.

The tutorial will cater to a wide audience comprising both (i) researchers specializing in embedded and high-performance computing, and (ii) system designers, architects, and programmers from industry. For researchers, the tutorial will be a valuable one-stop shop for learning about and analyzing seminal research in soft-error mitigation at each of the design layers. From an industry perspective, modifications to an existing design need to be weighed against possible gains, and we understand that such a decision is not taken lightly. The tutorial is shaped to showcase some very efficient design methodologies, at both the software and hardware layers, that could prove to be key design decisions for improving the reliability of future computing systems.

Speakers

Prof. Aviral Shrivastava

Prof. Kyoungwoo Lee

Dr. Reiley Jeyapaul 

 
