Software-level Solutions Against Hardware Faults


Introduction

Challenge:

Continuous technology scaling allows more transistors to fit in an integrated circuit. With every new processor generation, transistors shrink in size and operate with a lower threshold voltage and a narrower noise margin than their predecessors. As a result of this aggressive scaling, devices are becoming increasingly susceptible to soft errors. Soft errors, or transient faults, are typically caused by cosmic particles striking transistors and can flip the logic value a transistor holds. Without countermeasures against soft errors, an application running on otherwise flawless hardware can malfunction unexpectedly (e.g., an object identification model that misidentifies a truck as a bird [1]). Traditional hardware-level reliability solutions require additional hardware and are difficult to apply to general-purpose hardware that has already been deployed. Our research aims to develop software-level reliability solutions that require no hardware modification, by implementing redundancy in software.

A soft error is a bit-flip in a transistor induced by external sources.
A soft error can result in a malfunction of an application.
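
To make the effect of a single bit-flip concrete, the following minimal sketch flips one bit in the IEEE 754 single-precision encoding of a value. The function name `flip_bit` and the example value are illustrative assumptions, not part of our tooling; the sketch only emulates in software what a particle strike might do to a stored value.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 32-bit float encoding flipped.

    `bit` counts from 0 (least significant mantissa bit) to 31 (sign bit).
    This helper is hypothetical and exists only for illustration.
    """
    # Reinterpret the float as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Emulate the soft error: toggle exactly one bit.
    raw ^= 1 << bit
    # Reinterpret the corrupted bits as a float again.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw))
    return corrupted


weight = 0.5  # e.g., a stored value such as a neural network weight
print("original:           ", weight)
print("mantissa bit 0 flip:", flip_bit(weight, 0))   # barely changes the value
print("exponent bit 30 flip:", flip_bit(weight, 30))  # changes it by many orders of magnitude
```

The same physical event can therefore be harmless or catastrophic depending on which bit it hits, which is one reason the outcome of a soft error is hard to predict without protection.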

Instruction Replication Solutions for Common Applications

A key strategy for mitigating soft errors is redundancy: if a soft error corrupts one of the redundant instances, the system can detect the fault by comparing the instances. Instruction replication solutions achieve this redundancy by dividing the registers into original and shadow (redundant) registers and replicating assembly instructions so that the copies operate on the shadow registers. Instead of checking every pair of original and redundant instructions, these solutions insert software-level fault detection code, which compares original and shadow registers, before critical operations such as store and control-flow instructions. Our research focuses on finding the vulnerable points of state-of-the-art instruction replication solutions and resolving them with improved replication and checking methodologies.

Instruction replication schemes such as SWIFT [2] replicate computations and detect mismatches before critical operations.
CHITIN [3], our proposed instruction replication scheme, resolves the vulnerability of the control-flow protection in SWIFT.
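
The replicate-then-check idea described in this section can be sketched at a high level as follows. This is a simplified model under stated assumptions, not the SWIFT or CHITIN implementation: real schemes operate on assembly instructions inside the compiler, whereas here plain Python variables stand in for original and shadow registers. The names `protected_sum`, `checked_store`, `FaultDetected`, and the `fault_at` fault-injection knob are all hypothetical, and control-flow checking is not modeled.

```python
class FaultDetected(Exception):
    """Raised when an original value and its shadow copy disagree."""


def checked_store(memory, addr, value, addr_shadow, value_shadow):
    # Compare original and shadow copies right before the critical store,
    # so that a corrupted value never reaches memory.
    if addr != addr_shadow or value != value_shadow:
        raise FaultDetected("original/shadow mismatch before store")
    memory[addr] = value


def protected_sum(data, memory, addr, fault_at=None, fault_bit=0):
    """Sum `data` with duplicated computation and store the result at `addr`.

    `fault_at` (illustrative only) flips one bit of the original accumulator
    after the given iteration to emulate a soft error.
    """
    acc, acc_shadow = 0, 0       # original and shadow "registers"
    addr_shadow = addr           # shadow copy of the store address
    for i, x in enumerate(data):
        acc = acc + x                    # original instruction
        acc_shadow = acc_shadow + x      # replicated instruction
        if fault_at == i:
            acc ^= 1 << fault_bit        # injected bit-flip, original copy only
    # No check inside the loop: the check is deferred to the store.
    checked_store(memory, addr, acc, addr_shadow, acc_shadow)
    return acc


memory = {}
print("fault-free result:", protected_sum([1, 2, 3], memory, addr=0))
try:
    protected_sum([1, 2, 3], memory, addr=1, fault_at=1)
except FaultDetected as err:
    print("fault detected:", err)
print("memory after both runs:", memory)
```

Deferring the comparison to the store keeps the checking overhead low, but it also means that whatever the checks do not cover (for example, faults that affect control flow) remains a potential vulnerability, which is the kind of gap our research targets.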

Fault-aware Scheduling for Mixed-Criticality Systems (MCS)

Scheduling in a mixed-criticality system (MCS) is judged successful when the scheduled tasks complete within their deadlines. In the presence of soft errors, however, a scheduled task can end in a system-visible failure such as a crash, or in system-invisible silent data corruption (SDC), in which the task completes normally but produces incorrect output. Our research aims to improve the accuracy of failure rate analysis for MCS tasks by applying the failure classification used in the reliability research domain, and to provide suitable re-execution strategies for tasks with or without protection, based on our proposed instruction replication solutions and task-level failure analysis.

Failure classification from the reliability research domain can categorize the effects of soft errors on tasks for scheduling.
Mixed-criticality system scheduling can apply different levels of protection and re-execution [4] based on the requirements of the tasks.
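
As a rough illustration of why the failure classification matters for re-execution, the sketch below uses a deliberately simple probabilistic model. All of it is an assumption made for illustration rather than our actual analysis: each attempt of a task independently ends as correct, as a detected failure (e.g., a crash or a mismatch caught by a check), or as SDC; only detected failures trigger re-execution; and the probabilities in the usage example are made-up numbers, not measurements. The names `AttemptProfile` and `task_outcome` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptProfile:
    """Per-attempt outcome probabilities of one task (illustrative model)."""
    p_detected: float  # system-visible failure: crash or detected mismatch
    p_sdc: float       # silent data corruption: completes with wrong output

    @property
    def p_correct(self) -> float:
        return 1.0 - self.p_detected - self.p_sdc


def task_outcome(profile: AttemptProfile, re_executions: int) -> dict:
    """Final outcome probabilities when detected failures are re-executed.

    An SDC is invisible to the system, so it never triggers a re-execution.
    """
    attempts = re_executions + 1
    # Probability mass of reaching attempt i: all earlier attempts were detected failures.
    reach = sum(profile.p_detected ** i for i in range(attempts))
    return {
        "correct": reach * profile.p_correct,
        "sdc": reach * profile.p_sdc,
        "detected": profile.p_detected ** attempts,  # every attempt failed visibly
    }


# Made-up numbers: protection is assumed to turn most SDCs into detected failures.
unprotected = AttemptProfile(p_detected=0.01, p_sdc=0.01)
protected = AttemptProfile(p_detected=0.019, p_sdc=0.001)

for name, profile in [("unprotected", unprotected), ("protected", protected)]:
    for k in (0, 1):
        print(name, "re-executions =", k, task_outcome(profile, k))
```

Under this model, re-execution shrinks the probability of a detected failure but leaves SDC essentially untouched, so a schedule that only counts deadline completion would overestimate reliability. Combining re-execution with a protection that converts SDCs into detected failures addresses both failure classes, at the price of longer execution time that the scheduler has to account for.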

[1] Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., & Keckler, S. W. (2017, November). Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-12).
[2] Reis, G. A., Chang, J., Vachharajani, N., Rangan, R., & August, D. I. (2005, March). SWIFT: Software implemented fault tolerance. In International Symposium on Code Generation and Optimization. IEEE.
[3] So, H., Didehban, M., Jung, J., Shrivastava, A., & Lee, K. (2021, February). CHITIN: A comprehensive in-thread instruction replication technique against transient faults. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1440-1445). IEEE.
[4] Huang, S. Y., Zeng, J., Deng, X., Wang, S., Sifat, A., Bharmal, B., … & Jung, C. (2023, December). RTailor: Parameterizing Soft Error Resilience for Mixed-Criticality Real-Time Systems. In 2023 IEEE Real-Time Systems Symposium (RTSS) (pp. 344-357). IEEE.