Introduction
Challenge:
Continuous technology scaling allows more transistors to fit in an integrated circuit. With every new processor generation, transistors shrink in size and operate with a lower threshold voltage and a narrower noise margin than their predecessors. As a result of this aggressive scaling, devices are becoming increasingly susceptible to soft errors. Soft errors, or transient faults, are typically caused by cosmic particle strikes on transistors and can flip the logic value a transistor holds. Without countermeasures against soft errors, even an application running on otherwise flawless hardware can malfunction unexpectedly (e.g., an object identification model that misidentifies a truck as a bird [1]). Traditional hardware-level reliability solutions incur the cost of additional hardware and are difficult to apply to general-purpose hardware that has already been deployed. Our research aims to develop reliability solutions that require no hardware modification by implementing redundancy at the software level.


Instruction Replication Solutions for Common Applications
A main strategy to mitigate soft errors is redundancy: if a soft error corrupts one of the redundant instances, the system can detect the fault by comparing the instances. Instruction replication solutions achieve this redundancy by dividing the registers into original and shadow (redundant) registers and replicating assembly instructions so that each copy operates on its own register set. Instead of checking every pair of original and redundant instructions, such solutions insert software-level fault detection code that compares original and shadow registers at critical operations such as store and control-flow instructions. Our research focuses on finding the vulnerable points of state-of-the-art instruction replication solutions and resolving those vulnerabilities with improved replication and checking methodologies.
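The replicate-then-check idea can be sketched in Python as a rough software model. This is only an illustration under stated assumptions: real instruction replication works on assembly instructions and hardware registers, and the names `flip_bit`, `checked_store`, `replicated_sum`, and `SoftErrorDetected` are hypothetical, not taken from any of the cited tools.

```python
from typing import Optional


class SoftErrorDetected(Exception):
    """Raised when an original value and its shadow copy disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault by flipping one bit of an integer value."""
    return value ^ (1 << bit)


def checked_store(memory: dict, addr: int, value: int, shadow: int) -> None:
    """Compare the original and shadow copies before the store commits."""
    if value != shadow:
        raise SoftErrorDetected(f"mismatch before store to address {addr}")
    memory[addr] = value


def replicated_sum(a: int, b: int, memory: dict, addr: int,
                   fault_bit: Optional[int] = None) -> None:
    """Compute a + b in original and shadow 'registers', then store with a check."""
    r1, r1_shadow = a, a                  # operands held in both register sets
    r2, r2_shadow = b, b
    r3 = r1 + r2                          # original instruction
    r3_shadow = r1_shadow + r2_shadow     # replicated instruction
    if fault_bit is not None:
        r3 = flip_bit(r3, fault_bit)      # inject a fault into the original copy only
    checked_store(memory, addr, r3, r3_shadow)  # check only at the critical operation
```

Note that the addition itself is never checked; the comparison happens only at the store, which is where a corrupted value would otherwise escape into memory.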


Reliability Enhancement for Machine Learning
With the surge of deep neural networks (DNNs), machine learning plays a key role in most modern computing, including safety-critical applications such as autonomous driving. Neural networks have long been considered inherently robust against faults because of their distributed structure and intrinsic redundancy. Still, a recent study [4] found that unprotected neural networks cannot satisfy strict reliability standards. In addition, the reliability of neural networks in real-world environments is threatened by security attacks such as adversarial attacks and by unexpected inputs such as out-of-distribution (OOD) data. Our research on machine learning includes efficient soft error mitigation solutions for neural networks and holistic detection solutions for faults, OOD inputs, and adversarial attacks.
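To make the fault model concrete, the following sketch flips a single bit in a 32-bit floating-point weight and compares the output of a toy neuron. It illustrates only why a single transient fault can matter; it is not the mitigation or detection method described above. The helper names `flip_float32_bit` and `neuron` and the example weights are assumptions chosen for illustration.

```python
import struct


def flip_float32_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a value's IEEE 754 binary32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


def neuron(weights, inputs):
    """Weighted sum of the inputs, with no activation function."""
    return sum(w * x for w, x in zip(weights, inputs))


weights = [0.5, -0.25]
inputs = [1.0, 2.0]

clean = neuron(weights, inputs)
# Fault in the lowest mantissa bit of the first weight: a tiny perturbation.
low = neuron([flip_float32_bit(weights[0], 0), weights[1]], inputs)
# Fault in the highest exponent bit of the first weight: the output explodes.
high = neuron([flip_float32_bit(weights[0], 30), weights[1]], inputs)
```

In this toy case, `low` stays within about 1e-7 of `clean`, while `high` exceeds 1e38, so the position of the flipped bit largely determines whether the fault is masked or dominates the result.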

mitigate faults with additional algorithmic computation.

Fault-aware Scheduling for Mixed-Criticality System (MCS)
Scheduling in a mixed-criticality system (MCS) is judged successful when the scheduled tasks complete within their deadlines. However, in the presence of soft errors, a scheduled task can end in a system-visible failure such as a crash, or in system-invisible silent data corruption (SDC), in which the task completes normally but produces incorrect output. Our research aims to improve the accuracy of failure rate analysis for MCS tasks by applying the failure classification used in the reliability research domain, and to provide suitable re-execution strategies for tasks with or without protection, based on our proposed instruction replication solutions and task-level failure analysis.
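The distinction between system-visible failures and SDC can be sketched as follows. This is a simplified model under stated assumptions, not the analysis proposed here: a task is a Python callable, a crash is modeled as an exception, and a known-correct "golden" output is available for classification. The names `Outcome`, `classify`, and `run_with_reexecution` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    SYSTEM_VISIBLE = "system-visible failure"
    SDC = "silent data corruption"


def classify(task, golden):
    """Classify one execution of a task against a known-correct output."""
    try:
        result = task()
    except Exception:
        return Outcome.SYSTEM_VISIBLE     # crash: the system can observe it
    return Outcome.CORRECT if result == golden else Outcome.SDC


def run_with_reexecution(task, max_attempts):
    """Re-execute a task after a system-visible failure.

    Returns (result, attempts used). An SDC is returned as if it were correct,
    because nothing signals that the output is wrong.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            continue                      # visible failure: try again
    raise RuntimeError("task failed on every attempt")
```

The sketch shows why the classification matters for scheduling: re-execution only helps with failures the system can see, so a task without protection may meet its deadline and still deliver a wrong result, whereas a protected task turns some SDCs into detectable failures at the cost of extra execution time.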


[1] Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., & Keckler, S. W. (2017, November). Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-12).
[2] Reis, G. A., Chang, J., Vachharajani, N., Rangan, R., & August, D. I. (2005, March). SWIFT: Software implemented fault tolerance. In International Symposium on Code Generation and Optimization. IEEE.
[3] So, H., Didehban, M., Jung, J., Shrivastava, A., & Lee, K. (2021, February). CHITIN: A comprehensive in-thread instruction replication technique against transient faults. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1440-1445). IEEE.
[4] He, Y., Balaprakash, P., & Li, Y. (2020, October). Fidelity: Efficient resilience analysis framework for deep learning accelerators. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (pp. 270-281). IEEE.
[5] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[6] Huang, S. Y., Zeng, J., Deng, X., Wang, S., Sifat, A., Bharmal, B., … & Jung, C. (2023, December). RTailor: Parameterizing Soft Error Resilience for Mixed-Criticality Real-Time Systems. In 2023 IEEE Real-Time Systems Symposium (RTSS) (pp. 344-357). IEEE.