{"id":7195,"date":"2025-05-30T15:39:05","date_gmt":"2025-05-30T22:39:05","guid":{"rendered":"https:\/\/labs.engineering.asu.edu\/mps-lab\/?page_id=7195"},"modified":"2025-05-30T17:35:12","modified_gmt":"2025-05-31T00:35:12","slug":"software-level-solutions-against-hardware-faults","status":"publish","type":"page","link":"https:\/\/labs.engineering.asu.edu\/mps-lab\/software-level-solutions-against-hardware-faults\/","title":{"rendered":"Software-level Solutions Against Hardware Faults"},"content":{"rendered":"\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<div class=\"wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-ad2f72ca wp-block-group-is-layout-flex\">\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link has-asu-maroon-background-color has-background has-small-font-size has-custom-font-size wp-element-button\" href=\"https:\/\/docs.google.com\/document\/d\/1NOoZxUBtLrMPTwM_sFFKDbVnmguQH8y-ESd4fVlYV_Q\/edit#heading=h.yeepu7u2h8b\">Reading List<\/a><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link has-asu-maroon-background-color has-background has-small-font-size has-custom-font-size wp-element-button\" href=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/publications\/?tgid=5\">Publications<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\"><\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Introduction<\/h3>\n\n\n\n<h4 
class=\"wp-block-heading\"><strong>Challenge:<\/strong><\/h4>\n\n\n\n<p>Continuous technology scaling allows more transistors to fit in an integrated circuit. With every new generation of processors, transistors shrink in size and operate with a lower threshold voltage and a narrower noise margin than their predecessors. As a result of this aggressive scaling of transistor feature size, devices are becoming increasingly susceptible to soft errors. Soft errors, or transient faults, are typically caused by cosmic particle strikes on transistors and can change the logic value held in the transistor. Without countermeasures against soft errors, an application running on otherwise fault-free hardware can malfunction unexpectedly (e.g., object identification that misidentifies a truck as a bird<sup>1<\/sup>). Traditional hardware-level reliability solutions against soft errors incur the cost of additional hardware and are difficult to apply to general-purpose hardware that has already been deployed. 
Our research aims to develop software-level reliability solutions without hardware modification, by implementing software-level redundancy.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1-1024x461.png\" alt=\"\" class=\"wp-image-6681\" srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-1.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A soft error  is a bit-flip in a transistor induced by external sources.<\/figcaption><\/figure>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2-1024x461.png\" alt=\"\" class=\"wp-image-6683\" style=\"width:484px;height:auto\" 
srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-2.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A soft error can result in a malfunction of an application.<\/figcaption><\/figure>\n<\/div><\/div>\n<\/div>\n\n\n\n<p><strong>Instruction Replication Solutions for Common Applications<\/strong><\/p>\n\n\n\n<p>A main strategy for mitigating soft errors is redundancy: if a soft error corrupts one of the redundant instances, the system can detect the fault by comparing the instances. Instruction replication solutions achieve this redundancy by dividing the registers into original and shadow (redundant) registers and replicating assembly instructions so that the copies operate on the shadow registers. Instead of checking every pair of original and redundant instructions, such solutions insert software-level fault detection code that compares the original and shadow registers before critical operations such as store and control-flow instructions. 
Our research focuses on finding vulnerable points of state-of-the-art instruction replication solutions and resolving such vulnerabilities with improved replication and checking methodologies.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3-1024x461.png\" alt=\"\" class=\"wp-image-6705\" srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2022\/05\/reliability-figure-3.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Instruction replication schemes such as SWIFT<sup>2<\/sup> replicate computations and detect mismatches before critical operations.<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4-1024x461.png\" alt=\"\" class=\"wp-image-6706\" 
srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-4.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">CHITIN<sup>3<\/sup>, our proposed instruction replication scheme, resolves the vulnerability of the control-flow protection in the SWIFT solution.<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p><strong>Fault-aware Scheduling for Mixed-Criticality Systems (MCS)<\/strong><\/p>\n\n\n\n<p>Scheduling in a mixed-criticality system (MCS) judges success by whether the scheduled tasks complete within their deadlines. However, in the presence of soft errors, a scheduled task can end in a system-visible failure such as a crash, or in system-invisible silent data corruption (SDC), in which the task completes normally but produces incorrect output. 
Our research aims to improve the accuracy of failure rate analysis for tasks in an MCS by applying the failure classification used in the reliability research domain. It also aims to provide suitable re-execution strategies for tasks with or without protection, based on our proposed instruction replication solutions and task-level failure analysis.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7-1024x461.png\" alt=\"\" class=\"wp-image-6724\" srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-7.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Failure classification in the reliability research domain can categorize the results of soft errors in the tasks for scheduling.<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"461\" 
src=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8-1024x461.png\" alt=\"\" class=\"wp-image-6723\" srcset=\"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8-1024x461.png 1024w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8-300x135.png 300w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8-768x346.png 768w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8-1536x691.png 1536w, https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-content\/uploads\/sites\/8\/2024\/08\/reliability-figure-8.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Mixed-criticality system scheduling can apply different levels of protection and re-execution<sup>4<\/sup> based on the requirements of tasks.<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p><sup>1<\/sup>Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., &amp; Keckler, S. W. (2017, November). Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-12).<br><sup>2<\/sup>Reis, G. A., Chang, J., Vachharajani, N., Rangan, R., &amp; August, D. I. (2005, March). SWIFT: Software implemented fault tolerance. In International Symposium on Code Generation and Optimization. IEEE.<br><sup>3<\/sup>So, H., Didehban, M., Jung, J., Shrivastava, A., &amp; Lee, K. (2021, February). CHITIN: A comprehensive in-thread instruction replication technique against transient faults. In 2021 Design, Automation &amp; Test in Europe Conference &amp; Exhibition (DATE) (pp. 1440-1445). IEEE.<br><sup>4<\/sup>Huang, S. 
Y., Zeng, J., Deng, X., Wang, S., Sifat, A., Bharmal, B., \u2026 &amp; Jung, C. (2023, December). RTailor: Parameterizing Soft Error Resilience for Mixed-Criticality Real-Time Systems. In 2023 IEEE Real-Time Systems Symposium (RTSS) (pp. 344-357). IEEE.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">&nbsp;<\/h4>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Challenge: Continuous technology scaling facilitates having more transistors in an integrated circuit. For every new generation of processors, the transistors are shrinking in size, with a lower threshold voltage and narrower noise margin compared [&hellip;]<\/p>\n","protected":false},"author":92,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-7195","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/pages\/7195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/users\/92"}],"replies":[{"embeddable":true,"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/comments?post=7195"}],"version-history":[{"count":0,"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/pages\/7195\/revisions"}],"wp:attachment":[{"href":"https:\/\/labs.engineering.asu.edu\/mps-lab\/wp-json\/wp\/v2\/media?parent=7195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}