Talks and Poster Presentations (with Proceedings-Entry):

L. Hoang, M. Hanif, M. Shafique:
"TRe-Map: Towards Reducing the Overheads of Fault-Aware Retraining of Deep Neural Networks by Merging Fault Maps";
Talk: 2021 24th Euromicro Conference on Digital System Design, Virtual Conference; 2021-09-01 - 2021-09-03; in: "Proceedings of the 2021 24th Euromicro Conference on Digital System Design", (2021), 434 - 441.



English abstract:
Recently, fault-aware retraining has emerged as a promising approach to improve the error resilience of Deep Neural Networks (DNNs) against manufacturing-induced defects in DNN accelerators. However, state-of-the-art fault-aware training techniques incur a very large retraining overhead because each chip must be retrained for its own unique fault map, which may render them practically infeasible if retraining is done on large datasets. To address this major limitation and improve the practicability of the fault-aware retraining methodology, this work proposes a novel concept of merging fault maps to effectively retrain a DNN for a group of faulty chips in a single fault-aware retraining round. Merging fault maps makes it possible to avoid per-chip retraining and thereby reduces the retraining overhead significantly. However, merging fault maps brings new challenges, such as training divergence (accuracy collapse) if a high number of accumulated faults is injected into the network in the first epoch. To address these challenges, we propose a methodology for effectively merging fault maps and then retraining DNNs. Experimental results show that our methodology offers at least 1.4x retraining speedup on average while improving the error resilience of the network (depending on the DNN models and the number of merged fault maps). For example, for the ResNet-32 model using a fault map generated from 5 fault maps at the fault rate 6e-3, our methodology offers 2x retraining speedup and a 0.6% classification accuracy drop against per-chip retraining.
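
The abstract does not spell out how fault maps are merged, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a fault map is a boolean mask over weight locations, that faults are independent and uniformly distributed, and that merging means taking the union (logical OR) of the per-chip masks. All function names are hypothetical. Under these assumptions the sketch shows why a merged map accumulates faults: the union of k maps at fault rate p has an expected fault rate of 1 - (1 - p)^k, which is close to k times p when p is small.

```python
import random


def random_fault_map(num_weights, fault_rate, rng):
    """Hypothetical per-chip fault map: True marks a faulty weight location."""
    return [rng.random() < fault_rate for _ in range(num_weights)]


def merge_fault_maps(fault_maps):
    """Assumed merge rule: a location is faulty if it is faulty on any chip."""
    return [any(bits) for bits in zip(*fault_maps)]


def expected_merged_rate(fault_rate, num_maps):
    """Expected fault rate of the union of independent, uniform fault maps."""
    return 1.0 - (1.0 - fault_rate) ** num_maps


# Numbers taken from the abstract's example: 5 fault maps at fault rate 6e-3.
# The number of weight locations (100_000) is an arbitrary illustrative value.
rng = random.Random(0)
maps = [random_fault_map(100_000, 6e-3, rng) for _ in range(5)]
merged = merge_fault_maps(maps)
merged_rate = sum(merged) / len(merged)

print(f"per-chip fault rate:  {6e-3:.4f}")
print(f"expected merged rate: {expected_merged_rate(6e-3, 5):.4f}")
print(f"observed merged rate: {merged_rate:.4f}")
```

Under the union assumption, the merged map covers every fault of every chip in the group, which is what would let one retraining round serve the whole group. The price is a fault rate several times higher than any single chip's, consistent with the accumulated-fault challenge the abstract mentions.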


"Official" electronic version of the publication (accessed through its Digital Object Identifier - DOI)
http://dx.doi.org/10.1109/DSD53832.2021.00072


Created from the Publication Database of the Vienna University of Technology.