Publications in Scientific Journals:
T. Li, M. Shafique, J. Ambrose, J. Henkel, S. Parameswaran:
"Fine-Grained Checkpoint Recovery for Application-Specific Instruction-Set Processors";
IEEE Transactions on Computers,
Checkpoint recovery (CR) is a classic fault-tolerance technique that enables computing systems to execute correctly even when affected by transient faults. Although a number of software- and hardware-based approaches to CR exist, they are usually too large or too slow, or require extensive modifications to the software and the caching/memory schemes. In this paper, we propose a novel CR approach based on re-engineering the instruction set of a target processor. We take the base instruction set and augment its native micro-operations, described in an architectural description language (ADL), with additional micro-operations that perform checkpointing at the granularity of basic blocks. The recovery mechanism is realized by three custom instructions, which undo the corruption caused by transient faults during instruction execution by restoring incorrectly modified values in general-purpose registers, data memory, and special-purpose registers (PC, status registers, etc.). The checkpoint storage is sized according to the application program being executed. The experimental results show that our approach degrades system performance by just 0.76 percent when there is no fault, and introduces an area overhead of 44 percent on average and 79 percent in the worst case. In fault-injection tests with the benchmark applications, recovery took at most 62 clock cycles.
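The basic-block checkpointing idea summarized in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware design or its custom instructions: it assumes a simple undo log that records the old value of every register or memory location written since the start of the current basic block, and all names (`ToyCore`, `begin_block`, `write_reg`, `write_mem`, `rollback`) are hypothetical.

```python
from dataclasses import dataclass, field

# Sentinel marking a memory address that held no value before a write.
_ABSENT = object()


@dataclass
class ToyCore:
    """Toy processor state; every name here is a hypothetical illustration."""
    regs: list = field(default_factory=lambda: [0] * 8)
    mem: dict = field(default_factory=dict)
    pc: int = 0
    # Checkpoint storage: PC at block entry plus an undo log of old values.
    saved_pc: int = 0
    undo_log: list = field(default_factory=list)

    def begin_block(self):
        """Start a fresh checkpoint at a basic-block boundary."""
        self.saved_pc = self.pc
        self.undo_log.clear()

    def write_reg(self, idx, value):
        """Write a register, logging its previous value first."""
        self.undo_log.append(("reg", idx, self.regs[idx]))
        self.regs[idx] = value

    def write_mem(self, addr, value):
        """Write data memory, logging its previous value (or absence) first."""
        self.undo_log.append(("mem", addr, self.mem.get(addr, _ABSENT)))
        self.mem[addr] = value

    def rollback(self):
        """Undo all writes since begin_block, newest first, and restore the PC."""
        while self.undo_log:
            kind, key, old = self.undo_log.pop()
            if kind == "reg":
                self.regs[key] = old
            elif old is _ABSENT:
                self.mem.pop(key, None)
            else:
                self.mem[key] = old
        self.pc = self.saved_pc


# Usage: a block whose results are assumed to be corrupted by a transient fault.
core = ToyCore()
core.write_reg(1, 5)
core.pc = 4
core.begin_block()        # checkpoint at the start of the basic block
core.write_reg(1, 99)     # writes inside the block are logged
core.write_mem(0x10, 7)
core.pc = 12
core.rollback()           # fault detected: restore registers, memory, and PC
```

In this analogy the undo log only ever holds the writes of one basic block, which loosely mirrors the abstract's point that checkpoint storage can be sized for the application being executed.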
ASIP, checkpoint recovery, reliability