Article ID Journal Published Year Pages File Type
10332464 Journal of Computational Science 2013 8 Pages PDF
Abstract
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results cannot be trusted any more. A well known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR) - a traditional general technique to correct soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during the program execution and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e. less than 1%) performance penalty over the ATLAS dgemm().
Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , , , , ,