Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
10332464 | Journal of Computational Science | 2013 | 8 Pages | |
Abstract
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. They normally do not interrupt the execution of the affected program, but the results of the affected computation can no longer be trusted. A well-known technique for correcting soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). Although ABFT is much more efficient than triple modular redundancy (TMR), a traditional general-purpose technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation has finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation, during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
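To make the idea in the abstract concrete, here is a minimal sketch of checksum-based ABFT for matrix multiplication with on-line checking. It is not the paper's implementation: it assumes the classic row/column checksum encoding, a plain outer-product loop in pure Python, at most one corrupted data entry between two checks, and a fixed absolute tolerance. The function names, the `check_every` parameter, and the `fault` injection tuple are all hypothetical and exist only for illustration.

```python
def encode_column_checksum(a):
    """Return a copy of a with one extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def encode_row_checksum(b):
    """Return a copy of b with one extra column holding each row's sum."""
    return [row[:] + [sum(row)] for row in b]


def find_and_fix(c, tol=1e-9):
    """Locate and correct a single corrupted data entry of the
    full-checksum matrix c, in place. Returns (i, j) or None.

    Assumption: only a data entry (not a checksum entry) is corrupted.
    Real code would scale tol to the expected rounding error.
    """
    m, n = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c[i][:n]) - c[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c[i][j] for i in range(m)) - c[m][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other entries.
        c[i][j] = c[i][n] - sum(c[i][k] for k in range(n) if k != j)
        return (i, j)
    return None


def abft_matmul_online(a, b, check_every=1, fault=None):
    """Multiply a by b as a sum of outer products, verifying checksums
    every check_every steps instead of only once at the end.

    fault is a hypothetical injection hook (step, row, col, delta)
    that simulates one soft error in the partial result.
    Returns (product, fixes), where fixes lists (step, (row, col)).
    """
    ac = encode_column_checksum(a)      # (m+1) x k
    br = encode_row_checksum(b)         # k x (n+1)
    rows, inner, cols = len(ac), len(br), len(br[0])
    c = [[0.0] * cols for _ in range(rows)]
    fixes = []
    for s in range(inner):
        # Rank-1 update: the partial sum remains a full-checksum matrix.
        for i in range(rows):
            for j in range(cols):
                c[i][j] += ac[i][s] * br[s][j]
        if fault is not None and fault[0] == s:
            _, fi, fj, delta = fault
            c[fi][fj] += delta          # simulated soft error
        if (s + 1) % check_every == 0 or s == inner - 1:
            hit = find_and_fix(c)
            if hit is not None:
                fixes.append((s, hit))
    # Strip the checksum row and column before returning.
    return [row[:-1] for row in c[:-1]], fixes
```

The sketch relies on one property: each outer product of a column-checksum vector and a row-checksum vector is itself a full-checksum matrix, so the checksum relationship holds after every step and can be tested mid-computation. Calling `abft_matmul_online(a, b, fault=(0, 1, 0, 100.0))` corrupts one entry after the first step. The check that follows should locate and repair it before the error propagates into later updates.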
Related Topics
Physical Sciences and Engineering
Computer Science
Computational Theory and Mathematics
Authors
Panruo Wu, Chong Ding, Longxiang Chen, Teresa Davies, Christer Karlsson, Zizhong Chen