Article code: 524301
Journal code: 868593
Publication year: 2015
English article: 16 pages, PDF
English title of the ISI article
GS-DMR: Low-overhead soft error detection scheme for stencil-based computation
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Science Software
English abstract


• We improve dual modular redundancy (DMR), a soft error detection scheme.
• We propose a grid sampling error detection scheme for stencil-based computation.
• We analyze the soft error propagation pattern through the stencil operations.
• We demonstrate our scheme with experiments on the Tianhe-2 supercomputer.

Soft errors are becoming a prominent problem for massively parallel scientific applications. Dual modular redundancy (DMR) can provide approximately 100% error coverage, but its overhead is excessive. The stencil kernel is one of the most important routines applied in the context of structured grids. In this paper, we propose Grid Sampling DMR (GS-DMR), a low-overhead soft error detection scheme for stencil-based computation. Instead of comparing the whole set of results as in traditional DMR, GS-DMR compares only a subset of the results, chosen by sampling the grid data according to the error propagation pattern on the grid. We also design a fault-tolerant (FT) framework that combines GS-DMR with checkpointing, and provide a theoretical analysis and an algorithm for the optimal FT parameters. Experimental results on the Tianhe-2 supercomputer demonstrate that GS-DMR achieves a good FT effect for stencil-based computation, and that the benefit grows substantially for massively parallel applications, reducing the total FT overhead by up to 51%.
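
The idea of comparing only sampled grid points can be illustrated with a toy example. The sketch below is not the algorithm from the paper; it rests on assumptions of our own: a 1D three-point averaging stencil with fixed boundaries, a soft error modeled as an additive perturbation in one replica, and a sampling stride of 2k + 1 after k sweeps, chosen because this stencil spreads an error by one grid point per sweep in each direction. The names `stencil_sweep`, `simulate`, and `sampled_mismatch` are hypothetical.

```python
def stencil_sweep(u):
    """One sweep of a 3-point averaging stencil; boundary values stay fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def simulate(n, sweeps, fault_at=None):
    """Run two replicas of the same computation (the DMR part).

    If fault_at is given, one value of the replica is perturbed before
    the first sweep to imitate a soft error (an assumption of this sketch).
    """
    primary = [float(i % 7) for i in range(n)]
    replica = list(primary)
    if fault_at is not None:
        replica[fault_at] += 1.0  # simulated soft error
    for _ in range(sweeps):
        primary = stencil_sweep(primary)
        replica = stencil_sweep(replica)
    return primary, replica


def sampled_mismatch(a, b, stride):
    """Compare the replicas only at every stride-th grid point."""
    return [i for i in range(0, len(a), stride) if a[i] != b[i]]


sweeps = 4
primary, replica = simulate(n=64, sweeps=sweeps, fault_at=21)
# After k sweeps the error covers 2k + 1 consecutive points, so a stride
# of 2k + 1 always places at least one sample inside that region.
stride = 2 * sweeps + 1
print(sampled_mismatch(primary, replica, stride))
```

With these settings the sampled comparison inspects 8 of the 64 grid points and still flags the error (at index 18), whereas a full DMR comparison would inspect all 64. In a real stencil code the safe stride depends on the stencil radius, the grid dimensionality, and how many sweeps pass between comparisons.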

Publisher
Database: Elsevier - ScienceDirect
Journal: Parallel Computing - Volume 41, January 2015, Pages 50–65