Article code: 4951652
Journal code: 1441481
Publication year: 2017
English article: 12-page PDF
Full-text version: Free download
English title of the ISI article
Fault tolerant communication-optimal 2.5D matrix multiplication
Related topics
Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics
English abstract
In future computing systems, handling faults efficiently at the algorithmic level is expected to become more and more important. In this paper, we illustrate that in practice classical algorithm-based fault tolerance (ABFT) cannot protect all exponent bits of a floating-point number. Consequently, we extend the method to recover from bit-flips in all positions without additional overhead. We also derive fault detection conditions suitable for multiple checksum encoding vectors. Moreover, we show how to efficiently employ ABFT to protect communication-optimal parallel 2.5D matrix multiplication against bit-flips occurring silently during the computation. Furthermore, we show that for very low fault rates the overhead of fault tolerance in the context of the 2.5D matrix multiplication algorithms can be reduced even further. Numerical experiments on a high-performance cluster illustrate the high scalability and low overhead of our algorithms. We demonstrate the fault tolerance of our approach with randomly and asynchronously injected bit-flips and illustrate that our method can also handle bit-flips occurring at high frequencies. Like in classical ABFT, the overhead per correctable bit-flip of our approach decreases with increasing error rate.
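For readers unfamiliar with the classical ABFT that the abstract takes as its starting point, the following is a minimal sketch of checksum-encoded matrix multiplication with a single all-ones encoding vector. It is not the paper's extended method and not the 2.5D parallel algorithm. The function names (`matmul`, `add_column_checksum`, `add_row_checksum`, `locate_and_correct`) are hypothetical, and the example uses exact integers, so it sidesteps the floating-point exponent-bit issue the abstract raises. With floating-point data the equality tests would have to become tolerance checks.

```python
# Classical checksum-based ABFT on small integer matrices (illustrative only).
# Assumption: at most one entry of the data part of the product is corrupted.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(cf):
    """Find and repair one corrupted data entry of a full-checksum matrix.

    Returns the (row, column) that was repaired, or None if all checksums agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        others = sum(cf[i][:m]) - cf[i][j]   # sum of the intact entries in row i
        cf[i][j] = cf[i][m] - others         # rebuild the entry from the row checksum
        return (i, j)
    raise ValueError("error pattern not correctable with a single checksum")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The product of the encoded operands carries row and column checksums of a*b.
cf = matmul(add_column_checksum(a), add_row_checksum(b))
cf[1][0] ^= 1 << 4                           # simulate a silent bit-flip
fixed_at = locate_and_correct(cf)
c = [row[:-1] for row in cf[:-1]]            # strip the checksum row and column
```

The intersection of the failing row check and the failing column check locates the corrupted entry, and the row checksum supplies the value needed to rebuild it. A single checksum vector handles only one error per row and column, which is one reason to consider multiple encoding vectors, as the abstract does.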
Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 104, June 2017, Pages 179-190