Article code: 523772
Journal code: 868491
Publication year: 2015
English article: 19 pages, PDF
Full-text version: free download
English title of the ISI article
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract


• Fault-tolerant and robust multigrid methods.
• Hierarchical finite element compression.
• Asynchronous checkpointing with local restart.

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
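
To make the idea of hierarchy-based checkpoint compression concrete, here is a minimal 1D sketch. It is not the paper's implementation: the abstract does not specify the transfer operators, so this sketch assumes injection for restriction and linear interpolation for prolongation, and every name in it (`restrict`, `prolong`, `compress`, `decompress`, `lost`) is hypothetical. The checkpoint stores the solution on a grid two levels coarser, and after a simulated data loss only the lost entries are rebuilt from it, in the spirit of local failure local recovery.

```python
import math

def restrict(fine):
    """Compress a 1D nodal vector by injection: keep every second value (assumed operator)."""
    return fine[::2]

def prolong(coarse):
    """Expand a coarse vector to the next finer grid by linear interpolation."""
    fine = []
    for left, right in zip(coarse, coarse[1:]):
        fine.append(left)
        fine.append(0.5 * (left + right))
    fine.append(coarse[-1])
    return fine

def compress(vec, levels):
    """Apply restriction `levels` times to obtain a small checkpoint."""
    for _ in range(levels):
        vec = restrict(vec)
    return vec

def decompress(vec, levels):
    """Apply prolongation `levels` times to return to the fine grid."""
    for _ in range(levels):
        vec = prolong(vec)
    return vec

# Smooth fine-grid data on 2**6 + 1 = 65 nodes of the unit interval.
n = 2**6 + 1
u = [math.sin(math.pi * i / (n - 1)) for i in range(n)]

# Checkpoint two levels down: 65 -> 33 -> 17 stored values.
checkpoint = compress(u, levels=2)

# Simulate losing a contiguous block of entries (a hypothetical failed subdomain).
lost = range(20, 40)
damaged = list(u)
for i in lost:
    damaged[i] = 0.0

# Local recovery: overwrite only the lost entries with the decompressed checkpoint.
restored = decompress(checkpoint, levels=2)
recovered = list(damaged)
for i in lost:
    recovered[i] = restored[i]

max_error = max(abs(a - b) for a, b in zip(u, recovered))
```

For smooth data such as this sine profile the recovered block differs from the original only by the interpolation error of the coarse grid, which is why a compressed checkpoint can be a usable restart point; rough or oscillatory data would lose more under the same compression.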

Publisher
Database: Elsevier - ScienceDirect
Journal: Parallel Computing - Volume 49, November 2015, Pages 117–135
Authors
, , ,