Article code: 425841
Journal code: 685931
Publication year: 2015
English article: 13-page PDF
Full-text version: Free download
English title of the ISI article
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing
Persian translation of the title
در تأثیر تکرار پردازش بر اعدام برنامه های کاربردی موازی بزرگ با کنترل بازرسی هماهنگ شده
Keywords
Fault tolerance; Parallel computing; Checkpoint–rollback–recovery; Process replication
Related topics
Engineering and Basic Sciences; Computer Engineering; Computational Theory and Mathematics
English abstract


• Process replication combined with checkpoint–rollback–recovery.
• Exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption for Exponential failure distributions.
• Closed-form expression for the Mean Time To Interruption for Weibull distributions.
• Scenarios where replication is beneficial.

Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint–rollback–recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint–rollback–recovery, has been recently advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replication in simulation using both synthetic and real-world failure traces so as to quantify average application makespan. One interesting result from these experiments is that, when process replication is used, application performance is not sensitive to the checkpointing period, provided that period is within a large neighborhood of the optimal period. More generally, our empirical results make it possible to identify regimes in which process replication is beneficial.
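As a rough illustration of the Mean Number of Failures To Interruption mentioned in the abstract, the sketch below estimates it by Monte Carlo simulation rather than with the paper's exact formulas. It rests on assumptions that the abstract does not state: each application process has exactly two replicas, each failure strikes one still-running processor chosen uniformly at random, and the application is interrupted as soon as both replicas of the same process have failed. The function names `failures_to_interruption` and `estimate_mnfti` are hypothetical.

```python
import random


def failures_to_interruption(n_pairs, rng):
    """Count failures until both replicas of some process are dead.

    Assumption: processor i hosts a replica of process i // 2, and each
    failure hits a uniformly chosen processor that is still alive.
    """
    alive = list(range(2 * n_pairs))
    damaged = set()  # processes that have already lost one replica
    count = 0
    while True:
        # Pick a random alive processor and remove it in O(1).
        idx = rng.randrange(len(alive))
        alive[idx], alive[-1] = alive[-1], alive[idx]
        proc = alive.pop()
        count += 1
        process = proc // 2
        if process in damaged:
            # Second replica of the same process failed: interruption.
            return count
        damaged.add(process)


def estimate_mnfti(n_pairs, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean number of failures to interruption."""
    rng = random.Random(seed)
    total = sum(failures_to_interruption(n_pairs, rng) for _ in range(trials))
    return total / trials
```

Under these assumptions the count always lies between 2 and `n_pairs + 1`: the first failure can never interrupt the application, and once every process has lost one replica the next failure must. Calling `estimate_mnfti(n_pairs)` for increasing `n_pairs` gives a feel for how many processor failures a replicated application can absorb on average before it has to roll back to a checkpoint.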

Publisher
Database: Elsevier - ScienceDirect
Journal: Future Generation Computer Systems - Volume 51, October 2015, Pages 7–19