Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
4951621 | 1441476 | 2017 | 13-page PDF | Free download |
English title of the ISI article
DINO: Divergent node cloning for sustained redundancy in HPC
Related topics
Engineering and Basic Sciences
Computer Engineering
Computational Theory and Mathematics
English abstract
This problem is addressed by DIvergent NOde cloning (DINO), a redundant execution environment that quickly recovers from hard failures. DINO consists of a novel node cloning service integrated into the MPI runtime system that solves the problem of consolidating divergent states among replicas on-the-fly. With DINO, after degradation to dual redundancy, a good replica can be quickly cloned so that triple redundancy is restored. We present experimental results over 9 NAS Parallel Benchmarks (NPB), Sweep3D and LULESH. Results confirm the applicability of the approach and the correctness of the recovery process and indicate that DINO can recover from failures nearly instantly. The cloning overhead depends on the process image size that needs to be transferred between the source and destination of the clone operation and varies between 5.60 and 90.48 s. Simulation results with our model show that dual redundancy with DINO recovery always outperforms 2x and surpasses 3x redundancy on up to 1 million nodes. To the best of our knowledge, the design and implementation for repairing failed replicas in redundant MPI computing is unprecedented.
Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 109, November 2017, Pages 350-362
Authors
Arash Rezaei, Frank Mueller, Paul Hargrove, Eric Roman