Proactive process-level live migration and back migration in HPC environments

Article code	Journal code	Publication year	English article	Full-text version
432796	689073	2012	14-page PDF	Free download

Keywords

Fault tolerance; High-performance computing; Live migration; Health monitoring

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics

Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1–6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13–24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the greater the benefit of back migration.

► Novel process-level live migration is devised and integrated into an MPI runtime system for proactive fault tolerance.
► Live process migration can be triggered as late as 1–6.5 s prior to actual faults under health monitoring.
► This compares favorably to 13–24 s for live migration under operating system virtualization.
► Combined with checkpoints, the number of checkpoints can be cut in half when 70% of faults are handled proactively.
► A novel back migration approach eliminates load imbalance or bottlenecks caused by migrated tasks.
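
The claim that checkpoints are nearly halved when 70% of faults are handled proactively can be sanity-checked with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint cost × MTBF); the text above does not state which model the authors used, and the checkpoint cost and MTBF values are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoint count with proactive FT divided by the count without it.

    Assumption: faults handled proactively never trigger a restart, so the
    MTBF seen by the reactive scheme grows to mtbf / (1 - proactive_fraction).
    For a fixed run time the checkpoint count scales as 1 / interval, so the
    ratio reduces to sqrt(1 - proactive_fraction), independent of the
    checkpoint cost and of the MTBF itself.
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical inputs, not taken from the paper.
checkpoint_cost_s = 60.0       # time to write one checkpoint
mtbf_s = 6 * 3600.0            # system mean time between failures
proactive_fraction = 0.7       # share of faults predicted and migrated away

base = young_interval(checkpoint_cost_s, mtbf_s)
improved = young_interval(checkpoint_cost_s, mtbf_s / (1.0 - proactive_fraction))
ratio = base / improved        # same value as checkpoint_ratio(0.7)

print(f"interval without proactive FT: {base:.0f} s")
print(f"interval with proactive FT:    {improved:.0f} s")
print(f"checkpoint count ratio:        {ratio:.3f}")
```

Under these assumptions the ratio is sqrt(0.3), about 0.55, which is in line with "nearly cutting the number of checkpoints in half".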

Publisher

Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 72, Issue 2, February 2012, Pages 254–267

Authors

Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott
