BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Article code	Journal code	Publication year	English article	Full text
432749	689058	2013	14-page PDF	Free download

Keywords

Fault tolerance, High performance computing

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend–resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back filesystem changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.

► We explore the challenges of Checkpoint-Restart for HPC applications on IaaS clouds.
► We propose BlobCR, a dedicated checkpoint repository based on virtual disk snapshots.
► We introduce selective copy-on-write, a live VM disk snapshotting approach.
► We provide the algorithmic descriptions and analysis for this approach.
► We experiment both with synthetic benchmarks and a real-life HPC application (CM1).
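
The abstract names incremental disk snapshots persisted asynchronously with "selective copy-on-write" but does not spell out the algorithm. The following is only a minimal sketch of one plausible reading, not the authors' implementation: it assumes that a snapshot queues only the chunks modified since the previous one, and that a later write copies a chunk first only if that chunk is still waiting to be persisted. The class `SnapshottingDisk` and all of its methods and fields are hypothetical names introduced for illustration.

```python
class SnapshottingDisk:
    """Toy chunked disk with incremental snapshots (hypothetical, not BlobCR's API)."""

    def __init__(self, num_chunks):
        self.chunks = [b""] * num_chunks  # live disk contents
        self.dirty = set()                # chunks written since the last snapshot
        self.pending = {}                 # index -> None, or a frozen copy of the chunk
        self.repository = []              # persisted snapshots, each {index: bytes}
        self.cow_copies = 0               # how many copy-on-write copies were made

    def write(self, index, data):
        # Assumption: copy only when the chunk is queued for the snapshot in
        # progress and has not been persisted or copied yet.
        if index in self.pending and self.pending[index] is None:
            self.pending[index] = self.chunks[index]
            self.cow_copies += 1
        self.chunks[index] = data
        self.dirty.add(index)

    def request_snapshot(self):
        # Finish any snapshot still in progress, then queue only the dirty
        # chunks (incremental) and return without persisting anything yet.
        while self.flush_one():
            pass
        self.pending = {i: None for i in self.dirty}
        self.dirty = set()
        self.repository.append({})

    def flush_one(self):
        # One step of the background persister: store a single pending chunk,
        # using the frozen copy if the application overwrote it meanwhile.
        if not self.pending:
            return False
        index, frozen = self.pending.popitem()
        self.repository[-1][index] = self.chunks[index] if frozen is None else frozen
        return True
```

In this sketch a write that lands on a chunk outside the in-flight snapshot, or on one already persisted, goes straight to the live disk, which is where the reduced overhead would come from under the stated assumption.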

Publisher

Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 5, May 2013, Pages 698–711

Authors

Bogdan Nicolae, Franck Cappello
