کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
432481 688911 2010 10 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Failure-aware resource management for high-availability computing clusters with distributed virtual machines
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
پیش نمایش صفحه اول مقاله
Failure-aware resource management for high-availability computing clusters with distributed virtual machines
چکیده انگلیسی

In large-scale networked computing systems, component failures become norms instead of exceptions. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to efficiently utilize system resources for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage the proactive failure management techniques in calculating nodes’ reliability states. We consider both the performance and reliability status of compute nodes in making selection decisions. We define a capacity–reliability metric to combine the effects of both factors in node selection, and propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show the enhancement of system productivity by using the proposed strategies with practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Parallel and Distributed Computing - Volume 70, Issue 4, April 2010, Pages 384–393
نویسندگان
,