Article code: 432660
Journal code: 689006
Year of publication: 2016
English article: 8 pages, PDF
Full-text version: free download
English title of the ISI article
A failure index for HPC applications
Keywords
Failure Index (FI); inequality measures; high performance computing; resilience; system volatility; adequate performance level
Related topics
Engineering and Basic Sciences; Computer Engineering; Computational Theory and Mathematics
Abstract


• A novel metric called Failure Index (FI) that can be used in the evaluation of High Performance Computing failure or resilience information.
• Modeling resource allocation schemes leveraging batches and queues and a history of successful and unsuccessful jobs.
• Index estimates used to construct a reliability-aware metascheduler tasked with processing incoming jobs.

This paper examines log files originating from High Performance Computing (HPC) applications with known reliability problems. The results of this study further the maturation and adoption of meaningful metrics representing HPC system and application failure characteristics. Quantifiable metrics representing the reliability of HPC applications are foundational for building an application resilience methodology, which is critical to the realization of exascale supercomputing. In this examination, statistical inequality methods originating from the study of economics are applied to health and status information contained in HPC application log files. The main result is the derivation of a new failure index metric for HPC: a normalized representation of parallel application volatility and/or resiliency that complements existing reliability metrics such as mean time between failures (MTBF) and aims to give a better presentation of HPC application resilience. This paper provides an introduction to a Failure Index (FI) for HPC reliability and takes the reader through a use case wherein the FI is used to expose various run-time fluctuations in the failure rate of applications running on a collection of HPC platforms.
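
The abstract does not give the FI formula, so the following is only a rough sketch of the general idea, not the paper's method. It assumes the Gini coefficient as one well-known inequality measure from economics and uses made-up per-run failure counts. It shows how two failure histories can share the same MTBF while differing sharply in how unevenly the failures are distributed, which is the kind of volatility a normalized index could expose. The function names and data are hypothetical.

```python
from typing import Sequence


def mtbf(uptime_hours: float, failure_count: int) -> float:
    """Mean time between failures: total operating time divided by failure count."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return uptime_hours / failure_count


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0.0 means the values are spread evenly; values closer to 1.0 mean
    they are concentrated in a few entries. This is an assumed stand-in
    for an inequality measure, not the paper's Failure Index.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical failure counts per application run (illustrative data only).
even_failures = [2, 2, 2, 2]     # failures spread evenly across runs
bursty_failures = [0, 0, 0, 8]   # same total, concentrated in one run

uptime = 1000.0  # assumed total operating hours for both histories

print(mtbf(uptime, sum(even_failures)), gini(even_failures))
print(mtbf(uptime, sum(bursty_failures)), gini(bursty_failures))
```

Both histories yield the same MTBF because the total failure count and uptime match, yet the inequality measure separates the steady case from the bursty one.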

Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volumes 93–94, July 2016, Pages 146–153