An analysis of clustered failures on large supercomputing systems

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
432101	688708	2009	14 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

supercomputer - ابر رایانه Semi-Markov process - روند نیمه مارکوف Reliability modeling - مدل سازی قابلیت اطمینان

موضوعات مرتبط

مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات

پیش نمایش صفحه اول مقاله

An analysis of clustered failures on large supercomputing systems

چکیده انگلیسی

Large supercomputers are built today using thousands of commodity components, and suffer from poor reliability due to frequent component failures. The characteristics of failure observed on large-scale systems differ from smaller scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures for large-scale systems is a critical step in building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing rates of failure on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Parallel and Distributed Computing - Volume 69, Issue 7, July 2009, Pages 652–665

نویسندگان

Thomas J. Hacker, Fabian Romero, Christopher D. Carothers,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

An analysis of clustered failures on large supercomputing systems

دسترسی سریع

ارتباط

English Website