کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
432101 688708 2009 14 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
An analysis of clustered failures on large supercomputing systems
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
پیش نمایش صفحه اول مقاله
An analysis of clustered failures on large supercomputing systems
چکیده انگلیسی

Large supercomputers are built today using thousands of commodity components, and suffer from poor reliability due to frequent component failures. The characteristics of failure observed on large-scale systems differ from smaller scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures for large-scale systems is a critical step in building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing rates of failure on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Parallel and Distributed Computing - Volume 69, Issue 7, July 2009, Pages 652–665
نویسندگان
, , ,