The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

Article code	Journal code	Publication year	English article	Full text
431844	688638	2013	16-page PDF	Free download

Keywords

Resource failures; Statistical analysis; Distributed systems; Failure model

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics

Abstract

• We design a public failure trace archive as a standard format for failure traces.
• We develop a toolbox and a simulator that facilitate comparative trace analysis.
• We present statistical analyses and failure models for several distributed systems.
• We study the significance of differences in the interpretation of failures.

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA)—an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular of the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in various distributed systems. Our analysis presents various statistical insights and typical statistical modeling results for the availability of individual resources in various distributed systems. The analysis results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can result in different conclusions for failure modeling and job scheduling in distributed systems. Our results for different interpretations show evidence that there may be a need for further revisiting existing failure-aware algorithms, when applied for general rather than for domain-specific distributed systems.

Publisher

Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 8, August 2013, Pages 1208–1223

Authors

Bahman Javadi, Derrick Kondo, Alexandru Iosup, Dick Epema
