Article code: 424572
Journal code: 685592
Publication year: 2015
English article: 10 pages, PDF
Full-text version: free download
English title of the ISI article
A hierarchical watchdog mechanism for systemic fault awareness on distributed systems
Persian translation of the title
یک مکانیسم نظارت سلسله مراتبی برای آگاهی از گسل سیستمیک در سیستم های توزیع شده
Keywords
Distributed architecture, system-level fault tolerance, reliable and confidential systems and networks
Related topics
Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics
English abstract


• We approach fault tolerance for distributed systems starting from fault detection and awareness.
• We propose a HW/SW approach based on a mutual watchdog mechanism between Host and NIC.
• A double diagnostic message path leads to resilient systemic fault awareness.
• Our tool can interface with fault reaction/recovery systems to trigger them automatically.
• Our mechanism has no impact on system performance.
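
The mutual watchdog between Host and NIC named in the highlights can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the class names, the timeout value, and the rule that picks which network carries each report are all hypothetical. The only ideas taken from the text are that Host and NIC guard each other and that two diagnostic message paths exist.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Remembers when a peer last signalled that it is alive (hypothetical helper)."""
    timeout: float
    last_seen: float = 0.0

    def kick(self, now: float) -> None:
        # The watched peer refreshes the watchdog with a heartbeat.
        self.last_seen = now

    def expired(self, now: float) -> bool:
        # The peer is suspected faulty once it has been silent for too long.
        return now - self.last_seen > self.timeout


class Node:
    """One node: the Host watches the NIC and the NIC watches the Host."""

    def __init__(self, timeout: float) -> None:
        self.host_watches_nic = Watchdog(timeout)
        self.nic_watches_host = Watchdog(timeout)

    def host_heartbeat(self, now: float) -> None:
        self.nic_watches_host.kick(now)

    def nic_heartbeat(self, now: float) -> None:
        self.host_watches_nic.kick(now)

    def diagnose(self, now: float) -> list:
        """Return (reporter, suspect, path) tuples for every expired watchdog."""
        reports = []
        if self.nic_watches_host.expired(now):
            # Assumption: a live NIC reports a silent Host over the mesh network.
            reports.append(("nic", "host", "mesh network"))
        if self.host_watches_nic.expired(now):
            # Assumption: a live Host reports a silent NIC over the Service Network.
            reports.append(("host", "nic", "service network"))
        return reports


node = Node(timeout=1.0)
node.host_heartbeat(0.0)
node.nic_heartbeat(0.0)
node.nic_heartbeat(2.0)      # the NIC keeps signalling, the Host goes silent
print(node.diagnose(2.5))
```

Because each side can report the other through a different path, the failure of one component does not silence the diagnostic message about it, which is the point of the double path.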

Systemic fault tolerance is usually pursued with a number of strategies, such as redundancy and checkpoint/restart; each of them needs to be triggered by safe and fast fault detection. We devised a hardware/software approach to fault detection that enables system-level Fault Awareness by implementing a hierarchical Mutual Watchdog. It relies on an improved high-performance Network Interface Card (NIC) implementing an n-dimensional mesh topology and a Service Network. The hierarchical watchdog mechanism is able to quickly detect faults on each node, as the Host and the high-performance NIC guard each other while every node monitors its own first neighbours in the mesh. Duplicated and distributed Supervisor Nodes receive diagnostic messages routed through either the Service Network or the n-dimensional Network, then assemble a global picture of the system status. In this way our approach achieves Fault Awareness with no single point of failure. We describe an implementation of this hardware/software co-design for our high-performance 3D torus NIC, with a focus on how routed diagnostic messages do not affect system performance.
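
As a rough illustration of two ideas in the abstract, first-neighbour monitoring on a torus and a Supervisor Node assembling a global status picture, the following sketch may help. The function names, the report format, and the status labels are assumptions made for this example only and do not come from the paper.

```python
from itertools import product


def first_neighbours(coord, dims):
    """Return the first neighbours of `coord` on a torus with side lengths `dims`."""
    neighbours = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wrap around: torus, not open mesh
            neighbours.add(tuple(n))
    neighbours.discard(tuple(coord))            # an axis of size 1 wraps onto the node itself
    return sorted(neighbours)


def global_status(dims, reports):
    """Merge (reporter, suspect) pairs into one status map, as a supervisor might.

    Reports may arrive more than once (for instance over both networks);
    duplicates simply collapse onto the same entry.
    """
    status = {c: "ok" for c in product(*(range(s) for s in dims))}
    for reporter, suspect in reports:
        # Assumption: a node is only trusted to report its own first neighbours.
        if suspect in first_neighbours(reporter, dims):
            status[suspect] = "suspected faulty"
    return status


print(first_neighbours((0, 0, 0), (4, 4, 4)))
status = global_status((2, 2, 2), [((0, 0, 0), (1, 0, 0)), ((0, 0, 0), (1, 0, 0))])
print(status[(1, 0, 0)])
```

On a 3D torus each node has at most six first neighbours, so the monitoring load per node stays constant as the system grows.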

Publisher
Database: Elsevier - ScienceDirect
Journal: Future Generation Computer Systems - Volume 53, December 2015, Pages 90–99