Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
424572 | 685592 | 2015 | 10-page PDF | Free download |
• We approach fault tolerance for distributed systems from the standpoint of fault detection and awareness.
• We propose a HW/SW mechanism based on a mutual watchdog between Host and NIC.
• A double diagnostic message path leads to resilient systemic fault awareness.
• Our tool can interface with fault reaction/recovery systems to trigger them automatically.
• Our mechanism has no impact on system performance.
Systemic fault tolerance is usually pursued with a number of strategies, such as redundancy and checkpoint/restart; each of them needs to be triggered by safe and fast fault detection. We devised a hardware/software approach to fault detection that enables system-level Fault Awareness by implementing a hierarchical Mutual Watchdog. It relies on an improved high performance Network Interface Card (NIC), implementing an n-dimensional mesh topology and a Service Network. The hierarchical watchdog mechanism is able to quickly detect faults on each node, as the Host and the high performance NIC guard each other while every node monitors its own first neighbours in the mesh. Duplicated and distributed Supervisor Nodes receive diagnostic messages routed through either the Service Network or the n-dimensional Network, then assemble a global picture of the system status. In this way our approach achieves Fault Awareness with no single point of failure. We describe an implementation of this hardware/software co-design for our high performance 3D torus NIC, with a focus on how routed diagnostic messages do not affect system performance.
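The mutual watchdog and the double diagnostic path described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `Supervisor`, `report_path` and `deliver`, the timeout value, and the path-selection policy (the Host reports over the Service Network when the NIC is silent, the NIC reports over the mesh when the Host is silent) are all assumptions made for illustration. Monitoring by first neighbours is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node: Host and NIC each check the peer's last heartbeat (hypothetical model)."""
    name: str
    timeout: float = 1.0         # assumed watchdog period, arbitrary time units
    host_last_beat: float = 0.0  # written by the Host, checked by the NIC
    nic_last_beat: float = 0.0   # written by the NIC, checked by the Host

    def diagnose(self, now: float) -> dict:
        """Return the local status as each side sees its peer."""
        return {
            "host_ok": now - self.host_last_beat <= self.timeout,
            "nic_ok": now - self.nic_last_beat <= self.timeout,
        }


def report_path(report: dict):
    """Choose a diagnostic path (assumed policy, not taken from the paper)."""
    if report["host_ok"]:
        return "service"  # Host is alive and can use the Service Network
    if report["nic_ok"]:
        return "mesh"     # Host is silent: the NIC reports through the mesh
    return None           # both silent: only neighbours could notice (not modelled)


@dataclass
class Supervisor:
    """Collects diagnostic messages and keeps a global status picture."""
    status: dict = field(default_factory=dict)

    def receive(self, node_name: str, path: str, report: dict) -> None:
        self.status[node_name] = (path, report)

    def faulty(self) -> list:
        """Names of nodes whose latest report flags a silent Host or NIC."""
        return sorted(
            name for name, (_, report) in self.status.items()
            if not all(report.values())
        )


def deliver(node: Node, now: float, supervisors: list):
    """Send the node's report to every supervisor; return the path used."""
    report = node.diagnose(now)
    path = report_path(report)
    if path is not None:
        for sup in supervisors:  # duplicated supervisors all get a copy
            sup.receive(node.name, path, report)
    return path
```

Calling `deliver` for each node against two `Supervisor` instances leaves both with the same status table, which is the sense in which duplicating the supervisors avoids a single point of failure in this toy model.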
Journal: Future Generation Computer Systems - Volume 53, December 2015, Pages 90–99