Article ID: 4955383
Journal: Computers & Electrical Engineering
Published Year: 2017
Pages: 15
File Type: PDF
Abstract

• Incorporating failure-awareness into the system management stack.
• The concept of a network fault influence domain is proposed.
• The rules for topology-based fault influence analysis are established.

The extremely high performance of supercomputers derives from the coordination of a large number of compute nodes. As a consequence, the communication subsystem significantly affects overall system performance. A single router or link failure in the interconnection network may affect a whole group of tasks, and the rapid growth in system scale makes this problem worse. However, the impact of a network fault is typically highly skewed across different parts of the system. When a fault occurs, there may be a subset of compute nodes among which its influence is negligible. Based on this intuition, we designed FIDA, a network fault influence domain analysis tool that infers which part of the system suffers most severely from the fault. The influence domain reported by FIDA is then delivered to the resource management subsystem as a guideline, so that healthy nodes are allocated preferentially to achieve better performance.
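The abstract does not spell out FIDA's analysis rules, so the following is only an illustrative sketch of the general idea of a topology-based influence domain, not the authors' method. It assumes a hypothetical 6-node ring topology, shortest-path (hop-count) routing, and a made-up scoring rule: a node's score is the number of destinations whose shortest path gets longer or disappears once a given link fails. All function names and the `threshold` parameter are assumptions introduced for illustration.

```python
from collections import deque


def hop_counts(adj, src, failed=frozenset()):
    """BFS hop counts from src, skipping any link listed in `failed`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if frozenset((u, v)) in failed or v in dist:
                continue
            dist[v] = dist[u] + 1
            queue.append(v)
    return dist


def influence_scores(adj, failed_link):
    """Hypothetical score: destinations whose path got longer or vanished."""
    failed = {frozenset(failed_link)}
    scores = {}
    for node in adj:
        before = hop_counts(adj, node)
        after = hop_counts(adj, node, failed)
        scores[node] = sum(
            1
            for dst, d in before.items()
            if dst != node and after.get(dst, float("inf")) > d
        )
    return scores


def influence_domain(adj, failed_link, threshold=1):
    """Nodes whose score reaches the (assumed) threshold."""
    scores = influence_scores(adj, failed_link)
    return {n for n, s in scores.items() if s >= threshold}


# Assumed example topology: a bidirectional ring 0-1-2-3-4-5-0.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

scores = influence_scores(ring, (0, 1))
domain = influence_domain(ring, (0, 1))
healthy = set(ring) - domain  # candidates to allocate preferentially
```

In this toy ring, failing the link between nodes 0 and 1 leaves nodes 3 and 4 outside the influence domain, which mirrors the abstract's observation that some subset of nodes can ignore a given fault. A resource manager could then prefer the `healthy` set when placing new jobs.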


Related Topics
Physical Sciences and Engineering Computer Science Computer Networks and Communications