Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
4955383 | 1444183 | 2017 | 15-page PDF | Free download |
- Failure-awareness is incorporated into the system management stack.
- The concept of network fault influence domain is proposed.
- The rules for topology-based fault influence analysis are established.
The extremely high performance of supercomputers derives from the coordination of a large number of compute nodes. As a consequence, the communication subsystem significantly affects overall system performance. The failure of a single router or link in the interconnection network may affect a whole group of tasks, and the rapid growth of system scale makes this problem worse. However, the impact of a network fault is typically highly skewed across different parts of the system. When a network fault occurs, there may be a subset of compute nodes among which the influence of the fault can be ignored. Based on this intuition, we designed FIDA, a network fault influence domain analysis tool, which infers which part of the system suffers most severely from the fault. The influence domain reported by FIDA is then delivered to the resource management subsystem as a guideline for preferentially allocating healthy nodes to achieve better performance.
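The idea that a fault's impact is skewed across nodes can be illustrated with a toy model. The sketch below is not FIDA's algorithm, which the abstract does not describe. It assumes static shortest-path routing computed on the fault-free topology, and it scores each compute node by the fraction of its routes that cross the failed link. The topology, the function names (`shortest_path`, `influence_scores`), and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Return one BFS shortest path from src to dst as a list of nodes, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def influence_scores(adj, compute_nodes, failed_link):
    """Fraction of each compute node's routes that traverse the failed link.

    Assumption: routes are the BFS shortest paths on the fault-free topology.
    """
    failed = frozenset(failed_link)
    scores = {}
    for src in compute_nodes:
        peers = [n for n in compute_nodes if n != src]
        hit = 0
        for dst in peers:
            path = shortest_path(adj, src, dst)
            if path is None:
                continue  # unreachable peers carry no route to count
            links = {frozenset(pair) for pair in zip(path, path[1:])}
            if failed in links:
                hit += 1
        scores[src] = hit / len(peers)
    return scores


# Hypothetical topology: a chain of routers r0 - r1 - r2,
# with compute nodes a and b on r0, c on r1, and d on r2.
TOPOLOGY = {
    "a": ["r0"], "b": ["r0"], "c": ["r1"], "d": ["r2"],
    "r0": ["a", "b", "r1"],
    "r1": ["r0", "c", "r2"],
    "r2": ["r1", "d"],
}
COMPUTE_NODES = ["a", "b", "c", "d"]

scores = influence_scores(TOPOLOGY, COMPUTE_NODES, ("r1", "r2"))
# Nodes below the (arbitrary) threshold are treated as healthy candidates.
healthy = [n for n in COMPUTE_NODES if scores[n] < 0.5]
print(scores, healthy)
```

In this toy case the failed link `r1`-`r2` cuts every route of node `d` but only one route in three for the other nodes, so a scheduler working from these scores would prefer `a`, `b`, and `c`, whose mutual traffic never touches the fault.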
Journal: Computers & Electrical Engineering - Volume 57, January 2017, Pages 266-280