Article ID: 461263
Journal: Journal of Systems and Software
Published Year: 2011
Pages: 9
File Type: PDF
Abstract

Thanks to their excellent extensibility and usability, computer clusters have become the dominant platform for parallel computing. Fault tolerance is mandatory for safety-critical applications running on clusters. In this paper we propose SAO, a service-aware and adaptive fault-tolerant scheduling algorithm using overlapping technologies, which can tolerate a node’s permanent failure at any time instant for real-time tasks with service requirements in heterogeneous clusters. SAO adopts the primary/backup model and considers timing constraints, service requirements, and system resource utilization. To improve system resource utilization, we employ backup-backup (BB) and primary-backup (PB) overlapping technologies and analyze the overlapping constraints. In addition, SAO achieves high system adaptivity by dynamically adjusting the service levels of tasks according to system load. Furthermore, to improve resource utilization and schedulability, SAO makes backup copies either adopt the passive execution scheme or reduce, as much as possible, the overlapping execution time of a task's primary and backup copies. In simulation experiments against a baseline algorithm, SAWO (a service-aware and adaptive fault-tolerant scheduling algorithm without overlapping technologies), and an existing algorithm, DYFARS, SAO achieves an average improvement of 51.25% in performability.
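The abstract mentions BB overlapping and its constraints without stating them. As a rough illustration only, the sketch below checks the condition commonly used in primary/backup scheduling under a single-node-failure model: two backup copies may share a time slot on one node only if their primaries sit on different nodes, since one failure then activates at most one of them. The `Copy` type, the function names, and the rule itself are assumptions made for this example and are not taken from the SAO paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One scheduled copy (primary or backup) of a task. Hypothetical type."""
    task: str
    node: int
    start: float
    finish: float


def intervals_overlap(a: Copy, b: Copy) -> bool:
    """True if the two copies' time slots intersect."""
    return a.start < b.finish and b.start < a.finish


def valid_backup_placement(primary: Copy, backup: Copy) -> bool:
    """A backup must run on a different node than its primary."""
    return primary.node != backup.node


def can_bb_overlap(primary_a: Copy, backup_a: Copy,
                   primary_b: Copy, backup_b: Copy) -> bool:
    """Assumed BB rule: backups sharing a slot on one node are allowed
    only if their primaries are on different nodes, so a single node
    failure cannot trigger both backups at once."""
    shares_slot = (backup_a.node == backup_b.node
                   and intervals_overlap(backup_a, backup_b))
    if not shares_slot:
        return True  # nothing overlaps, so there is nothing to check
    return primary_a.node != primary_b.node


# Illustrative schedule: the values are made up for this example.
p1, b1 = Copy("t1", 0, 0, 4), Copy("t1", 2, 5, 9)
p2, b2 = Copy("t2", 1, 1, 5), Copy("t2", 2, 6, 10)
p3, b3 = Copy("t3", 0, 2, 6), Copy("t3", 2, 7, 11)

ok_12 = can_bb_overlap(p1, b1, p2, b2)  # primaries on nodes 0 and 1
ok_13 = can_bb_overlap(p1, b1, p3, b3)  # both primaries on node 0
```

Here `ok_12` is allowed because the primaries run on different nodes, while `ok_13` is rejected because a failure of node 0 would require both backups to run in the same slot. PB overlapping and the service-level adjustment described in the abstract are not modeled.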

► A service-aware and adaptive fault-tolerant scheduling algorithm SAO was proposed.
► SAO adopts the primary/backup model and overlapping technologies.
► SAO can tolerate a node’s permanent failure at any time instant.
► SAO improves the adaptivity of real-time fault-tolerant systems.

Related Topics
Physical Sciences and Engineering > Computer Science > Computer Networks and Communications