Article ID: 490490
Journal: Procedia Computer Science
Published Year: 2013
Pages: 10 Pages
File Type: PDF
Abstract

The execution times of large-scale parallel applications on modern multi/many-core systems usually exceed the mean time between failures of those systems. Parallel applications must therefore tolerate hardware failures so that a machine failure does not wipe out all completed computation. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. In parallel applications, a checkpointing protocol is required to guarantee that the individual checkpoints form a consistent global state. Coordinated approaches are the most popular way to achieve globally consistent checkpoints, but they scale poorly because of the coordination they require at runtime. This work presents a new hybrid protocol that combines compile-time detection of valid recovery lines with a lightweight, asynchronous runtime protocol that negotiates the closest valid recovery line. Experimental results show the efficiency and scalability of the proposal.
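
The idea of negotiating the closest valid recovery line can be illustrated with a minimal sketch. It is not the protocol from the paper, whose details the abstract does not give. It assumes that recovery lines found at compile time are numbered in program order, that each process independently saves a checkpoint when it reaches one, and that on restart every process reports which recovery lines it holds a checkpoint for. The function name `closest_valid_recovery_line` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional, Set


def closest_valid_recovery_line(saved: Dict[int, Set[int]]) -> Optional[int]:
    """Pick the most recent recovery line that every process has checkpointed.

    `saved` maps a process rank to the set of recovery-line indices for which
    that process holds a checkpoint (an assumed representation, for
    illustration only). Returns None when no line is shared by all processes.
    """
    if not saved:
        return None
    # A recovery line is usable only if all processes can restart from it.
    common = set.intersection(*saved.values())
    # Higher index means later in program order, so less work is repeated.
    return max(common) if common else None


# Rank 1 failed before reaching line 3, so the application restarts from line 2.
reports = {0: {0, 1, 2, 3}, 1: {0, 1, 2}, 2: {0, 1, 2, 3, 4}}
print(closest_valid_recovery_line(reports))  # 2
```

Because the candidate lines are fixed ahead of time under this assumption, processes need no coordination while checkpointing; agreement is only needed once, at restart.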
