Achieving Checkpointing Global Consistency Through a Hybrid Compile Time and Runtime Protocol

Article ID	Journal	Published Year	Pages	File Type
490490	Procedia Computer Science	2013	10 Pages	PDF

Abstract

The execution times of large-scale parallel applications on modern multi/many-core systems are usually longer than the mean time between failures of those systems. Parallel applications must therefore tolerate hardware failures so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are widely used techniques for implementing fault-tolerant applications. In a parallel application, a checkpointing protocol is required to guarantee that the individual checkpoints of the processes form a consistent global state. Coordinated approaches are the most popular way to achieve global checkpointing consistency, but their main drawback is poor scalability, caused by the coordination they require at runtime. This work presents a new hybrid protocol that combines the detection of valid recovery lines at compile time with a lightweight, asynchronous runtime protocol that negotiates the closest valid recovery line. Experimental results show the efficiency and scalability of the proposal.
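The negotiation step can be illustrated with a minimal sketch. The abstract does not describe the protocol's internals, so everything here is an assumption made for illustration: recovery lines identified at compile time are numbered in program order, each process reports the index of the last line at which it saved a checkpoint, and the closest valid recovery line is taken to be the newest one that every process has reached. The function name `closest_valid_recovery_line` and the dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict


def closest_valid_recovery_line(latest_checkpoint: Dict[int, int]) -> int:
    """Return the newest recovery line index that every process has checkpointed.

    Assumption (not from the paper): recovery lines are numbered in program
    order, and `latest_checkpoint` maps a process rank to the index of the
    last recovery line at which that process wrote a checkpoint.
    """
    if not latest_checkpoint:
        raise ValueError("no process reported a checkpoint")
    # A line is usable for restart only if all processes have a checkpoint
    # for it, so the slowest process bounds the choice.
    return min(latest_checkpoint.values())


if __name__ == "__main__":
    # Hypothetical state after a failure: rank -> last recovery line saved.
    reported = {0: 5, 1: 4, 2: 6}
    print(closest_valid_recovery_line(reported))  # 4
```

In this simplified view the processes exchange only one integer each, and only at restart, which is why no coordination is needed while the checkpoints are being taken.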