Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
490490 | 707499 | 2013 | 10-page PDF | Free download |

The execution times of large-scale parallel applications on modern multi/many-core systems are usually longer than the mean time between failures of those systems. Parallel applications must therefore tolerate hardware failures so that a machine failure does not discard all completed computation. Checkpointing and rollback recovery are widely used techniques for implementing fault-tolerant applications. In parallel applications, a checkpointing protocol is required to guarantee that the individual checkpoints form a consistent global state. Coordinated approaches are the most popular way to achieve global checkpoint consistency, but they scale poorly because of the coordination required at runtime. This work presents a new hybrid protocol that combines compile-time detection of valid recovery lines with a lightweight, asynchronous runtime protocol that negotiates the closest valid recovery line. Experimental results show the efficiency and scalability of the proposal.
Journal: Procedia Computer Science - Volume 18, 2013, Pages 169-178
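
The idea of negotiating the closest valid recovery line can be illustrated with a minimal sketch. It is not the protocol from the paper, whose details the abstract does not give. It assumes, hypothetically, that the compile-time analysis numbers the valid recovery lines 0, 1, 2, ... in program order, that every process checkpoints at each line it reaches, and that on restart each process reports the index of the last line it checkpointed. Under those assumptions the closest line available to all processes is the minimum of the reported indices. The function name and the dictionary layout are illustrative only.

```python
from typing import Dict


def closest_valid_recovery_line(latest_checkpoint: Dict[int, int]) -> int:
    """Return the most recent recovery line that every process has checkpointed.

    latest_checkpoint maps a process rank to the index of the last recovery
    line at which that process stored a checkpoint (hypothetical layout).
    """
    if not latest_checkpoint:
        raise ValueError("no process reported a checkpoint")
    # A line is usable only if all processes reached it, so the slowest
    # process determines the line that everyone rolls back to.
    return min(latest_checkpoint.values())


# Three processes failed at different points of the execution.
reported = {0: 4, 1: 3, 2: 5}
line = closest_valid_recovery_line(reported)

# Number of recovery lines each process has to roll back.
rollback = {rank: last - line for rank, last in reported.items()}
```

In this example the processes agree on line 3, so rank 1 restarts from its latest checkpoint while ranks 0 and 2 discard one and two later checkpoints respectively. A real runtime protocol would exchange these indices through messages rather than a shared dictionary.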