Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
426132 | Future Generation Computer Systems | 2012 | 8 Pages |
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ) transparently through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not bound to a specific checkpointer and can therefore be used transparently on top of any checkpointer package. To evaluate the prototype, we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and to independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.
► We integrated independent checkpointing with partial message-logging in XtreemOS.
► The solution works transparently on top of any underlying checkpointer package.
► We have evaluated the prototype within a heterogeneous grid environment.
► We have compared it with the existing coordinated checkpointing protocol.
► Independent checkpointing performed better for the selected test applications.
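
The idea of running independent checkpointing on top of interchangeable checkpointer packages can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Checkpointer` interface, the two toy backends, and the `Process` class are hypothetical and are not the XtreemGCP API or the BLCR, LinuxSSI, or OpenVZ interfaces. The sketch also logs every received message, which is a simplification of the partial message-logging mentioned in the highlights.

```python
import copy
import pickle
from abc import ABC, abstractmethod


class Checkpointer(ABC):
    """Hypothetical uniform interface; not the real XtreemGCP API."""

    @abstractmethod
    def checkpoint(self, state: dict) -> object:
        """Return an opaque image of the given state."""

    @abstractmethod
    def restart(self, image: object) -> dict:
        """Rebuild a state from an image produced by checkpoint()."""


class DeepCopyCheckpointer(Checkpointer):
    """Toy backend that keeps the image as an in-memory copy."""

    def checkpoint(self, state):
        return copy.deepcopy(state)

    def restart(self, image):
        return copy.deepcopy(image)


class PickleCheckpointer(Checkpointer):
    """Toy backend that keeps the image as serialized bytes."""

    def checkpoint(self, state):
        return pickle.dumps(state)

    def restart(self, image):
        return pickle.loads(image)


class Process:
    """A process that checkpoints on its own schedule and logs its inputs."""

    def __init__(self, name, checkpointer):
        self.name = name
        self.checkpointer = checkpointer
        self.state = {"total": 0}
        self.image = checkpointer.checkpoint(self.state)
        self.log = []  # messages received since the last checkpoint

    def receive(self, value):
        self.log.append(value)  # assumed to survive a crash
        self.state["total"] += value

    def take_checkpoint(self):
        # Independent: no coordination with any other process.
        self.image = self.checkpointer.checkpoint(self.state)
        self.log.clear()

    def recover(self):
        # Roll back to the last image, then replay the logged messages.
        self.state = self.checkpointer.restart(self.image)
        for value in self.log:
            self.state["total"] += value


def demo():
    a = Process("a", DeepCopyCheckpointer())
    b = Process("b", PickleCheckpointer())
    a.receive(1)
    a.take_checkpoint()  # only a checkpoints; b is not involved
    a.receive(2)
    b.receive(10)
    before = (a.state["total"], b.state["total"])
    a.state, b.state = {}, {}  # simulate losing the volatile state
    a.recover()
    b.recover()
    return before, (a.state["total"], b.state["total"])
```

In this sketch the recovery logic in `Process` never inspects the image, so either backend can be swapped in without changing the protocol code. Replaying the log after rollback is what lets each process restore its pre-failure state without forcing the others to roll back with it.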