Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
10330118 | Future Generation Computer Systems | 2005 | 7 Pages |
Abstract
Fault-tolerant message-passing environments protect parallel applications against node failures. Very large-scale computing systems, ranging from large clusters to worldwide Global Computing systems, require a high level of fault tolerance to run parallel applications efficiently. The Channel Memory approach provides the infrastructure for scalable tolerance to simultaneous faults. Combined with a specially designed checkpointing system and recovery protocol, this approach has resulted in the MPICH-V architecture. In this paper, we describe CMDE, a stand-alone distributed program system that is based on the MPICH-V architecture and implements an approach to tolerating faults of the Channel Memories.
Related Topics
Physical Sciences and Engineering
Computer Science
Computational Theory and Mathematics
Authors
A. Selikhov, C. Germain