Article ID: 424741
Journal: Future Generation Computer Systems
Published Year: 2010
Pages: 10
File Type: PDF
Abstract
A systematic study of issues related to suspending, migrating and resuming virtual clusters for data-driven HPC applications is presented. The focus is on nontrivial virtual clusters, that is, clusters in which the running computation is expected to be coordinated and strongly coupled. It is shown that this requires all cluster-level operations, such as start and save, to be performed as synchronously as possible on all nodes, which introduces the need for barriers at the virtual cluster computing meta-level. Once a synchronization mechanism is provided and appropriate transport strategies have been set up, it is possible to suspend, migrate and resume whole virtual clusters composed of “heavy” virtual machines (4 GB RAM, 6 GB disk images) in times on the order of a few minutes without disrupting the parallel computation (albeit of the MapReduce type) running inside them. The approach is intrinsically parallel and is expected to scale to larger virtual clusters without difficulty.
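The barrier idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it simulates cluster nodes as local threads, and the names `synchronized_save`, `prepare_delays` and `save_fn` are hypothetical. Each simulated node finishes its own preparation at a different time, waits at a barrier, and only then runs the save operation, so that all saves start at nearly the same instant.

```python
import threading
import time


def synchronized_save(prepare_delays, save_fn):
    """Run save_fn on every simulated node, released together by a barrier.

    prepare_delays: dict mapping a (hypothetical) node name to the seconds
        that node needs before it is ready to save.
    save_fn: callable taking a node name; stands in for the real
        "save VM state" operation, which is not modelled here.
    Returns a dict mapping each node name to the time its save started.
    """
    barrier = threading.Barrier(len(prepare_delays))
    start_times = {}
    lock = threading.Lock()

    def worker(name, delay):
        time.sleep(delay)      # nodes become ready at different times
        barrier.wait()         # block until every node is ready
        started = time.monotonic()
        with lock:
            start_times[name] = started
        save_fn(name)

    threads = [
        threading.Thread(target=worker, args=(name, delay))
        for name, delay in prepare_delays.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return start_times


if __name__ == "__main__":
    saved = []
    times = synchronized_save(
        {"node0": 0.0, "node1": 0.1, "node2": 0.2},
        saved.append,
    )
    skew = max(times.values()) - min(times.values())
    print(f"saved {sorted(saved)}, start skew {skew:.4f} s")
```

In a real virtual cluster the barrier would have to span physical hosts (for example through a coordinating service), and the save step would call the hypervisor; both are outside the scope of this sketch.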
Related Topics
Physical Sciences and Engineering > Computer Science > Computational Theory and Mathematics