Diskless checkpointing
As the choice of parallel platforms shifts from dedicated parallel machines to networks of workstations, the need for program fault tolerance has never been greater. In general-purpose computing environments, checkpointing is the only means of providing programs with fault tolerance. Checkpointing usually involves saving program state to disk. In parallel environments, however, stable storage becomes a bottleneck that prevents efficient checkpointing. This thesis presents algorithms that provide parallel programs with fault tolerance without relying on stable storage. These algorithms were implemented and compared with traditional disk-based algorithms. The results show that diskless checkpointing is a viable way to provide efficient fault tolerance with low overhead.
Thesis97.P84.pdf
5.83 MB
a16511a1db9e79eca201b567e290ffc3