Masters Theses
Date of Award
12-1997
Degree Type
Thesis
Degree Name
Master of Science
Major
Computer Science
Major Professor
James S. Plank
Committee Members
Brad Vander Zanden
Abstract
Checkpointing enables users of distributed systems to perform job swapping and process migration and to achieve fault tolerance. While checkpointers typically provide job swapping and process migration with reasonable overhead, the overhead for fault tolerance is often too high. The reason for this is not inherent in the act of checkpointing, but instead stems from how the checkpoints are placed on stable storage.
This thesis explores two placement strategies for checkpointing in distributed systems. These are called Single Processor Fault Tolerance and Reed-Solomon coding. Both strategies are adaptations of RAID techniques [16, 41] for checkpointing systems, and aim to improve performance at the expense of fault coverage. We detail an implementation of these strategies in MIST, a checkpointer for PVM, and present performance results of these and standard checkpoint placement strategies. The conclusions that we draw are that both strategies can improve the performance of checkpointing, and should be employed by users who desire improved performance over wholesale failure coverage.
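As an illustration of the RAID-style idea behind tolerating a single processor failure, the following minimal sketch keeps one XOR parity block over all processors' checkpoints and rebuilds one lost checkpoint from the survivors. It is an assumption-laden toy, not the MIST implementation: the function names (`xor_bytes`, `make_parity`, `recover`), the zero-padding scheme, and the in-memory byte strings are all hypothetical, and the abstract does not state that the thesis uses exactly this encoding.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Build one parity block over all checkpoints (hypothetical helper).

    Shorter checkpoints are zero-padded to the longest one, which leaves
    the XOR result unchanged for the padded positions.
    """
    size = max(len(c) for c in checkpoints)
    padded = [c.ljust(size, b"\0") for c in checkpoints]
    return reduce(xor_bytes, padded)


def recover(survivors: list[bytes], parity: bytes, length: int) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and parity.

    `length` is the original size of the lost checkpoint; it is assumed
    to be recorded somewhere outside this sketch.
    """
    padded = [c.ljust(len(parity), b"\0") for c in survivors]
    return reduce(xor_bytes, padded, parity)[:length]


if __name__ == "__main__":
    checkpoints = [b"state-of-p0", b"p1", b"process-two!"]
    parity = make_parity(checkpoints)
    lost = 1  # pretend processor 1 failed
    survivors = [c for i, c in enumerate(checkpoints) if i != lost]
    rebuilt = recover(survivors, parity, len(checkpoints[lost]))
    print(rebuilt == checkpoints[lost])
```

A single parity block of this kind can rebuild only one lost checkpoint at a time. Tolerating several simultaneous failures requires a stronger code such as Reed-Solomon, which this sketch does not attempt.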
Recommended Citation
Pace, Darryl V., "Checkpoint placement strategies for fault tolerance on networks of workstations." Master's Thesis, University of Tennessee, 1997.
https://trace.tennessee.edu/utk_gradthes/10673