Masters Theses

Date of Award

12-1997

Degree Type

Thesis

Degree Name

Master of Science

Major

Computer Science

Major Professor

James S. Plank

Committee Members

Brad Vander Zanden

Abstract

Checkpointing is a functionality that enables users of distributed systems to perform job swapping, process migration and fault-tolerance. While checkpointers typically provide job swapping and process migration with reasonable overhead, the overhead for fault-tolerance is often too high. The reason for this is not inherent in the act of checkpointing, but instead stems from how the checkpoints are placed on stable storage.

This thesis explores two placement strategies for checkpointing in distributed systems. These are called Single Processor Fault Tolerance, and Reed-Solomon coding. Both strategies are adaptations of RAID techniques [16, 41] for checkpointing systems, and aim to improve performance at the expense of fault coverage. We detail an implementation of these strategies in MIST, a checkpointer for PVM, and present performance results of these and standard checkpoint placement strategies. The conclusions that we draw are that both strategies can improve the performance of checkpointing, and should be employed by users who desire improved performance over wholesale failure coverage.
