Doctoral Dissertations

Author

Youngbae Kim

Date of Award

8-1996

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Major

Computer Science

Major Professor

Jack J. Dongarra

Committee Members

Mike Berry, Ohannes Karakashian, Jim Plank

Abstract

As parallel and distributed systems proliferate, making parallel applications fault-tolerant becomes increasingly important, because such applications grow more prone to failure as the number of processors increases.

This dissertation explores fault tolerance in a wide variety of matrix operations for parallel and distributed scientific computing. It proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. This fault-tolerant computing paradigm relies on checkpointing and rollback recovery using processor and memory redundancy. The paradigm is an algorithm-based approach, in which fault tolerance techniques are tailored to each numerical algorithm without redesigning the algorithm or replicating the processes. The paradigm tolerates the changing and failure-prone nature of a computing platform, thereby allowing users to run their parallel codes dynamically and efficiently.
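To make checkpointing and rollback recovery with processor and memory redundancy concrete, the following is a minimal single-process sketch. It is not taken from the dissertation. It assumes one common realization of such redundancy: each worker keeps an in-memory copy of its state, and one spare processor keeps the elementwise sum (a checksum) of all workers' checkpoints. The names `make_checksum`, `recover`, `step`, and `run` are hypothetical, the "workers" are plain Python lists, and the computation is a toy update rather than a real matrix operation. The sketch also assumes at least two workers and at most one failure at a time.

```python
def make_checksum(checkpoints):
    """Elementwise sum of all workers' checkpoints (held by the spare processor)."""
    return [sum(column) for column in zip(*checkpoints)]


def recover(checkpoints, checksum, failed):
    """Rebuild the failed worker's checkpoint from the survivors and the checksum."""
    survivors = [c for i, c in enumerate(checkpoints) if i != failed]
    return [total - sum(column)
            for total, column in zip(checksum, zip(*survivors))]


def step(state):
    """Toy stand-in for one phase of a numerical algorithm."""
    return [2 * x + 1 for x in state]


def run(states, steps, fail_at=None, failed=0):
    """Run `steps` phases, checkpointing in memory before each one.

    If `fail_at` equals the current phase, worker `failed` is assumed to lose
    both its live state and its local checkpoint. Its checkpoint is rebuilt
    from the checksum, every worker rolls back, and the phase is redone.
    """
    states = [list(s) for s in states]
    for k in range(steps):
        # Checkpoint: local copies on each worker, checksum on the spare.
        local = [list(s) for s in states]
        checksum = make_checksum(local)

        states = [step(s) for s in states]

        if k == fail_at:
            local[failed] = None                      # simulated loss
            local[failed] = recover(local, checksum, failed)
            states = [step(s) for s in local]         # roll back and redo
    return states


if __name__ == "__main__":
    initial = [[1, 2], [3, 4], [5, 6]]
    print(run(initial, steps=3) == run(initial, steps=3, fail_at=1, failed=2))
```

The sketch uses integer data so that subtracting the survivors from the checksum is exact. With floating-point data the reconstruction would be subject to rounding error, which a real implementation would have to account for.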

This dissertation describes fault-tolerant implementations of various classes of high-performance matrix operations in a parallel programming environment. The implementations are currently applicable to networks of workstations. An empirical performance evaluation of the implementations on a network of workstations confirms the advantages of our paradigm: low overhead, simplicity, ease of implementation, and applicability to scientific applications. The evaluation also demonstrates that the paradigm is an effective approach to achieving fast, reliable scientific computations on networks of workstations.
