Doctoral Dissertations
Date of Award
8-1996
Degree Type
Dissertation
Degree Name
Doctor of Philosophy
Major
Computer Science
Major Professor
Jack J. Dongarra
Committee Members
Mike Berry, Ohannes Karakashian, Jim Plank
Abstract
As parallel and distributed systems proliferate, making parallel applications fault-tolerant becomes increasingly important, because such applications grow more prone to failure as the number of processors increases.
This dissertation explores fault tolerance in a wide variety of matrix operations for parallel and distributed scientific computing. It proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. This fault-tolerant computing paradigm relies on checkpointing and rollback recovery using processor and memory redundancy. The paradigm is an algorithm-based approach, in which fault tolerance techniques are tailored to each numerical algorithm without redesigning the algorithm or replicating its processes. The paradigm tolerates the changing and failure-prone nature of a computing platform, thereby allowing users to run their parallel codes dynamically and efficiently.
This dissertation describes fault-tolerant implementations of various classes of high-performance matrix operations in a parallel programming environment. The implementations are currently applicable to networks of workstations. An empirical performance evaluation of the implementations on a network of workstations confirms that the advantages of our paradigm are its low overhead, simplicity, ease of implementation, and feasibility for scientific applications. This evaluation also demonstrates that the paradigm is an effective approach to achieving fast, reliable scientific computations on networks of workstations.
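The abstract does not spell out how checkpointing with processor and memory redundancy works. As a rough illustration only, and not necessarily the scheme used in the dissertation, the sketch below assumes one extra (redundant) processor that keeps an element-wise checksum of every processor's in-memory checkpoint, so that the state of a single failed processor can be rebuilt from the survivors. The names `make_checksum` and `recover` are hypothetical, and integer data is used to sidestep floating-point round-off.

```python
from typing import List, Optional

Vector = List[int]


def make_checksum(checkpoints: List[Vector]) -> Vector:
    """Element-wise sum of all local checkpoints.

    Assumption: this sum is what the redundant processor stores in memory.
    """
    return [sum(column) for column in zip(*checkpoints)]


def recover(checkpoints: List[Optional[Vector]], checksum: Vector) -> Vector:
    """Rebuild the one missing checkpoint (marked None) from the survivors."""
    missing = [i for i, c in enumerate(checkpoints) if c is None]
    if len(missing) != 1:
        raise ValueError("this sketch tolerates exactly one failure")
    survivors = [c for c in checkpoints if c is not None]
    if not survivors:
        # Only one processor existed, so the checksum is its checkpoint.
        return list(checksum)
    partial = [sum(column) for column in zip(*survivors)]
    # Missing state = checksum minus the sum of the surviving states.
    return [total - part for total, part in zip(checksum, partial)]


# Three application processors, each holding a small slice of state.
local_states = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(local_states)

# Simulate the failure of processor 1 and roll back to its checkpoint.
damaged = [local_states[0], None, local_states[2]]
restored = recover(damaged, checksum)
```

In this sketch a single checksum tolerates exactly one failure between checkpoints; tolerating more would need additional redundancy, which is outside what the abstract describes.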
Recommended Citation
Kim, Youngbae, "Fault tolerant matrix operations for parallel and distributed systems." PhD diss., University of Tennessee, 1996.
https://trace.tennessee.edu/utk_graddiss/9778