Fault tolerant matrix operations for parallel and distributed systems

Date Issued
August 1, 1996
Author(s)
Kim, Youngbae
Advisor(s)
Jack J. Dongarra
Additional Advisor(s)
Mike Berry, Ohannes Karakashian, Jim Plank
Abstract

As parallel and distributed systems proliferate, making parallel applications fault-tolerant becomes increasingly important, because such applications grow more prone to failure as the number of processors increases.


This dissertation explores fault tolerance in a wide variety of matrix operations for parallel and distributed scientific computing. It proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. This fault-tolerant computing paradigm relies on checkpointing and rollback recovery using processor and memory redundancy. The paradigm is an algorithm-based approach, in which fault tolerance techniques are tailored to each numerical algorithm without redesigning the algorithm or replicating the processes. The paradigm tolerates the changing and failure-prone nature of a computing platform, thereby allowing users to run their parallel codes dynamically and efficiently.
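The combination of checkpointing, rollback recovery, and processor and memory redundancy described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the dissertation's implementation: the abstract does not specify the encoding, so the example assumes a simple sum checksum held by one spare processor, and the names `Worker`, `take_checkpoint`, and `recover` are hypothetical.

```python
from typing import List

Block = List[float]


def checksum(blocks: List[Block]) -> Block:
    """Element-wise sum of equally sized blocks (assumed encoding)."""
    return [sum(column) for column in zip(*blocks)]


class Worker:
    """Hypothetical stand-in for one processor holding a block of a matrix."""

    def __init__(self, data: Block) -> None:
        self.data = list(data)    # working copy, changed by the computation
        self.saved = list(data)   # in-memory checkpoint (memory redundancy)

    def checkpoint(self) -> None:
        self.saved = list(self.data)

    def rollback(self) -> None:
        self.data = list(self.saved)


def take_checkpoint(workers: List[Worker]) -> Block:
    """Each worker saves its block; the returned checksum models the
    state kept on a spare processor (processor redundancy)."""
    for w in workers:
        w.checkpoint()
    return checksum([w.saved for w in workers])


def recover(workers: List[Worker], failed: int, parity: Block) -> None:
    """Roll survivors back to the checkpoint and rebuild the lost block.

    Assumes exactly one failure and at least one surviving worker.
    """
    survivors = [w for i, w in enumerate(workers) if i != failed]
    for w in survivors:
        w.rollback()
    partial = checksum([w.saved for w in survivors])
    lost = [p - s for p, s in zip(parity, partial)]
    workers[failed] = Worker(lost)  # replacement processor takes over


if __name__ == "__main__":
    workers = [Worker([1.0, 2.0]), Worker([3.0, 4.0]), Worker([5.0, 6.0])]
    parity = take_checkpoint(workers)
    workers[0].data[0] = 100.0           # computation continues after checkpoint
    recover(workers, failed=1, parity=parity)
    print([w.data for w in workers])
```

With a sum checksum, floating-point rounding can make the rebuilt block differ slightly from the original for general data; the example uses small integer-valued floats so that the reconstruction is exact.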

This dissertation describes the fault-tolerant implementations of various classes of high-performance matrix operations in a parallel programming environment. The implementations are currently applicable to networks of workstations. An empirical performance evaluation of the implementations on a network of workstations confirms that the advantages of our paradigm are its low overhead, simplicity, ease of implementation, and feasibility for scientific applications. This evaluation also demonstrates that the paradigm is an effective approach to achieving fast, reliable scientific computations on networks of workstations.

Degree
Doctor of Philosophy
Major
Computer Science
File(s)
Name
Thesis96b.K559.pdf
Size
9.53 MB
Format
Unknown
Checksum (MD5)
3635887d3844d03ca53ce94c204ec1e0
