Repository logo
Log In(current)
  1. Home
  2. Colleges & Schools
  3. Graduate School
  4. Doctoral Dissertations
  5. Algorithm-Based Fault Tolerance for Two-Sided Dense Matrix Factorizations
Details

Algorithm-Based Fault Tolerance for Two-Sided Dense Matrix Factorizations

Date Issued
December 1, 2015
Author(s)
Jia, Yulu  
Advisor(s)
Jack J. Dongarra
Additional Advisor(s)
James Plank
Michael Berry
Ohannes Karakashian
Permanent URI
https://trace.tennessee.edu/handle/20.500.14382/24721
Abstract

The mean time between failure (MTBF) of large supercomputers is decreasing, and future exascale computers are expected to have a MTBF of around 30 minutes. Therefore, it is urgent to prepare important algorithms for future machines with such a short MTBF. Eigenvalue problems (EVP) and singular value problems (SVP) are common in engineering and scientific research. Solving EVP and SVP numerically involves two-sided matrix factorizations: the Hessenberg reduction, the tridiagonal reduction, and the bidiagonal reduction. These three factorizations are computation intensive, and have long running times. They are prone to suffer from computer failures.


We designed algorithm-based fault tolerant (ABFT) algorithms for the parallel Hessenberg reduction and the parallel tridiagonal reduction. The ABFT algorithms target fail-stop errors. These two fault tolerant algorithms use a combination of ABFT and diskless checkpointing. ABFT is used to protect frequently modified data . We carefully design the ABFT algorithm so the checksums are valid at the end of each iterative cycle. Diskless checkpointing is used for rarely modified data. These checkpoints are in the form of checksums, which are small in size, so the time and storage cost to store them in main memory is small. Also, there are intermediate results which need to be protected for a short time window. We store a copy of this data on the neighboring process in the process grid.

We also designed algorithm-based fault tolerant algorithms for the CPU-GPU hybrid Hessenberg reduction algorithm and the CPU-GPU hybrid bidiagonal reduction algorithm. These two fault tolerant algorithms target silent errors. Our design employs both ABFT and diskless checkpointing to provide data redundancy. The low cost error detection uses two dot products and an equality test. The recovery protocol uses reverse computation to roll back the state of the matrix to a point where it is easy to locate and correct errors.

We provided theoretical analysis and experimental verification on the correctness and efficiency of our fault tolerant algorithm design. We also provided mathematical proof on the numerical stability of the factorization results after fault recovery. Experimental results corroborate with the mathematical proof that the impact is mild.

Subjects

ABFT

fault tolerance

dense linear algebra

two-sided matrix fact...

Hessenberg

checksum

Disciplines
Numerical Analysis and Scientific Computing
Degree
Doctor of Philosophy
Major
Computer Science
Embargo Date
January 1, 2011
File(s)
Thumbnail Image
Name

yulu_jia_dissertation.pdf

Size

737.43 KB

Format

Adobe PDF

Checksum (MD5)

e57c0705b4e63f62416161b26f22326d

Built with DSpace-CRIS software - Extension maintained and optimized by 4Science

  • Privacy policy
  • End User Agreement
  • Send Feedback
  • Contact
  • Libraries at University of Tennessee, Knoxville
Repository logo COAR Notify