Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms

Date Issued
August 1, 2012
Author(s)
Du, Peng
Advisor(s)
Jack Dongarra
Additional Advisor(s)
Michael Berry, James Plank, Xiaobing Feng, Jack Dongarra
Abstract

Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used by scientific applications that solve systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale causes a rapid decline in the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations that handle both hard and soft errors.


For hard errors, we propose methods based on diskless checkpointing and Algorithm-Based Fault Tolerance (ABFT) to provide full matrix protection, covering both the left and right factors that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum, generated before the factorization, is kept up to date by the factorization operations themselves and protects the right factor. In addition, for environments without a fault-tolerant MPI implementation, we have integrated the Checkpoint-on-Failure (CoF) mechanism into one-sided dense linear algebra operations such as QR factorization to recover the running stack of the failed MPI process.
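The checksum idea in the paragraph above can be illustrated with a minimal serial sketch. It assumes the textbook ABFT construction (a row-sum checksum column appended to the matrix and carried through Gaussian elimination with partial pivoting), which is not necessarily the scheme used in the dissertation. The function names `lu_with_checksum` and `recover_column` are hypothetical and chosen only for this illustration.

```python
def lu_with_checksum(a):
    """Run Gaussian elimination on [A | A*e] and return [U | U*e].

    Assumption: a plain row-sum checksum column, as in textbook ABFT.
    Every row operation is applied to the checksum entry as well, so the
    checksum stays valid for the right factor U without being recomputed.
    """
    n = len(a)
    m = [row[:] + [sum(row)] for row in a]  # append one checksum per row
    for k in range(n):
        # Partial pivoting: swapping whole augmented rows keeps checksums valid.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):  # n + 1 includes the checksum column
                m[i][j] -= f * m[k][j]
    return m


def recover_column(m, lost):
    """Rebuild a lost column of U (known location) from the checksum column."""
    n = len(m)
    for i in range(n):
        others = sum(m[i][j] for j in range(n) if j != lost)
        m[i][lost] = m[i][n] - others


if __name__ == "__main__":
    a = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
    u = lu_with_checksum(a)
    for row in u:
        row[1] = 0.0  # simulate a hard error that wipes column 1
    recover_column(u, 1)
    for row in u:
        print(row)
```

In a distributed setting the lost data would be a block owned by a failed process rather than a single column, but the recovery arithmetic follows the same pattern: subtract the surviving data from the checksum.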

Soft errors are more challenging because they cause silent data corruption, which error propagation spreads into a large area of erroneous data. Full matrix protection is developed in which the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating-point weighted checksum scheme and a soft error modeling technique. To allow practical use on large-scale systems, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead.
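To show why a weighted checksum helps with silent errors, whose location is unknown, here is a minimal sketch of the classic two-checksum scheme for a single corrupted entry in one row. It is an illustration under stated assumptions, not the dissertation's method: the weights `1, 2, 3, ...`, the tolerance `tol`, and the function names `encode` and `locate_and_correct` are all hypothetical, and error propagation during factorization is not modeled.

```python
def encode(row):
    """Return (plain sum, weighted sum) for one row; weights are 1, 2, 3, ..."""
    s1 = sum(row)
    s2 = sum((j + 1) * x for j, x in enumerate(row))
    return s1, s2


def locate_and_correct(row, s1, s2, tol=1e-9):
    """Find and repair a single corrupted entry using the two checksums.

    If entry j was changed by delta, the plain checksum is off by delta and
    the weighted checksum is off by (j + 1) * delta, so their ratio gives
    the position. Returns the repaired index, or None if no error is seen.
    """
    d1 = sum(row) - s1
    d2 = sum((j + 1) * x for j, x in enumerate(row)) - s2
    if abs(d1) <= tol:
        return None
    j = round(d2 / d1) - 1  # recover the 0-based position from the weight
    row[j] -= d1            # undo the corruption
    return j


if __name__ == "__main__":
    row = [1.5, -2.0, 3.25, 0.5]
    s1, s2 = encode(row)
    row[2] += 10.0  # simulate a silent corruption of one entry
    print(locate_and_correct(row, s1, s2), row)
```

With floating-point data the ratio `d2 / d1` is only approximately an integer, which is why the sketch rounds it and uses a tolerance to decide whether an error is present at all.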

Experimental results on a large-scale cluster system and a multicore+GPGPU hybrid system confirm that our hard and soft error fault tolerance algorithms exhibit the expected error-correcting capability, low space and performance overhead, and compatibility with double-precision floating-point operations.

Subjects
fault tolerance
dense linear algebra
soft error
hard error
Disciplines
Numerical Analysis and Computation
Other Computer Engineering
Degree
Doctor of Philosophy
Major
Computer Science
Embargo Date
January 1, 2012
File(s)
Name
pengdu.pdf
Size
3.38 MB
Format
Adobe PDF
Checksum (MD5)
fbb20fc219093ed3460abd3708fa72dc

  • Libraries at University of Tennessee, Knoxville