Toward Message Passing Failure Management

Date Issued
May 1, 2013
Author(s)
Bland, Wesley B.
Advisor(s)
Jack J. Dongarra
Additional Advisor(s)
James S. Plank
Gregory D. Peterson
Vasilios Alexiades
Permanent URI
https://trace.tennessee.edu/handle/20.500.14382/22641
Abstract

As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside them. The type of recovery being performed has changed, moving from result checking, to rollback recovery, to algorithm-based fault tolerance, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol, Checkpoint-on-Failure, that uses the existing MPI Standard to provide limited fault tolerance within the current framework of MPI. It also proposes a new platform, User Level Failure Mitigation (ULFM), for building more complete and complex fault tolerance solutions on a true fault-tolerant MPI implementation. We demonstrate the overhead involved in using these fault-tolerant solutions and give examples of applications and libraries that construct other fault tolerance mechanisms from the constructs provided in ULFM.

Subjects
mpi
fault tolerance
high performance comp...
ulfm
user level failure mi...
Disciplines
OS and Networks
Degree
Doctor of Philosophy
Major
Computer Science
Embargo Date
January 1, 2011
File(s)
Name
wesleyblandfinal.pdf
Size
730.35 KB
Format
Adobe PDF
Checksum (MD5)
8969e57d4b53cf98454e89b89ab9f0c2
