Masters Theses
Date of Award
8-2015
Degree Type
Thesis
Degree Name
Master of Science
Major
Reliability and Maintainability Engineering
Major Professor
Xiaoyan Zhu
Committee Members
Ramon V. Leon, Alberto Garcia, Mingzhou Jin
Abstract
A supercomputer is a repairable system with a large number of compute nodes interconnected to work in harmony to achieve superior computational performance. The reliability of such a complex system depends on an effective maintenance strategy that involves both emergency and preventive maintenance. This thesis analyzes the maintenance records of four supercomputers operational at the National Institute of Computational Science located at Oak Ridge National Laboratory. We propose using the generalized proportional intensities model (GPIM) to model the maintenance interrupts because it captures both reliability and maintenance parameters and accommodates both emergency and preventive maintenance. We use this model to obtain, for each of the four supercomputers, the reliability parameters that indicate system performance and the maintenance parameters that indicate the effectiveness of maintenance actions. System performance measures such as reliability and availability are used to evaluate the effectiveness of the existing maintenance policy and to propose a new maintenance policy that increases system availability and reduces maintenance cost.
Recommended Citation
Cherukuri, Jagadish, "Modelling Supercomputer Maintenance Interrupts: Maintenance Policy Recommendations." Master's Thesis, University of Tennessee, 2015.
https://trace.tennessee.edu/utk_gradthes/3442
Included in
Industrial Engineering Commons, Other Engineering Commons, Risk Analysis Commons, Statistical Models Commons