Doctoral Dissertations

Degree Name

Doctor of Philosophy


Computer Science

Major Professor

Jack Dongarra

Committee Members

Jack Dongarra, George Bosilca, Michael Berry, Ichitaro Yamazaki


High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, when Moore's Law seemingly still holds true to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Combined with improvements in interconnect speed that lag behind increases in computational power, this makes current HPC systems heterogeneous and extremely complex.

This has been heralded as a great challenge to software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level, explore different approaches, and propose new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format.

Next, I propose a number of optimizations and implement them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform data broadcast operations, to further extend the limits of the Sequential Task Flow (STF) model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many different applications. Using a 2D five-point stencil as the test case, I then explore the possibility of combining the Communication-Avoiding (CA) algorithmic approach with the communication-computation overlap provided by runtime systems. This broad evaluation and extension of programming models highlights the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarize the profiling capabilities of the PaRSEC runtime system and demonstrate, with a use case, its important role in identifying performance bottlenecks and guiding optimizations.
