Date of Award


Degree Type


Degree Name

Master of Science


Computer Engineering

Major Professor

Gregory D. Peterson

Committee Members

Michael W. Berry, Nathanael R. Paul


The high performance computing (HPC) community is obsessed over the general matrix-matrix multiply (GEMM) routine. This obsession is not without reason. Most, if not all, Level 3 Basic Linear Algebra Subroutines (BLAS) can be written in terms of GEMM, and many of the higher level linear algebra solvers' (i.e., LU, Cholesky) performance depend on GEMM's performance. Getting high performance on GEMM is highly architecture dependent, and so for each new architecture that comes out, GEMM has to be programmed and tested to achieve maximal performance. Also, with emergent computer architectures featuring more vector-based and multi to many-core processors, GEMM performance becomes hinged to the utilization of these technologies. In this research, three Intel processor architectures are explored, including the new Intel MIC Architecture. Each architecture has different vector lengths and number of cores. The effort given to create three Level 3 BLAS routines (GEMM, TRSM, SYRK) is examined with respect to the architectural features as well as some parallel algorithmic nuances. This thorough examination culminates in a Cholesky (POTRF) routine which offers a legitimate test application. Lastly, four shared memory, parallel languages are explored for these routines to explore single-node supercomputing performance. These languages are OpenMP, Pthreads, Cilk and TBB. Each routine is developed in each language offering up information about which language is superior. A clear picture develops showing how these and similar routines should be written in OpenMP and exactly what architectural features chiefly impact performance.

Files over 3MB may be slow to open. For best results, right-click and select "save as..."