Integration of Slurm and Kubernetes for University Research Computing Workloads
Slurm is the de facto standard for resource and workload management on academic and research campus clusters in many university environments. Kubernetes, originally designed and developed by Google, is an open-source, cloud-native system for managing containerized application workloads and has become the de facto standard platform for Artificial Intelligence (AI) applications and orchestrated workflows. University campus high-performance computing clusters typically provide conventional and accelerator compute resources along with storage, with Slurm managing and scheduling jobs on those resources. With growing interest among university researchers in developing AI applications on campus cluster resources, and with AI and Machine Learning applications commonly developed in Docker containers and deployed on the widely adopted Kubernetes platform, there is a need to investigate solutions for integrating Slurm and Kubernetes on university clusters to support AI workloads. An optimal solution would avoid bifurcating campus cluster resources into separately managed Slurm and Kubernetes clusters, and would not require purchasing and building entirely new clusters solely for Kubernetes operation. This thesis describes researchers' AI computational needs, contrasts compute clusters managed by Slurm with those managed by Kubernetes, surveys the available Slurm and Kubernetes integration solutions, and reports on the experience of implementing and using one of those solutions on the university cluster at the University of Tennessee, Knoxville.
HazlewoodV_Thesis_Slurm_and_K8s_20251018.pdf