Building containerized environments for reproducibility and traceability of scientific workflows
Date of Award
Master of Science
Michael R. Jantz, Michael W. Berry, Gerald F. Lofstead II
Scientists use simulations to study natural phenomena, and trusting the simulation results is vital to the integrity of scientific discovery. To trust results, we must ensure the simulations’ reproducibility, replicability, and traceability by annotating their executions. These annotations allow us to build a record trail of the data moving through a given simulation workflow. Past efforts advocated building record trails at the system level, but a key hindrance to these approaches was the limitations of the system technology. The evolution from virtual machines to containers has opened new opportunities for system-level solutions. In this work, we propose an operating system-level solution that leverages the intrinsic characteristics of containers (i.e., portability, isolation, encapsulation, and unique identifiers) to annotate workflows and capture their metadata. Our solution enables transparent and automatic metadata collection and access, an easy-to-read record trail, and tight connections between data and metadata. We build a prototype of a containerized environment that encapsulates each component of a scientific workflow (i.e., data and applications) in individual containers. Our prototype implementation features zero-copy data transfer between containers, requires no modification of the underlying applications, and automatically links the metadata to the workflow. We assess the effectiveness of our prototype on four increasingly complex workflows, ranging from simple visualization applications such as Gnuplot to machine learning applications such as KKNN and random forest, and show that we are able to build workflow record trails at the OS level for all four scenarios automatically, in an easy-to-read form, and with a tight connection between data and metadata. We measure the costs of our containerized environment in terms of time and space.
We observe that the time overhead associated with containerization becomes tolerable as workflows grow in size and their applications run longer. We also observe that the space overhead is driven by the OS, software stack, and filesystem. Our containerized environment addresses metadata at the OS level by leveraging cutting-edge container technology to provide complete, transparent, and automatic collection and management of workflow metadata.
Olaya, Paula Fernanda, "Building containerized environments for reproducibility and traceability of scientific workflows." Master's Thesis, University of Tennessee, 2020.