Building containerized environments for reproducibility and traceability of scientific workflows

Date Issued
May 1, 2020
Author(s)
Olaya, Paula Fernanda
Advisor(s)
Michela Taufer
Additional Advisor(s)
Michael R. Jantz
Michael W. Berry
Gerald F. Lofstead II
Permanent URI
https://trace.tennessee.edu/handle/20.500.14382/41865
Abstract

Scientists use simulations to study natural phenomena, and trusting the simulation results is vital to the integrity of scientific discovery. To trust results, we must ensure the simulations' reproducibility, replicability, and traceability by annotating simulation executions. The annotations allow us to build a record trail of the data moving within a given simulation workflow. Past efforts advocated building record trails at the system level, but a key hindrance to these approaches was the limits of the system technology. The evolution from virtual machines to containers has opened new opportunities for system-level solutions. In this work, we propose an operating-system-level solution that leverages the intrinsic characteristics of containers (i.e., portability, isolation, encapsulation, and unique identifiers) to annotate workflows and capture their metadata. Our solution enables transparent and automatic metadata collection and access, an easy-to-read record trail, and tight connections between data and metadata. We build a prototype of a containerized environment that encapsulates each component of a scientific workflow (i.e., data and applications) in an individual container. Our prototype implementation features zero-copy data transfer between containers, requires no modification of the underlying applications, and automatically links the metadata to the workflow. We assess the effectiveness of our prototype on four increasingly complex workflows, ranging from simple visualization applications such as gnuplot to machine learning applications such as KKNN and random forest, and show that for all four scenarios we can build workflow record trails at the OS level that are automatic, easy to read, and tightly connect data and metadata. We measure the costs of our containerized environment in terms of time and space. We observe that the time overhead associated with containerization becomes tolerable for larger workflows with long-running applications. We also observe that the space overhead is driven by the OS, software stack, and file system. Our containerized environment addresses metadata at the OS level by leveraging cutting-edge container technology to provide complete, transparent, and automatic collection and management of workflow metadata.

Subjects

containers

metadata

software systems

data provenance

computer environments...

Degree
Master of Science
Major
Computer Science
File(s)
Name

utkirtd_13605.pdf

Size

1.83 MB

Format

Adobe PDF

Checksum (MD5)

79a59e13cc2623417f1cd4c65755a8eb
