Date of Award
Doctor of Philosophy
Data Science and Engineering
David Kainer, Michela Taufer, Michael Langston
As technology has improved, the field of biology has increasingly used high-performance computing techniques to analyze big data and gain insight into biological systems. Turning these large datasets of varying types into interpretable results requires a reproducible, efficient, and effective method. Iterative Random Forest (iRF) is an explainable supervised learner that makes few assumptions about the relationships between variables and can capture the complex interactions that are common in biological systems. This forest-based learner is the basis of iRF-Leave One Out Prediction (iRF-LOOP), an algorithm that takes a matrix of data and produces all-to-all predictive networks. This dissertation validates the improved performance of iRF over the industry-standard Random Forest, using synthetic and empirical data from various organisms. It also uses iRF to build a predictive model of COVID-19 outcomes from environmental features at the county level in the U.S. Finally, it presents a whole systems biology study in which an improved iRF-LOOP pre-processing pipeline, Divide-Test-Integrate, is used to produce new gene-to-gene predictive expression networks for a multiplex network study of the model organism Saccharomyces cerevisiae, using seed genes of interest from Septoria musiva.
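The leave-one-out idea behind an all-to-all predictive network can be illustrated with a minimal sketch. It is not the dissertation's implementation. Each column of the data matrix is treated in turn as the target, the remaining columns are scored as its predictors, and the scores become directed edge weights. The scorer here is a deliberately simple stand-in (absolute Pearson correlation) rather than iRF, and the function names, gene labels, and per-target normalisation are all assumptions made for illustration.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Column = Sequence[float]
Scorer = Callable[[Column, List[Column]], List[float]]


def abs_pearson_scorer(target: Column, predictors: List[Column]) -> List[float]:
    """Stand-in importance score: |Pearson r| of each predictor with the target.

    This is NOT iRF. A real pipeline would fit an iRF model for the target
    and return its feature importances here instead.
    """
    def pearson(x: Column, y: Column) -> float:
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        # A constant column carries no signal, so score it as 0.
        return cov / sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else 0.0

    return [abs(pearson(p, target)) for p in predictors]


def loop_network(
    columns: Dict[str, Column],
    scorer: Scorer = abs_pearson_scorer,
) -> Dict[Tuple[str, str], float]:
    """Build directed edges (predictor, target) -> weight, leaving one column out at a time."""
    edges: Dict[Tuple[str, str], float] = {}
    names = list(columns)
    for target in names:
        others = [name for name in names if name != target]
        scores = scorer(columns[target], [columns[name] for name in others])
        total = sum(scores)
        for name, score in zip(others, scores):
            # Assumption: normalise per target so incoming weights sum to 1.
            edges[(name, target)] = score / total if total > 0 else 0.0
    return edges


# Hypothetical toy matrix: three features measured across four samples.
data = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, -1.0, 1.0, -1.0],
}
network = loop_network(data)
```

Because the scorer is passed in as a parameter, a forest-based importance function could replace the correlation stand-in without changing the loop that assembles the network.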
Walker, Angelica M., "Iterative Random Forest Based High Performance Computing Methods Applied to Biological Systems and Human Health." PhD diss., University of Tennessee, 2022.