Doctoral Dissertations

Date of Award

8-2022

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Major

Life Sciences

Major Professor

Michael A. Langston

Committee Members

Faisal N. Abu-Khzam, Tian Hong, David J. Icove

Abstract

With the ever-increasing size of data available to researchers, traditional methods of analysis often cannot scale to match the problems being studied. Often only a subset of variables can be utilized or studied further, motivating the need for techniques that can prioritize variable selection. This dissertation describes the development and application of graph-theoretic techniques, particularly the notion of domination, for this purpose. In the first part of this dissertation, algorithms for vertex prioritization in the field of network controllability are studied. Here, the number of solutions to which a vertex belongs is used to classify that vertex and determine its suitability for controlling a network. Novel, efficient, and scalable algorithms are developed and analyzed. Empirical tests demonstrate the improvement of these algorithms over those already established in the literature. The second part of this dissertation concerns the prioritization of genes for loss-of-function allele studies in mice. The International Mouse Phenotyping Consortium leads the initiative to develop a loss-of-function allele for each protein-coding gene in the mouse genome. Only a small proportion of untested genes can be selected for further study. To address the need to prioritize genes, a generalizable data science strategy is developed. This strategy models genes as a gene-similarity graph and selects from it a subset that will be further characterized. Empirical tests demonstrate the method’s utility over that of pseudorandom selection and less computationally demanding methods. Finally, part three addresses the important task of preprocessing in the context of noisy public health data. Many public health databases have been developed to collect, curate, and store a variety of environmental measurements. Idiosyncrasies in these measurements, however, introduce noise into these databases in several ways, including missing, incorrect, outlying, and incompatible data.
Beyond noisy data, multiple measurements of similar variables can introduce problems of multicollinearity. Domination is again employed in a novel graph method to handle autocorrelation. Empirical results using the Public Health Exposome dataset are reported. Together, these three parts demonstrate the utility of subset selection via domination when applied to a multitude of data sources from a variety of disciplines in the life sciences.
