Doctoral Dissertations
Date of Award
8-2024
Degree Type
Dissertation
Degree Name
Doctor of Philosophy
Major
Computer Science
Major Professor
Scott J. Emrich
Committee Members
Catherine D. Schuman, Andrew D. Steen, Jian Huang
Abstract
DNA (deoxyribonucleic acid) carries the genetic information for the biological processes and functions of all organisms. It is composed of nucleotides, which can be grouped into triplets (3-mers) called codons. It is well known that codons encoding the same amino acid, referred to as "synonymous" codons, are used with differing frequencies between organisms. Prior research has revealed that some codons are used with much higher frequency than others, making them "preferred" in highly expressed genes. This has led to the development of multiple computational models that predict gene expression well for some protein-coding genes; however, their performance is often negatively impacted when modeling higher organisms with increasingly diverse gene expression.
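As a minimal illustration of synonymous codon usage (a sketch, not the dissertation's method), the Python snippet below splits an in-frame coding sequence into codons and computes each codon's share among the synonymous codons observed for the same amino acid. The names `split_codons`, `relative_usage`, and `toy_gene` are hypothetical, the sequence is made up, and the codon table is only a partial slice of the standard genetic code.

```python
from collections import Counter

# Partial slice of the standard genetic code (DNA alphabet); illustrative only.
CODON_TO_AA = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "GAA": "Glu", "GAG": "Glu",
    "ATG": "Met",
}


def split_codons(seq):
    """Split an in-frame coding sequence into 3-nucleotide codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def relative_usage(seq):
    """For each observed codon, return its share among synonymous codons."""
    # Codons missing from the partial table are skipped.
    codons = [c for c in split_codons(seq) if c in CODON_TO_AA]
    codon_counts = Counter(codons)
    aa_counts = Counter(CODON_TO_AA[c] for c in codons)
    return {c: n / aa_counts[CODON_TO_AA[c]] for c, n in codon_counts.items()}


# Made-up toy sequence: ATG GCC GCC GCT AAG AAG AAA GAA GAG
toy_gene = "ATGGCCGCCGCTAAGAAGAAAGAAGAG"
usage = relative_usage(toy_gene)
for codon, share in sorted(usage.items()):
    print(codon, CODON_TO_AA[codon], round(share, 2))
```

In this toy sequence, GCC accounts for two of the three alanine codons, so it would count as the "preferred" alanine codon here; real analyses aggregate such counts over many genes.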
Our goal is to help answer the broader question: why should an organism prefer one codon over another? With modern sequencing technologies producing sequence data at scale and at reduced cost, the abundance of available genomics data allows for the adoption of deep learning in modern bioinformatics pipelines. We are particularly interested in protein language models (PLMs), which apply natural language processing to the amino acid sequences themselves to accurately predict many properties of proteins, such as folding, function, and structure. We theorize that modeling the codons rather than the amino acids using protein language modeling paradigms will allow for capturing more diverse codon usage patterns.
In this dissertation, we build a foundation from the ground up for the first-ever effort towards "codon language modeling", i.e., using the codons rather than the amino acids in deep neural networks to capture intrinsic codon usage patterns from the sequence information alone. Our methodology differs from other works in two key ways: (i) we make distinct architectural choices aimed specifically at capturing codon bias so that it is more informative in downstream predictions, and (ii) we include demonstrations of explainable AI (XAI) methodologies that are able to validate previously discovered genomic patterns. Significantly, our final model, CodonT5, is able to generate "translations" comparable to those of previously established methods using only the sequence itself. This is important as it paves the way for more diverse sequence-to-sequence modeling that will be necessary in many applications involving replicating a reference protein in a host (e.g., insulin production, mRNA vaccines).
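To make the distinction between codon-level and amino-acid-level modeling concrete, the following Python sketch tokenizes two synonymous coding sequences both ways. It is an assumption-laden illustration, not CodonT5's actual tokenizer: the vocabulary `CODON_VOCAB` (all 64 codons in lexicographic order), the helper names, and the toy sequences are hypothetical, and the inputs are assumed to be in-frame.

```python
from itertools import product

# Hypothetical vocabulary: all 64 codons in lexicographic order, ids 0..63.
CODON_VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}

# Partial genetic code (one-letter amino acids), enough for the toy sequences.
CODON_TO_AA = {"ATG": "M", "GCC": "A", "GCT": "A", "AAA": "K", "AAG": "K"}


def codon_tokens(seq):
    """Map an in-frame sequence to one integer id per codon."""
    return [CODON_VOCAB[seq[i:i + 3]] for i in range(0, len(seq), 3)]


def amino_acid_tokens(seq):
    """Map an in-frame sequence to one amino-acid letter per codon."""
    return [CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3)]


# Two made-up sequences that encode the same peptide (Met-Ala-Lys)
# with different synonymous codons.
seq_a = "ATGGCCAAA"
seq_b = "ATGGCTAAG"

print(amino_acid_tokens(seq_a), amino_acid_tokens(seq_b))  # identical
print(codon_tokens(seq_a), codon_tokens(seq_b))            # differ
```

At the amino-acid level the two sequences are indistinguishable, while the codon-level tokens preserve the synonymous choices that a codon language model would need in order to learn usage patterns.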
Recommended Citation
Babjac, Ashley N., "CodonT5: A Multi-Task Codon Language Model for Species-to-Species Translation." PhD diss., University of Tennessee, 2024.
https://trace.tennessee.edu/utk_graddiss/10429
Included in
Artificial Intelligence and Robotics Commons, Biomedical Informatics Commons, Data Science Commons, Medical Genetics Commons