Doctoral Dissertations

Orcid ID

https://orcid.org/0000-0002-0991-7726

Date of Award

8-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Major

Computer Science

Major Professor

Scott J. Emrich

Committee Members

Catherine D. Schuman, Andrew D. Steen, Jian Huang

Abstract

DNA (deoxyribonucleic acid) carries the genetic information for the biological processes and functions of all organisms. It is composed of nucleotides, which can be grouped into triplets called codons. It is well known that codons encoding the same amino acid, referred to as "synonymous" codons, are used with differing frequencies between organisms. Prior research has revealed that some codons are used with much higher frequency than others, leading to them being "preferred" in highly expressed genes. This has led to the development of multiple computational models that predict gene expression well for some protein-coding genes; however, their performance is often negatively impacted when modeling higher organisms with increasingly diverse gene expression.
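
As a small illustration of what "synonymous codon usage" means in practice, the following Python sketch splits a coding sequence into codons and reports how often each synonymous codon is chosen for a given amino acid. It is not taken from the dissertation: the function names, the toy sequence, and the deliberately partial codon table are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Partial standard genetic code, for illustration only (not all 64 codons).
CODON_TO_AA = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "GAA": "Glu", "GAG": "Glu",
    "ATG": "Met",
}


def split_codons(cds):
    """Split a coding sequence into consecutive 3-nucleotide codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def synonymous_usage(cds):
    """Per amino acid, the fraction of times each synonymous codon is used."""
    counts = Counter(split_codons(cds))
    per_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TO_AA.get(codon)
        if aa is not None:  # codons outside the partial table are skipped
            per_aa[aa][codon] = n
    return {
        aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
        for aa, codons in per_aa.items()
    }


# Hypothetical toy sequence: ATG GCT GCT GCC AAA AAG AAA GAA
usage = synonymous_usage("ATGGCTGCTGCCAAAAAGAAAGAA")
for aa, codons in sorted(usage.items()):
    print(aa, codons)
```

In this toy sequence, alanine is encoded by GCT twice and GCC once, so GCT would count as the more frequently used synonymous codon. Real analyses would use the full genetic code and many genes rather than a single short sequence.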

Our goal is to help answer the broader question: why should an organism prefer one codon over another? Modern sequencing technologies have enabled the mass production of sequence data at reduced cost, and the resulting abundance of genomics data allows deep learning to be adopted in modern bioinformatics pipelines. We are particularly interested in protein language models (PLMs), which apply natural language processing to the amino acid sequences themselves to accurately predict many properties of proteins, such as folding, function, and structure. We theorize that modeling the codons rather than the amino acids using protein language modeling paradigms will allow for capturing more diverse codon usage patterns.

In this dissertation, we build a foundation from the ground up for the first-ever effort towards "codon language modeling", i.e., using the codons rather than the amino acids in deep neural networks to capture intrinsic codon usage patterns from the sequence information alone. Our methodology differs from other works in that we (i) make distinct architectural choices aimed specifically at capturing codon bias so that it is more informative in downstream predictions, and (ii) include demonstrations of explainable AI (XAI) methodologies that are able to validate previously discovered genomic patterns. Significantly, our final model, CodonT5, is able to generate "translations" comparable to those of previously established methods using only the sequence itself. This is important because it paves the way for more diverse sequence-to-sequence modeling, which will be necessary in many applications that involve replicating a reference protein in a host (e.g., insulin production, mRNA vaccines).
