Date of Award
Master of Science
Mike Leuze, Colleen Jonsson
Whole genome sequencing has developed rapidly and is now widely used, made possible by exponentially decreasing costs and computational advances in biological sequence analysis. A massive number of viral sequences have been produced. By October 2016, over 102,000 records had been archived in the NCBI Viral Genome Project, of which 7730 are RefSeq genomes. To better understand viral classification, phylogenomic analysis, which is based on whole-genome information, offers the possibility of reconstructing a “tree of life”. However, applying phylogenomic methods to large-scale collections of viral genomes is difficult. In this study, we designed a three-step strategy for identifying the optimal K-mer length in a viral phylogenomic analysis using a genomic alignment-free method. The three steps are: 1) Cumulative Relative Entropy, 2) Average Number of Common Features among genomes, and 3) Shannon Diversity Index. A dendrogram of 3905 RefSeq viral genomes was also constructed using the optimal K = 9. The resulting dendrogram is consistent with the viral taxonomy and the Baltimore classification of viruses.
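As a rough illustration of two of the quantities named above, the following Python sketch counts overlapping K-mers in a sequence, computes a Shannon diversity index over the K-mer frequencies, and averages the number of distinct K-mers shared between pairs of genomes. The definitions used here are assumptions, since the abstract does not give the thesis's exact formulas: the Shannon index is taken in its textbook form, H = -Σ p ln p, and a "common feature" is taken to mean a K-mer present in both genomes of a pair. The function names and the toy sequences are hypothetical, and Cumulative Relative Entropy is not sketched.

```python
from collections import Counter
from itertools import combinations
from math import log

NUCLEOTIDES = set("ACGT")


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping any with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= NUCLEOTIDES:
            counts[kmer] += 1
    return counts


def shannon_index(counts):
    """Textbook Shannon diversity H = -sum(p * ln p) over k-mer frequencies.

    Assumption: the thesis may normalise or define this differently.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in counts.values())


def average_common_features(profiles):
    """Mean number of distinct k-mers shared by each pair of profiles.

    Assumption: a 'common feature' is a k-mer present in both genomes.
    """
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    shared = [len(a.keys() & b.keys()) for a, b in pairs]
    return sum(shared) / len(pairs)


if __name__ == "__main__":
    # Toy sequences, not real viral genomes.
    genomes = ["ACGTACGTGA", "CGTACGATGA", "TTGACGTACC"]
    for k in range(1, 6):
        profiles = [kmer_counts(g, k) for g in genomes]
        mean_h = sum(shannon_index(p) for p in profiles) / len(profiles)
        print(k, round(mean_h, 3), round(average_common_features(profiles), 3))
```

Scanning a range of K values in this way shows the trade-off such measures are meant to expose: short K-mers are shared by almost every genome and carry little discriminating signal, while very long K-mers are rarely shared at all.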
Zhang, Qian, "Strategies for Identifying the Optimal Length of K-mer in a Viral Phylogenomic Analysis using Genomic Alignment-free Method." Master's Thesis, University of Tennessee, 2016.