Date of Award


Degree Type


Degree Name

Doctor of Philosophy


Life Sciences

Major Professor

Igor B. Jouline

Committee Members

Elias Fernandez, Jerome Baudry, Tongye Shen, Mircea Podar


With the explosion in the amount of available sequence data, computational methods have become indispensable for studying proteins. Domains are the fundamental structural, functional and evolutionary units that make up proteins, so studying them is an important part of understanding protein function and evolution. Hidden Markov Models (HMMs) are among the most successful methods applied to protein sequence and structure analysis. In this study, HMM-based methods were applied to study the evolution of sensory domains in microbial signal transduction systems, and to functionally characterize and identify cellulases in metagenomic datasets. HMM domain models revealed ambiguity in the sequence-based and structure-based definitions of the Cache domain family. Cache domains are extracellular sensory domains that are present in microbial signal transduction proteins and eukaryotic voltage-gated calcium channels. The ambiguity in domain definitions was resolved, and more accurate HMMs were built that detected more than 50,000 new members. It was discovered that Cache domains constitute the largest family of extracellular sensory domains in prokaryotes. Cache domains were also found to be remotely homologous to PAS domains at the level of sequence, a relationship previously suggested purely on the basis of structural comparisons. We used HMM-HMM comparisons to study the diversity of extracellular sensory domains in prokaryotic signal transduction systems. This approach allowed annotation of more than 46,000 sequences and reduced the percentage of unknown domains from 64% to 15%. New relationships were also discovered between domain families that were otherwise thought to be unrelated. Finally, HMMs were used to retrieve Family 48 glycoside hydrolases (GH48) from sequence databases. Analysis of these sequences enabled the identification of distinguishing features of cellulases. These features were used to identify GH48 cellulases in metagenomic datasets. In summary, HMM-based methods enabled domain identification, remote homology detection and functional characterization of protein domains.

Appendix-1.1.xlsx (134 kB)
HHsearch results for Cache superfamily

Appendix-1.2.xlsx (86 kB)
Overlapping hits with Cache domain prediction using new models

Appendix-1.3.xlsx (206 kB)
Phyletic distribution of Cache, PAS and GAF superfamilies

Appendix-2.1.xlsx (33 kB)
Supplementary for Chapter II

Appendix-3.1.pdf (176 kB)
Supplementary for Chapter III

Appendix-3.2.xlsx (24 kB)
Supplementary for Chapter III
