Doctoral Dissertations

Date of Award


Degree Type


Degree Name

Doctor of Philosophy


Life Sciences

Major Professor

Nathan C. VerBerkmoes, Tim E. Sparer

Committee Members

Tamah Fridman, Arnold M. Saxton, Chongle Pan


Since the first large-scale metaproteome was reported in 2005, metaproteomics has advanced rapidly in both its quantitative and qualitative metrics. Furthermore, metaproteomics is now applied as a general tool in microbial ecology across a wide variety of environmental studies. Although metaproteomics is becoming a useful and even standard tool for the microbial ecologist, standardized bioinformatics pipelines are not readily available. Therefore, we developed a quantitative and functional analysis pipeline for metaproteomics (QFAM) to help analyze large and complicated metaproteomics data in a robust and timely fashion, with outputs designed to be simple and clearly understood by the microbial ecologist.

QFAM starts by running peptide-spectrum searches of the resultant MS/MS datasets against a mixed metagenome/appropriate protein FASTA database. Its primary search algorithm is MyriMatch/IDPicker, which uses multiple CPUs effectively, has an accurate scoring system, correctly uses high-mass-accuracy MS data, and has a robust method for protein determination. These features are required for metaproteomics, which involves large protein databases and complicated peptide structure.

QFAM includes quantitative (QAM) and functional (FAM) analysis to provide dependable protein signatures and reliable information for understanding the characteristics of the metaproteome. QAM employs the ‘selfea’ R package, which provides probability models as well as Cohen’s effect sizes. Our benchmark data tests and Monte Carlo simulation results show that selfea can reduce false positives efficiently while losing few true positives, which is one of the key goals of proteomics and metaproteomics experiments.
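
The idea of pairing a probability model with an effect size can be illustrated with a minimal Python sketch. It does not reproduce selfea's interface or its models: the function names, the use of Cohen's d with a pooled standard deviation, and the thresholds (alpha of 0.05, minimum effect size of 0.8) are all assumptions chosen for illustration, and the p-value is taken as an input from whatever statistical test is in use.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def passes_both_filters(p_value, effect_size, alpha=0.05, min_effect=0.8):
    """Keep a protein only if it is both statistically significant and large in effect.

    alpha and min_effect are illustrative defaults, not values taken from selfea.
    """
    return p_value < alpha and abs(effect_size) >= min_effect


if __name__ == "__main__":
    # Hypothetical abundance values for one protein in two conditions.
    condition_a = [2, 4, 6]
    condition_b = [1, 3, 5]
    d = cohens_d(condition_a, condition_b)
    print(f"Cohen's d = {d:.2f}")
    # The p-value here is a made-up placeholder for a model's output.
    print("selected:", passes_both_filters(p_value=0.01, effect_size=d))
```

Requiring both criteria is what removes candidates whose p-values are small but whose differences are too minor to be of practical interest.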

FAM has two modules: BioSystems and COG analysis. The BioSystems module is most appropriate for well-annotated model organisms, such as humans, whereas the COG module is useful for less-annotated microorganisms and metagenome sequences. Both modules provide an enrichment test using Fisher’s exact test and a significance test using selfea. With these two statistics, FAM generates differentially enriched functional terms that help reveal the biological information contained in the metaproteome data.
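
As a rough illustration of the enrichment step, the following Python sketch computes a one-sided Fisher's exact test p-value for a single functional term from the hypergeometric distribution. It is not FAM's code: the function name, argument names, and example counts are hypothetical, and the one-sided (over-representation) form is an assumption.

```python
from math import comb


def fisher_enrichment_p(term_hits_in_selected, selected_size,
                        term_hits_total, total_proteins):
    """One-sided Fisher's exact test p-value for over-representation.

    Returns P(X >= term_hits_in_selected), where X is the number of proteins
    annotated with the term when selected_size proteins are drawn without
    replacement from total_proteins, of which term_hits_total carry the term.
    """
    max_hits = min(selected_size, term_hits_total)
    all_draws = comb(total_proteins, selected_size)
    # Sum the hypergeometric probabilities of the observed count and above.
    tail = sum(
        comb(term_hits_total, k)
        * comb(total_proteins - term_hits_total, selected_size - k)
        for k in range(term_hits_in_selected, max_hits + 1)
    )
    return tail / all_draws


if __name__ == "__main__":
    # Hypothetical counts: 8 proteins in total, 4 annotated with the term,
    # 4 proteins selected as differentially abundant, 3 of them annotated.
    p = fisher_enrichment_p(3, 4, 4, 8)
    print(f"enrichment p-value = {p:.4f}")
```

A small p-value indicates that the term appears among the selected proteins more often than expected by chance given its frequency in the full protein list.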

Two application studies in chapters 4 and 5 show how QFAM can be employed for metaproteomics data analysis. QFAM is distinguished from other proteomics pipelines by its multiprocessing support and by its combination of quantitative and functional analysis.
