Doctoral Dissertations

Degree Name

Doctor of Philosophy


Data Science and Engineering

Major Professor

Hong-Jun Yoon

Committee Members

Shang Gao, Russell Zaretzki, Audris Mockus


Information contained in electronic health records (EHRs), combined with the latest advances in machine learning (ML), has the potential to revolutionize the medical sciences. In particular, information contained in cancer pathology reports is essential for investigating cancer trends across the country. Unfortunately, much of the information in EHRs is stored as unstructured free text, which limits its usability and research potential. To overcome this accessibility barrier, cancer registries depend on expert personnel who read, interpret, and extract relevant information. As the number of stored pathology reports increases every day, this dependence on human experts presents scalability challenges. Recently, researchers have attempted to automate information extraction from cancer pathology reports using ML techniques commonly found in natural language processing (NLP). However, clinical text is inherently different from other common forms of text, and state-of-the-art NLP approaches often exhibit mediocre performance on it. In this study, we narrow this gap in the literature by investigating methods to tackle overfitting and improve the performance of ML models for the classification of cancer pathology reports, with the aim of reducing the dependency on human expert annotators. We (1) show that active learning can mitigate extreme class imbalance by increasing the representation of documents belonging to rare cancer types, (2) investigate the feasibility of ensemble learning and a mixture-of-experts variant to boost minority class performance, and (3) demonstrate that ensemble model distillation provides a strategy for quantifying the uncertainty inherent in labeled data, offering an effective low-resource solution that can be easily deployed by cancer registries.

Included in

Data Science Commons