Doctoral Dissertations

Date of Award

12-2021

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Major

Computer Science

Major Professor

Audris Mockus

Committee Members

Hairong Qi, Amir Sadovnik, Dawnie Wolfe Steadman

Abstract

Large image collections are becoming common in many fields and offer tantalizing opportunities to transform how research, work, and education are conducted, if the information and associated insights can be extracted from them. However, major obstacles to this vision exist. First, image datasets with associated metadata contain errors and need to be cleaned and organized before they can be easily explored and utilized. Second, such collections typically lack the necessary context or have missing attributes that need to be recovered. Third, such datasets are domain-specific and require human expert involvement to interpret the image content correctly. Fourth, the large size of these collections makes it time-consuming, costly, and in some cases infeasible to address the aforementioned problems. This dissertation aims to systematically address all four obstacles through curation: organizing, structuring, and enriching the data in image collections. Specifically, we use a collection of 1M photos from forensic anthropology, as well as other smaller image datasets, to design and implement an auto-curation framework consisting of three overarching phases, with associated unsupervised and semi-supervised techniques and tools to support each phase. As a result, we have developed data exploration techniques to support initial understanding of large image collections, an unsupervised clustering method for organizing such collections, a human-machine collaboration method to enable mass data labeling with relevant information, a semi-supervised method that reuses existing expert-provided content for a small portion of a dataset and propagates it to the remaining uncurated data, and a system to preserve, publish, and present the resulting curated data. Our evaluations of these techniques show that they outperform their corresponding state-of-the-art counterparts.
The general auto-curation framework and tools presented in this work are applicable to any large image dataset, and the techniques are specifically designed for image datasets with evolving content. We applied the proposed tools and techniques to a large image collection of human decomposition in the forensic anthropology domain and, as a result, have enabled the use of digital resources for research where fieldwork is typically the norm. We hope that this work can help other disciplines utilize the full potential of their data.
