Auto-curation of Large Evolving Image Datasets
Large image collections are becoming common in many fields and offer tantalizing opportunities to transform how research, work, and education are conducted, provided that the information and insights they contain can be extracted. However, major obstacles to this vision exist. First, image datasets and their associated metadata contain errors and must be cleaned and organized before they can be easily explored and used. Second, such collections typically lack necessary context or have missing attributes that need to be recovered. Third, such datasets are domain-specific and require human experts to interpret the image content correctly. Fourth, the large size of these collections makes it time-consuming, costly, and in some cases infeasible to address the preceding problems. This dissertation aims to systematically address all four obstacles through curation, that is, organizing, structuring, and enriching the data in image collections. Specifically, we use a collection of 1M photos from forensic anthropology, as well as other smaller image datasets, to design and implement an auto-curation framework consisting of three overarching phases, with unsupervised and semi-supervised techniques and tools to support each phase. As a result, we have developed data exploration techniques to support initial understanding of large image collections, an unsupervised clustering method for organizing such collections, a human-machine collaboration method to enable mass labeling of data with relevant information, a semi-supervised method that reuses existing expert-provided content for a small portion of a dataset and propagates it to the remaining uncurated data, and a system to preserve, publish, and present the resulting curated data. Our evaluations of these techniques show that they outperform their corresponding state-of-the-art counterparts.
The general auto-curation framework and tools presented in this work are applicable to any large image dataset, and the techniques are specifically designed for image datasets with evolving content. We employed the proposed tools and techniques for a large image collection of human decomposition in the forensic anthropology domain and, as a result, have enabled the use of digital resources for research where fieldwork is typically the norm. We hope that this work can help other disciplines to utilize the full potential of their data.