Date of Award
Doctor of Philosophy
Hairong Qi, Yilu Liu, Jindong Tan
Data has overwhelmed the digital world in terms of volume, variety, and velocity, and data-intensive applications face unprecedented challenges. Computation resources such as memory, on the other hand, are scarce relative to the scale of the data. Yet certain applications must process large amounts of data in a time-efficient manner. Probabilistic approaches are a compromise among these three concerns: large amounts of data, limited computation resources, and high time efficiency. Such approaches cannot guarantee 100% correctness, but their error rates are predictable and can be adjusted according to the available computation resources and time constraints.
Data storage and data integrity checking are two fundamental components of data-intensive applications. Among the various data storage platforms, key-value storage is crucial for many applications, such as social networks, online retailing, and cloud computing. Such storage supports operations on key-value pairs and can reside in memory to speed up responses to queries. Existing methods have so far been deterministic, that is, they always return exact answers. Providing such accuracy, however, comes at the cost of memory and CPU time. In contrast, we present an approximate key-value storage that is more compact and efficient than existing methods.
Besides data storage, ensuring data integrity throughout the data life cycle is also of paramount importance, particularly in large-scale high-performance computing (HPC) applications. Since scientific data can take millions of compute hours to generate, the results often need to be sanitized, validated, archived for long-term storage, and shared with the scientific community for further analysis. Ensuring the integrity of a full dataset at scale is a daunting task, considering that most conventional tools are serial and file-based and therefore cannot scale. To tackle this challenge, we present the design, implementation, and evaluation of two Bloom filter based scalable parallel checksumming tools for data integrity checking and data corruption detection.
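To illustrate the idea of Bloom filter based checksumming mentioned above, the following is a minimal single-process sketch in Python. It is not the design of the tools presented in the dissertation; the names `BloomFilter`, `build_signature`, and `suspect_offsets`, the use of SHA-256 with double hashing, and the parameter values are all assumptions chosen for illustration. A dataset is split into fixed-size chunks, each chunk (tagged with its offset) is inserted into a Bloom filter that serves as a compact signature, and a copy is later checked against that signature. A chunk that is absent from the filter has certainly changed, while a chunk that is present is intact only with high probability, and that probability is governed by the filter size `m`, the number of hash functions `k`, and the number of inserted chunks `n`.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation of the false-positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


def chunk_keys(data: bytes, chunk_size: int):
    """Yield one key per chunk: 8-byte offset followed by the chunk bytes."""
    for off in range(0, len(data), chunk_size):
        yield off.to_bytes(8, "big") + data[off:off + chunk_size]


def build_signature(data: bytes, chunk_size: int = 4, m: int = 1 << 16, k: int = 4) -> BloomFilter:
    """Insert every (offset, chunk) key of the dataset into a Bloom filter."""
    signature = BloomFilter(m, k)
    for key in chunk_keys(data, chunk_size):
        signature.add(key)
    return signature


def suspect_offsets(data: bytes, signature: BloomFilter, chunk_size: int = 4):
    """Return offsets of chunks that are definitely not in the signature."""
    return [
        int.from_bytes(key[:8], "big")
        for key in chunk_keys(data, chunk_size)
        if key not in signature
    ]


if __name__ == "__main__":
    original = b"scientific dataset payload"
    signature = build_signature(original)
    corrupted = original[:5] + b"X" + original[6:]  # flip one byte in the second chunk
    print("suspect chunks in original:", suspect_offsets(original, signature))
    print("suspect chunks in corrupted copy:", suspect_offsets(corrupted, signature))
    print("estimated false-positive rate:", false_positive_rate(signature.m, signature.k, 7))
```

Because a Bloom filter has no false negatives, an unmodified copy never produces suspect chunks. A corrupted chunk can escape detection only through a false positive, and enlarging `m` (spending more memory) lowers that rate, which is the kind of adjustable accuracy and resource trade-off that probabilistic approaches offer.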
Xiong, Sisi, "A Probabilistic Software Framework for Scalable Data Storage and Integrity Check." PhD diss., University of Tennessee, 2017.