Date of Award


Degree Type


Degree Name

Doctor of Philosophy


Computer Engineering

Major Professor

Qing Cao

Committee Members

Hairong Qi, Yilu Liu, Jindong Tan


Data has overwhelmed the digital world in terms of volume, variety, and velocity, and data-intensive applications face unprecedented challenges. At the same time, computation resources such as memory are scarce relative to the scale of the data, yet certain applications must process large amounts of data in a time-efficient manner. Probabilistic approaches are a compromise among these three concerns: large amounts of data, limited computation resources, and high time efficiency. Although such approaches cannot guarantee 100% correctness, their error rates are predictable and adjustable depending on the available computation resources and time constraints.

Data storage and data integrity checking are two fundamental components of data-intensive applications. Among various data storage platforms, key-value storage is crucial for many applications, such as social networks, online retailing, and cloud computing. Such storage supports operations on key-value pairs and can reside in memory to speed up responses to queries. Existing methods have so far been deterministic; providing such accuracy, however, comes at the cost of memory and CPU time. In contrast, we present an approximate key-value storage that is more compact and efficient than existing methods.

Besides data storage, ensuring data integrity throughout the data life cycle is also of paramount importance, particularly in large-scale high-performance computing (HPC) applications. Because scientific data can take millions of compute hours to generate, the results often need to be sanitized, validated, archived for long-term storage, and shared with the scientific community for further analysis. Ensuring the integrity of the full dataset at scale is a daunting task, considering that most conventional tools are serial and file-based and cannot scale. To tackle this challenge, we present the design, implementation, and evaluation of two Bloom-filter-based scalable parallel checksumming tools for data integrity checking and data corruption detection.
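To make concrete the trade-off between memory and a predictable error rate on which a Bloom filter rests, the following is a minimal sketch in Python. It is not the implementation of the tools described in this dissertation; the class name `BloomFilter`, its parameters, and the SHA-256 double-hashing scheme are illustrative assumptions. The sizing uses the standard textbook formulas for the number of bits and hash functions given an expected item count and a target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical, not the dissertation's tool)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits,
        # k = (m / n) * ln 2 hash functions.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two 64-bit
        # halves of a single SHA-256 digest (an assumed scheme).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        # Set every bit position associated with the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In an integrity-checking setting, one might insert a digest of each file chunk into such a filter and later test chunks against it: an inserted chunk is never reported absent, while an unseen chunk is reported present only with roughly the configured probability. Lowering `false_positive_rate` increases `num_bits`, which is the adjustable memory-for-accuracy trade-off described above.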
