
A Probabilistic Software Framework for Scalable Data Storage and Integrity Check

Date Issued
August 1, 2017
Author(s)
Xiong, Sisi  
Advisor(s)
Qing Cao
Additional Advisor(s)
Hairong Qi
Yilu Liu
Jindong Tan
Permanent URI
https://trace.tennessee.edu/handle/20.500.14382/25916
Abstract

Data has overwhelmed the digital world in terms of volume, variety, and velocity, and data-intensive applications face unprecedented challenges. At the same time, computational resources such as memory are scarce relative to the scale of the data, yet certain applications must still process large amounts of data in a time-efficient manner. Probabilistic approaches offer a compromise among three competing demands: large amounts of data, limited computational resources, and high time efficiency. Although these approaches cannot guarantee 100% correctness, their error rates are predictable and can be adjusted according to the available computational resources and time constraints.


Data storage and data integrity checking are two fundamental components of data-intensive applications. Among the various data storage platforms, key-value storage is crucial for many applications, such as social networks, online retailing, and cloud computing. Such storage supports operations on key-value pairs and can reside in memory to speed up responses to queries. Existing methods have so far been deterministic, and the exactness they provide comes at the cost of memory and CPU time. In contrast, we present an approximate key-value storage that is more compact and efficient than existing methods.

Besides data storage, ensuring data integrity throughout the data life cycle is of paramount importance, particularly in large-scale high-performance computing (HPC) applications. Because scientific data can take millions of compute hours to generate, the results often need to be sanitized, validated, archived for long-term storage, and shared with the scientific community for further analysis. Ensuring the integrity of a full dataset at scale is a daunting task, since most conventional tools are serial and file-based and cannot scale. To tackle this challenge, we present the design, implementation, and evaluation of two Bloom filter-based, scalable, parallel checksumming tools for data integrity checking and data corruption detection.
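
As a rough illustration of the idea behind Bloom filter based checksumming, the following is a minimal sketch and not the dissertation's tools. All names (`BloomFilter`, `chunk_keys`, `build_filter`, `corrupted_offsets`) and the chunking scheme are assumptions made for this example. The filter is sized with the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, which is what makes the error rate predictable and adjustable against available memory.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter (hypothetical helper, sized from n and p)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def chunk_keys(data: bytes, chunk_size: int):
    """Yield (offset, key) pairs; the key binds a chunk digest to its offset."""
    for offset in range(0, len(data), chunk_size):
        digest = hashlib.md5(data[offset:offset + chunk_size]).digest()
        yield offset, offset.to_bytes(8, "big") + digest


def build_filter(data: bytes, chunk_size: int, fp_rate: float = 0.001) -> BloomFilter:
    """Summarize a reference copy of the data as a compact signature."""
    num_chunks = max(1, math.ceil(len(data) / chunk_size))
    bloom = BloomFilter(num_chunks, fp_rate)
    for _, key in chunk_keys(data, chunk_size):
        bloom.add(key)
    return bloom


def corrupted_offsets(bloom: BloomFilter, data: bytes, chunk_size: int):
    """Return offsets of chunks whose keys are definitely not in the signature."""
    return [off for off, key in chunk_keys(data, chunk_size) if key not in bloom]
```

Under these assumptions, an intact copy never reports a corrupted chunk, because a Bloom filter has no false negatives. A corrupted chunk can go undetected only with a probability of roughly `fp_rate`, which can be lowered by spending more bits per chunk. Because chunks are keyed independently, the digests could in principle be computed in parallel before being added to the filter.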

Subjects
Probabilistic approac...
Bloom filter
data storage
integrity check

Disciplines
Data Storage Systems
Other Computer Engineering
Degree
Doctor of Philosophy
Major
Computer Engineering
Embargo Date
January 1, 2011
File(s)
Name
Dissertation_Sisi_Xiong.pdf
Size
4.31 MB
Format
Adobe PDF
Checksum (MD5)
23d213ab217e0745f59e8ef9e97f0590
