Masters Theses

Date of Award

12-2019

Degree Type

Thesis

Degree Name

Master of Science

Major

Forestry

Major Professor

Timothy Young

Committee Members

Alexander Petutschnig, Bogdan Bichescu, Terry Liles

Abstract

Data mining and predictive analytics in the sustainable-biomaterials industries are currently not feasible given the lack of organization and management of the database structures. The advent of artificial intelligence, data mining, robotics, etc., has become a standard for successful business endeavors and is known as the ‘Fourth Industrial Revolution’ or ‘Industry 4.0’ in Europe. Data quality improvement through real-time multi-layer data fusion across interconnected networks and statistical quality assessment may improve the usefulness of databases maintained by these industries. Relational databases with a high degree of quality may be the gateway for predictive modeling and enhanced business analytics. Data quality is a key issue in the sustainable-biomaterials industry. Untreated data from multiple databases (e.g., sensor data and destructive test data) are generally not in the right structure to perform advanced analytics. Some inherent problems of data from sensors that are stored in data warehouses at millisecond intervals include missing values, duplicate records, sensor failure data (data out of feasible range), outliers, etc. These inherent problems of the untreated data represent information loss and weaken predictive analytics. The goal of this data-science-focused research was to create a continuous real-time software algorithm for data cleaning that automatically aligns, fuses, and assesses data quality for missing fields and potential outliers. The program automatically reduces the variable size, imputes missing values, and predicts the destructive test data for every record in a database. Improved data quality was assessed using 10-fold cross-validation and the normalized root mean square error of prediction (NRMSEP) statistic. The impact of outliers and missing data was tested on a simulated dataset with 201 variations of outlier percentages ranging from 0 to 90% and missing data percentages ranging from 0 to 90%. The software program was also validated on a real dataset from the wood composites industry. One result of the research was that the number of sensors needed for accurate predictions is highly dependent on the correlation between independent variables and dependent variables. Overall, the data cleaning software program significantly decreased the NRMSEP ranging from 64% to 12% of quality control variables for key destructive test values (e.g., internal bond, water absorption, and modulus of rupture).
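As a rough illustration of the evaluation described in the abstract, the sketch below computes an NRMSEP under 10-fold cross-validation. It is not the thesis software. The function names `nrmsep` and `kfold_nrmsep` are hypothetical, the ordinary least-squares model is only a stand-in for whatever predictive model the thesis uses, and dividing the RMSEP by the mean of the observed values is an assumed normalization (other conventions, such as dividing by the range, exist).

```python
import numpy as np


def nrmsep(y_true, y_pred):
    """RMSEP divided by the mean of the observed values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsep = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmsep / np.mean(y_true)


def kfold_nrmsep(X, y, k=10, seed=0):
    """Average NRMSEP of a least-squares fit over k held-out folds.

    X is a 2-D array of predictors (e.g., sensor readings) and y is the
    destructive test value; both are illustrative placeholders.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the record indices once and split them into k folds.
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Add an intercept column, then fit on the training folds only.
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Score the fitted model on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(nrmsep(y[test_idx], A_test @ coef))
    return float(np.mean(scores))
```

Under these assumptions, a lower value returned by `kfold_nrmsep` after cleaning than before would indicate improved predictive quality, which is the kind of comparison the abstract reports.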
