Masters Theses

Date of Award

12-1997

Degree Type

Thesis

Degree Name

Master of Science

Major

Computer Science

Major Professor

Michael W. Berry

Committee Members

Bradley Vander Zanden, June Donato

Abstract

Data mining is the application of algorithms for extracting valuable information from large databases in order to make important business decisions. This study explores a new technique for data mining: Latent Semantic Indexing (LSI). LSI is an efficient information retrieval method for textual documents. By determining the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model which represents important associative relationships between terms and documents that are not evident in individual documents. This thesis explores the applicability of the LSI model to numerical databases, especially consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document incidence matrix is built, and then a distribution-based indexing scheme is employed to construct a correlated distribution matrix. Hence a similar LSI vector space model can be generated to detect useful or hidden patterns in the databases. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method; its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.
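
The SVD step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy incidence matrix, the choice of rank k = 2, and the helper names lsi_embed and cosine are assumptions made for illustration, and the distribution-based indexing scheme is not reproduced because the abstract does not specify it. A dense SVD is used for brevity; a large sparse matrix would more likely call for a sparse solver such as scipy.sparse.linalg.svds.

```python
import numpy as np


def lsi_embed(A, k):
    """Truncate the SVD of a term-by-document matrix A to rank k.

    Returns term coordinates, document coordinates, and the rank-k
    approximation of A. (Hypothetical helper, for illustration only.)
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    term_vecs = U_k * s_k        # one row per term, k columns
    doc_vecs = Vt_k.T * s_k      # one row per document, k columns
    A_k = (U_k * s_k) @ Vt_k     # rank-k approximation of A
    return term_vecs, doc_vecs, A_k


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy incidence matrix (assumed data): rows are terms, columns are documents.
# Terms 1 and 2 never appear in the same document, but each co-occurs
# with term 0.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

term_vecs, doc_vecs, A_k = lsi_embed(A, k=2)

raw_sim = cosine(A[1], A[2])                   # similarity in the raw matrix
lsi_sim = cosine(term_vecs[1], term_vecs[2])   # similarity in the LSI space
print(f"raw: {raw_sim:.3f}  lsi: {lsi_sim:.3f}")
```

In this toy case, terms 1 and 2 share no document and so are orthogonal in the raw matrix, yet they become close in the reduced space because both co-occur with term 0. That is the kind of associative relationship "not evident in individual documents" that the abstract refers to.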
