Masters Theses

Date of Award

8-1996

Degree Type

Thesis

Degree Name

Master of Science

Major

Computer Science

Major Professor

Michael W. Berry

Committee Members

Brad Vander Zanden, David Straight

Abstract

As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming Interface (API) allows applications to immediately use LSI without knowing the implementation details of the underlying system. LSI++ supports both serial and distributed searching of large data sets, providing the same programming interface regardless of the implementation actually executing. In addition, a World-Wide Web interface was created to allow simple, intuitive searching of document collections using LSI++. Timing results indicate the serial implementation of LSI++ searches up to 6 times faster than the original implementation of LSI, while the parallel implementation searches nearly 180 times faster on large document collections.
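The reduced-rank idea described in the abstract can be illustrated with a minimal sketch. This is not the LSI++ API or the thesis implementation; the function name `lsi_rank`, the toy vocabulary, and the choice to project both query and documents onto the leading left singular vectors are assumptions made purely for illustration, using NumPy's `numpy.linalg.svd`.

```python
import numpy as np

def lsi_rank(term_doc, query, k):
    """Rank documents against a query in a rank-k space (illustrative sketch only)."""
    # Thin SVD of the term-document matrix: term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Document coordinates in the k-dimensional space, one row per document.
    docs_k = Vt_k.T * s_k
    # Project the query onto the same k left singular vectors.
    q_k = query @ U_k
    # Cosine similarity between the projected query and each projected document.
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / np.where(norms == 0, 1.0, norms)
    # Document indices sorted from most to least similar.
    order = np.argsort(-sims)
    return order, sims

# Hypothetical toy collection. Rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
], dtype=float)

# Query containing only the word "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)

order, sims = lsi_rank(A, query, k=2)
```

In this toy collection the second document never contains the word "car", yet in the rank-2 space it scores as highly as the first, because both share "engine". A purely lexical match on "car" would return only the first document, which is the kind of concept-level matching the abstract attributes to vector-space methods.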
