Vector Space Models for Search and Cluster Mining

This chapter consists of two parts: a review of search and
cluster mining algorithms based on vector space modeling
followed by a description of a prototype search and cluster
mining system. In the review, we consider {\it latent
semantic indexing} (LSI), a method based on the singular
value decomposition of the document-attribute matrix, and
{\it principal component analysis} (PCA) of the document
vector covariance matrix. In the second part, we present
novel techniques for mining major and minor clusters from
massive databases based on enhancements of LSI and PCA and
automatic labeling of clusters based on their document
contents. Most mining systems have been designed to find
major clusters and they often fail to report information
on smaller, minor clusters. Minor cluster identification
is important in many business applications, such as
detection of credit card fraud, profile analysis, and
scientific data analysis. Another novel feature of our
method is the recognition and preservation of naturally
occurring overlaps among clusters. Cluster overlap analysis
is important for multiperspective analysis of databases.
Results from implementation studies with a prototype system
incorporating the techniques we developed and over 100,000
news articles demonstrate the effectiveness of the search and
clustering engines and the ease-of-use of the GUI.
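
As a rough illustration of the two baseline methods named in the abstract, the following minimal sketch applies textbook LSI (truncated SVD of a document-attribute matrix) and textbook PCA (eigendecomposition of the covariance matrix of the document vectors) to a tiny made-up matrix. It is not the authors' system and does not include their enhancements for minor clusters, overlap preservation, or labeling. The toy data, the reduced dimension `k`, the query vector, and the helper `cosine` are all assumptions introduced for the example.

```python
import numpy as np

# Hypothetical toy document-attribute matrix: one row per document,
# one column per attribute (e.g. a term weight). Not data from the chapter.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])
k = 2  # reduced dimension, chosen arbitrarily for this sketch

# --- LSI: truncated singular value decomposition of A ---
U, s, Vt = np.linalg.svd(A, full_matrices=False)
lsi_docs = U[:, :k] * s[:k]      # documents in the k-dimensional latent space

# --- PCA: eigendecomposition of the document vector covariance matrix ---
mean = A.mean(axis=0)
C = np.cov(A, rowvar=False)      # attribute-by-attribute covariance
eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
pca_docs = (A - mean) @ eigvecs[:, order]  # documents on the top-k components


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Search in the LSI space: project a query the same way as the documents
# (lsi_docs equals A @ Vt[:k].T), then rank documents by cosine similarity.
query = np.array([1.0, 1.0, 0.0, 0.0])
q_lsi = query @ Vt[:k].T
scores = [cosine(q_lsi, d) for d in lsi_docs]
ranking = np.argsort(scores)[::-1]
```

In this sketch the query shares attributes only with the first two documents, so those rank highest in the latent space, while the two documents built from the other attributes score near zero. Real collections would use sparse matrices and an iterative SVD or eigensolver rather than the dense routines shown here.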

By: Mei Kobayashi and Masaki Aono

Published in: A Comprehensive Survey of Text Mining in 2002

