Matrix Computations for Information Retrieval and Major and Outlier Cluster Detection

In this paper we introduce COV, a novel information
retrieval (IR) algorithm based on vector space modeling of documents
in massive databases. Our algorithm uses information from spectral
analysis of the covariance matrix for the document vectors to reduce
the dimensionality of the IR problem. Since the dimension of the
covariance matrices depends only on the dimension of the attribute
space, COV can be applied to databases which are too massive for
methods based on the singular value decomposition of the
document-attribute matrix, such as latent semantic indexing
(LSI). In addition to improved scalability, theoretical
considerations indicate that our algorithm tends to be more accurate
than LSI, particularly in detecting subtle differences in
document vectors. We demonstrate the power and accuracy of COV
through an important topic in data mining, known as outlier cluster
detection. We propose two new algorithms for detecting major and
outlier clusters in databases -- the first is based on LSI, the
second on COV. Results from implementation studies show that our
cluster detection algorithm based on COV outperforms the
algorithm based on LSI in detecting outlier clusters.
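
The dimensionality reduction step described above can be illustrated with a
minimal sketch. It is not the authors' implementation; it assumes the
documents are stored as rows of a dense document-attribute array, that the
covariance matrix is normalized by the number of documents, and that NumPy is
available. The function name cov_project and the parameter k are hypothetical
names chosen for this illustration.

```python
import numpy as np

def cov_project(docs, k):
    """Project document vectors onto the top-k eigenvectors of their covariance matrix.

    docs: array of shape (n_docs, n_attrs), one document vector per row.
    k:    number of dimensions to keep (assumed k <= n_attrs).
    Returns (reduced, basis, mean).
    """
    docs = np.asarray(docs, dtype=float)
    mean = docs.mean(axis=0)
    centered = docs - mean
    # The covariance matrix is n_attrs x n_attrs, so its size does not
    # grow with the number of documents.
    cov = centered.T @ centered / docs.shape[0]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reverse the column order so the largest-eigenvalue directions come first.
    basis = eigvecs[:, ::-1][:, :k]
    reduced = centered @ basis
    return reduced, basis, mean

# Toy collection: four documents over three attributes (illustrative data only).
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
])
reduced, basis, mean = cov_project(docs, k=2)
```

Because only the attribute-by-attribute covariance matrix is decomposed, the
cost of the eigenvalue step in this sketch depends on the number of
attributes rather than on the number of documents.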

By: Mei Kobayashi, Masaki Aono, Hironori Takeuchi, Hikaru Samukawa

Published in: Journal of Computational and Applied Mathematics, volume 149, no. 1, pages 119-129, 2002
