Covariance Matrix Analysis for Outlier Detection

In this paper we introduce COV, a novel information retrieval
and data mining algorithm that uses vector space modeling
and spectral analysis of the document vector covariance matrix to
map the retrieval/mining problem into a lower-dimensional space. Since
the dimension of the covariance matrix depends on that of the
attribute space and is independent of the number of documents, COV
can be applied to databases that are too massive to be processed by
methods based on the singular value decomposition (SVD) of the
document-attribute matrix, such as latent semantic indexing
(LSI). In addition to improved scalability, COV selects basis vectors
for the lower-dimensional space and shifts the origin so that subtle
differences in document vectors can be detected more readily than
with LSI. We demonstrate the
significance of this feature of COV through an important application
in data mining, known as outlier cluster detection. We propose two
new algorithms for detecting major and outlier clusters in databases
-- the first is based on LSI, the second on COV -- and show through
implementation studies that our algorithm based on COV outperforms
the one based on LSI.
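
As an illustration of the dimension reduction described above, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes documents are given as rows of a dense array, that the covariance matrix is normalized by the number of documents, and that the k eigenvectors with the largest eigenvalues serve as the basis. The names cov_project, docs, and k are hypothetical.

```python
import numpy as np

def cov_project(docs, k):
    """Project document vectors onto the top-k eigenvectors of their covariance matrix.

    docs : array of shape (n_docs, n_attrs), one document vector per row (assumed layout)
    k    : dimension of the reduced space (hypothetical parameter)
    """
    docs = np.asarray(docs, dtype=float)

    # Shift the origin to the mean document vector.
    mean = docs.mean(axis=0)
    centered = docs - mean

    # The covariance matrix is n_attrs x n_attrs, so its size does not
    # depend on the number of documents. Dividing by n_docs is an assumption.
    cov = centered.T @ centered / docs.shape[0]

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    basis = eigvecs[:, order]          # columns are the selected basis vectors

    # Coordinates of each document in the lower-dimensional space.
    coords = centered @ basis
    return coords, basis, mean
```

For a very large collection, the same covariance matrix could in principle be accumulated in a single pass over the documents (a running sum of outer products and a running mean), so the full document-attribute matrix would never need to be held in memory; the dense version above is kept short for readability.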

By: Mei Kobayashi, Masaki Aono, Hironori Takeuchi, Hikaru Samukawa

Published in: RT0436 in 2002

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

