Exploratory Data Analysis in Large Sparse Datasets

Many applications of exploratory data analysis involve multivariate datasets that are large and high-dimensional, but quite sparse. Existing methods and computational algorithms are either expensive or inappropriate for these datasets. In this paper, we describe a modification of the Kohonen self-organizing maps algorithm for clustering and segmentation, whose storage and computational requirements are proportional to the data sparsity, rather than to the dimensions of the dataset. We also describe the use of a multidimensional scaling procedure that significantly improves the topological representation of the clusters obtained by the self-organizing maps algorithm. This methodology can be used in various applications including the analysis of retail shopping and credit card spending data, and text document indexing and classification.
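
As an illustration of how the cost of a self-organizing map step can depend on the number of nonzero entries in a record rather than on the full dimension, the following is a minimal sketch of one well-known device. It is not necessarily the modification described in the report. Each unit's weight vector is stored as a scalar times a vector, w = scale * v, and its squared norm is maintained incrementally, so both the distance computation and the update w <- (1 - alpha) w + alpha x touch only the nonzero coordinates of x. The names SparseUnit and train_step, the dict-based sparse records, the 1-D map, and the rectangular neighborhood are all assumptions made for the example. The weights themselves are stored densely here; only the per-record work is sparse.

```python
class SparseUnit:
    """One map unit with weight vector w = scale * v (hypothetical helper)."""

    def __init__(self, dim):
        self.v = [0.0] * dim   # unscaled weights
        self.scale = 1.0       # shared scalar factor
        self.sq_norm = 0.0     # ||w||^2, kept up to date incrementally

    def _dot(self, x):
        # x is a sparse record: {index: value}; cost is O(nonzeros of x)
        return self.scale * sum(val * self.v[i] for i, val in x.items())

    def sq_dist(self, x):
        # ||x - w||^2 = ||x||^2 - 2 x.w + ||w||^2
        x_sq = sum(val * val for val in x.values())
        return x_sq - 2.0 * self._dot(x) + self.sq_norm

    def update(self, x, alpha):
        # w <- (1 - alpha) w + alpha x, with 0 < alpha < 1
        dot = self._dot(x)
        x_sq = sum(val * val for val in x.values())
        keep = 1.0 - alpha
        self.sq_norm = (keep * keep * self.sq_norm
                        + 2.0 * alpha * keep * dot
                        + alpha * alpha * x_sq)
        self.scale *= keep     # shrinks every coordinate in O(1)
        for i, val in x.items():
            self.v[i] += alpha * val / self.scale

    def dense(self):
        # Materialize w; O(dim), intended for inspection only
        return [self.scale * vi for vi in self.v]


def train_step(units, x, alpha, radius=1):
    """Update the best-matching unit and its neighbors on a 1-D map."""
    bmu = min(range(len(units)), key=lambda k: units[k].sq_dist(x))
    lo = max(0, bmu - radius)
    hi = min(len(units) - 1, bmu + radius)
    for k in range(lo, hi + 1):
        units[k].update(x, alpha)
    return bmu
```

In a long training run the scalar factor shrinks steadily, so a practical version would occasionally fold scale back into v to avoid underflow. That step costs time proportional to the dimension, but it is needed only rarely.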

By: Ramesh Natarajan

Published in: RC20749 in 1997

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

Questions about this service can be mailed to reports@us.ibm.com.