CSVD: Clustering and Singular Value Decomposition for Approximate Similarity Searches in High Dimensional Space

Copyright © (2003) by IEEE. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distrubuted for profit. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

High-dimensionality indexing of feature spaces is critical for many data-intensive applications such as content-based retrieval of images or video from multimedia databases and similarity retrieval of
patterns in data mining. Unfortunately, even with the aid of the commonly-used indexing schemes, the performance of nearest neighbor (NN) queries (required for similarity search) deteriorates rapidly
with the number of dimensions. We propose a method called Clustering with Singular Value Decomposition (CSVD) to reduce the number of index dimensions, while maintaining a reasonably high precision for a given value of recall. In CSVD, homogeneous points are grouped into clusters such that the points in each cluster are more amenable to dimensionality reduction than the original dataset. Within-cluster searches can then be performed with traditional indexing methods, and
we describe a method for selecting the clusters to search in response to a query. Experiments with texture vectors extracted from satellite images show that CSVD achieves significantly higher dimensionality reduction than plain SVD for the same Normalized Mean Squared Error (NMSE). Conversely, for the same compression ratio, CSVD results in a decrease in NMSE with respect to simple SVD (a 6-fold decrease a 10:1 compression ratio). This translates to a higher efficiency in
processing approximate NN queries, as quantified through experimental results.

By: Vittorio Castelli, Alexander Thomasian, Chung-Sheng Li

Published in: IEEE Transactions on Knowledge and Data Engineering, volume 15, (no 3), pages 671-85 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RC21755.pdf

Questions about this service can be mailed to reports@us.ibm.com .