Managing Statistical Behavior in Large Datasets

Very large data sets make the statistical properties of hashing functions become apparent. Such properties are manifested by the utilization patterns of buckets. Non-uniform bucket occupancy distributions have been experimentally established in databases that support identification, biological, and spatial systems. In this paper, we discuss the impact that the choice of a hashing function has on load balancing in very large data sets. We propose a two-stage methodology that allows us to exploit the features of a suggested hashing function. We exploit these features to re-design hashing functions so that they exhibit a number of desired properties. Experimental results from synthetic as well as real data are presented. The discussed methodology can be generally applicable to a large number of storage and/or retrieval systems and results in significant performance improvements. Keywords: hashing, hash or lookup table, invariant, preprocessing, load balancing, storage balancing, storage/retrieval, very large datasets.

By: Isidore Rigoutsos and Alex Delis

Published in: RC20648 in 1996

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

8414.ps.gz

Questions about this service can be mailed to reports@us.ibm.com .