Duplicate Management for Reference Data

Recent studies show that reference or fixed-content data accounts for more than half of all newly created digital data, and is growing rapidly. Reference data is characterized by enormous quantities of largely similar data and very long retention periods. Its secure retention and eventual destruction are increasingly regulated by government agencies as more and more critical data is stored electronically and is vulnerable to unauthorized destruction and tampering. In this paper, we describe a storage system optimized for reference data. The system manages unique chunks of data to reliably and efficiently store large amounts of similar data, and to allow selected data to be efficiently shredded. We discuss ways to detect duplicate data, describing a sliding blocking method that greatly outperforms other methods. We also present practical ways to organize the metadata for the unique chunks, allowing most of it to be kept on disk and to be effectively prefetched when needed. Since electronic mail (email) is an important storage-intensive instance of reference data and is currently the intense focus of regulatory bodies, we use email as a sample application and analyze its storage characteristics in detail. We find that more than 30% of the blocks in an email data set are duplicates and that a duplicate block is most likely to occur within a few days of its previous occurrence. Our analysis further indicates that the effects of duplicate block elimination and of compression techniques such as block gzip seem to be relatively independent, so that they can be combined to achieve additive results.
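As a rough illustration of what "duplicate blocks" means in the abstract, the following minimal sketch counts repeated fixed-size blocks in a byte string by comparing content hashes. It is not the sliding blocking method the report describes. The function name, the 4096-byte block size, and the choice of SHA-256 are assumptions made only for this example.

```python
import hashlib


def duplicate_block_fraction(data: bytes, block_size: int = 4096) -> float:
    """Return the fraction of fixed-size blocks whose content was seen before.

    Hypothetical helper for illustration: blocks are cut at fixed offsets and
    identified by their SHA-256 digest (both choices are assumptions).
    """
    seen = set()      # digests of the unique blocks encountered so far
    total = 0         # number of blocks examined
    duplicates = 0    # number of blocks whose digest was already in `seen`

    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        total += 1
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)

    # An empty input has no blocks, so report zero duplicates.
    return duplicates / total if total else 0.0
```

A store built along these lines would keep one copy of each unique block and record only a reference for each repeat. Fixed offsets miss duplicates that are shifted by an insertion or deletion, which is the kind of case a sliding approach is meant to handle.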

By: Timothy E. Denehy, Windsor W. Hsu

Published in: RJ10305 in 2003

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

