Redundancy Elimination within Large Collections of Files

Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL also uses super-fingerprints, a technique that reduces the data needed to identify similar blocks and therefore the computational requirements of this process. As a result, REBL encodes more compactly than compression and duplicate suppression while executing faster than generic delta-encoding. For the data sets analyzed, REBL improved on the space reduction of other techniques by factors of 4-23 in the best case.
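
The interplay of duplicate block suppression, super-fingerprints, and delta-encoding described above can be illustrated with a minimal sketch. It is not the REBL implementation: the block boundaries are taken as given, SHA-1 stands in for whatever hash functions the system actually uses, and the names `features`, `super_fingerprints`, and `classify`, as well as the window width, feature count, and group size, are all assumptions chosen for illustration. The sketch only decides how each block would be encoded; it does not perform the compression or the delta-encoding itself.

```python
import hashlib

def features(block, count=8, width=8):
    """Min-hash features: for each salted hash, keep the smallest window hash.

    The window width and feature count are illustrative assumptions.
    """
    last = max(1, len(block) - width + 1)
    windows = {block[i:i + width] for i in range(last)}
    result = []
    for salt in range(count):
        prefix = salt.to_bytes(4, "big")
        result.append(min(hashlib.sha1(prefix + w).digest()[:8] for w in windows))
    return result

def super_fingerprints(block, count=8, group=4):
    """Hash each group of features into one super-fingerprint.

    Two blocks share a super-fingerprint only if every feature in that
    group matches, so a shared value suggests strong similarity.
    """
    feats = features(block, count)
    sfps = []
    for start in range(0, count, group):
        joined = b"".join(feats[start:start + group])
        # Keep the group position in the key so that different groups never collide.
        sfps.append((start, hashlib.sha1(joined).hexdigest()))
    return sfps

def classify(blocks):
    """Return one (action, reference_index) pair per block.

    "duplicate": identical to an earlier block, store a reference only.
    "delta":     shares a super-fingerprint with an earlier block, so it is
                 a candidate for delta-encoding against that block.
    "compress":  no similar block seen yet, compress it on its own.
    """
    seen_digests = {}
    seen_sfps = {}
    plan = []
    for index, block in enumerate(blocks):
        digest = hashlib.sha1(block).digest()
        if digest in seen_digests:
            plan.append(("duplicate", seen_digests[digest]))
            continue
        seen_digests[digest] = index
        sfps = super_fingerprints(block)
        match = next((seen_sfps[s] for s in sfps if s in seen_sfps), None)
        if match is not None:
            plan.append(("delta", match))
        else:
            plan.append(("compress", None))
            for s in sfps:
                seen_sfps.setdefault(s, index)
    return plan

blocks = [b"a" * 64, b"a" * 64, b"a" * 80, bytes(range(64))]
plan = classify(blocks)
```

In this toy input the second block is an exact copy of the first, the third differs in length but has the same set of windows (and therefore the same super-fingerprints), and the fourth has nothing in common with the others. Comparing a handful of super-fingerprints per block, rather than all features, is what keeps the similarity search cheap.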

By: Purushottam Kulkarni, Fred Douglis, Jason LaVoie, John M. Tracey

Published in: RC23042 in 2003


