Quick Access to Compressed Data in Storage Systems

Primary storage systems that compress data in real time use some form of on-disk metadata to perform the virtualization needed to store compressed data. This metadata usually takes the form of B-trees (possibly compressed themselves) stored on disk. For random accesses to compressed data whose metadata is not in cache, this additional layer significantly slows down random reads and writes. Our solution is to use much less metadata, which provides only an approximation of the location of compressed data on disk and fits easily in the memory of the storage system. Read operations are extended to compensate for the imprecise position information in the metadata, and index marks embedded in the data are used to locate the required data within the expanded read. The placement of written data is constrained so that it can be described by the reduced metadata. The placement uses a piecewise linear scheme based on the locality in compressibility of data, and we support this assumption with experiments.
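The idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the record layout, the MAGIC signature, the Segment class, and the slope and slack parameters are all hypothetical names chosen for illustration. One linear piece predicts where block i should start, the writer refuses placements that drift too far from that prediction, and the reader issues a single expanded read and scans it for the index mark of the block it wants. The sketch assumes blocks are written once, in increasing order.

```python
import struct
import zlib

MAGIC = b"\xa5IDX"               # hypothetical index-mark signature
HEADER = struct.Struct("<4sIH")  # mark: signature, logical block number, payload length
BLOCK = 4096                     # assumed uncompressed block size
MAX_PAYLOAD = BLOCK + 64         # assumed upper bound on one compressed block


class Segment:
    """One linear piece: block i is predicted to start at base + slope * i."""

    def __init__(self, base, slope, slack, nblocks):
        self.base, self.slope, self.slack = base, slope, slack
        # Simulated disk, large enough for the last block at maximum drift.
        self.disk = bytearray(base + slope * nblocks + slack + HEADER.size + MAX_PAYLOAD)
        self.cursor = base  # first free byte; writes are sequential in this sketch

    def predict(self, i):
        # The only per-segment metadata is (base, slope): small enough for memory.
        return self.base + self.slope * i

    def write(self, i, data):
        payload = zlib.compress(data)
        start = max(self.cursor, self.predict(i))
        # Placement constraint: the record must start within `slack` of the prediction.
        if start - self.predict(i) > self.slack:
            raise ValueError("placement drifted past slack; a new segment would be needed")
        record = HEADER.pack(MAGIC, i, len(payload)) + payload
        self.disk[start:start + len(record)] = record
        self.cursor = start + len(record)

    def read(self, i):
        # Expanded read: cover every allowed start position plus one full record.
        lo = self.predict(i)
        window = bytes(self.disk[lo:lo + self.slack + HEADER.size + MAX_PAYLOAD])
        pos = window.find(MAGIC)
        while pos != -1 and pos + HEADER.size <= len(window):
            _, block_no, n = HEADER.unpack_from(window, pos)
            body = window[pos + HEADER.size:pos + HEADER.size + n]
            # Accept only the mark that names the requested block; a real system
            # would also need a checksum to reject signatures occurring by chance.
            if block_no == i and len(body) == n:
                return zlib.decompress(body)
            pos = window.find(MAGIC, pos + 1)
        raise KeyError(i)


# Usage: eight 4 KiB blocks written in order, then one read back by number.
seg = Segment(base=0, slope=512, slack=8192, nblocks=8)
blocks = [bytes([j % (k + 2) for j in range(BLOCK)]) for k in range(8)]
for k, b in enumerate(blocks):
    seg.write(k, b)
assert seg.read(5) == blocks[5]
```

In this sketch the slope plays the role of the expected compressed size per block, and the slack is the price paid for imprecision: a larger slack tolerates more variation in compressibility but makes every read longer.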

By: Cornel Constantinescu, David Chambliss

Published in: Proceedings of the 2016 Data Compression Conference (DCC), Piscataway, NJ, IEEE, 2016, doi:10.1109/DCC.2016.51

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

RJ10533.pdf

Questions about this service can be mailed to reports@us.ibm.com.