Experimental Performance Evaluation to Enhance Database Compression on Commercial Servers

Data compression reduces redundancy within data so that the same amount of information can be stored or transmitted in fewer bits. It is widely used in data management and can be broadly classified into hardware and software compression. Commercial database servers now include proprietary dictionary-based compression algorithms, implemented as either a software or a hardware compression unit, to reduce the storage resources required on the server. In this paper, we evaluate various data compression algorithms with different configuration parameters and input data formats to investigate whether the compression performance of the dictionary-based algorithms implemented on commercial database servers can be improved. The Huffman coding algorithm produces an optimal prefix code tree and converts fixed-length symbols into variable-length code words. We cascaded Huffman coding with the dictionary-based compression algorithm and evaluated the proposed algorithm with different configuration parameters, using fabricated synthetic data as well as real customer data sets such as the On-Line Transaction Processing (OLTP) benchmark TPC-E, the decision support benchmark TPC-H, and a plain text file as input data sets. Experimental results show that coupling the dictionary-based compression algorithm with an entropy coding algorithm such as Huffman coding yields better compression, and the test benchmarks are compressed more efficiently. For example, the proposed compression method compresses OLTP benchmark tables at least 10% more efficiently. The evaluation also shows promising results for database benchmark compression using a single generic Huffman tree that is customized for a specific benchmark and generated from a set of input symbols randomly sampled from the corresponding benchmark data tables.
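
As a rough illustration of the cascade described above, the following Python sketch applies Huffman coding to the output symbols of a dictionary stage. It is a minimal sketch under stated assumptions, not the implementation evaluated in the report: the dictionary stage is not modeled, its output is assumed to be a list of integer dictionary indices, and the names `huffman_code`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(symbols):
    """Return a {symbol: bitstring} prefix code built from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(n, next(tie), {s: ""}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the variable-length code words for a symbol sequence."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode(); relies on the code being prefix-free."""
    inverse = {b: s for s, b in code.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out


# Assumed output of a dictionary stage: indices into a 3-entry dictionary.
dict_output = [0, 0, 1, 0, 2, 0, 1, 0]
code = huffman_code(dict_output)
bits = encode(dict_output, code)
# Fixed-length indices would need 2 bits each (16 bits in total).
print(len(bits), decode(bits, code) == dict_output)
```

The same `huffman_code` function could be given a random sample of symbols instead of the full input, which loosely mirrors the single generic tree idea in the abstract; in that case, symbols absent from the sample would need an escape mechanism that this sketch does not provide.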

By: Hangu Yeo, Vadim Sheinin, Petros Zerfos

Published in: RC25505 in 2014

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

rc25505.pdf

Questions about this service can be mailed to reports@us.ibm.com.