A Method for Improving Lossless Compression of Aligned DNA Sequence

The huge volumes of genome sequencing data generated with Next Generation Sequencing technology demand compression tools that can handle the increasing costs associated with storing and transmitting those data. We present a method for efficient lossless compression of DNA sequences stored in the SAM file format. The method compresses alignment information, read sequences, and quality values separately using customized compression algorithms. We concentrated our efforts on optimizing read base sequence compression, whose key feature is run-length encoding based on classified and sorted read sequences. The quality values are compressed adaptively with run-length coding and predictive coding, followed by Huffman coding. The proposed method generates a mapped difference from the reference to the target DNA sequences and classifies the mapped differences into a number of groups according to the number of mismatched read bases within a sequence. The read sequences and quality values within each group are compressed losslessly with different coding strategies. The coding schemes are chosen by analyzing the histogram of the indexes of mismatched read bases within the sequence and the neighboring quality values around the quality value to be compressed. We focused on compressing the sequences classified into the first two groups, perfectly matched sequences and sequences with a single mismatched base, because those sequences account for the majority of the sequences to be compressed. Compared with well-known compression tools, the proposed algorithm clearly outperforms them.
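The classification and run-length steps described above can be illustrated with a minimal Python sketch. It is not the report's implementation. All function names are hypothetical, reads are assumed to have a fixed length and to lie fully inside the reference, and run-length coding the gaps between sorted start positions is only one plausible reading of "run-length encoding based on classified and sorted read sequences". Quality-value coding is not covered.

```python
from collections import defaultdict
from itertools import groupby


def mismatch_positions(reference, read, start):
    """Return indexes within the read where it differs from the reference."""
    window = reference[start:start + len(read)]
    return [i for i, (ref_base, base) in enumerate(zip(window, read))
            if ref_base != base]


def classify_reads(reference, aligned_reads):
    """Group (start, read) pairs by their number of mismatched bases."""
    groups = defaultdict(list)
    for start, read in aligned_reads:
        diffs = mismatch_positions(reference, read, start)
        groups[len(diffs)].append((start, read, diffs))
    return groups


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def encode_perfect_matches(group):
    """Assumed scheme: sort perfectly matched reads by start position and
    run-length encode the gaps between consecutive starts."""
    starts = sorted(start for start, _, _ in group)
    if not starts:
        return None
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return starts[0], run_length_encode(gaps)


# Toy data, invented for illustration only.
reference = "ACGTACGTACGT"
reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC"), (4, "ACGA"), (3, "TACG")]

groups = classify_reads(reference, reads)
perfect = encode_perfect_matches(groups[0])  # group 0: no mismatches
single = groups[1]                           # group 1: one mismatched base
```

In this toy input four reads match the reference perfectly and one read has a single mismatch at index 3. A decoder would rebuild the perfectly matched reads from the first start position, the run-length coded gaps, the fixed read length, and the reference itself.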

By: Hangu Yeo, Vadim Sheinin

Published in: RC25449 in 2014


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

