Bounds on Expansion in LZ'77-like Coding

We investigate the maximum increase in number of phrases that results from changing k consecutive symbols in a string x having length n parsed using an LZ'77-like algorithm. We consider a class of compression algorithms that partition a sequence into a collection y of (non-overlapping) variable-length phrases and encode them. Each phrase is either a singleton or matches a substring that starts to its left.

We show that changing a single symbol of x in position i can yield an expansion that is of order We also show that changing k consecutive symbols starting from position i can yield a expansion having a similar but somewhat more involved form. The paper contains both analytically-derived upper and lower bounds, and algorithms for numerically computing tighter bounds. While deriving the bounds, we provide a detailed analysis of how expansion can arise when changing consecutive symbols.

This problem is motivated by management policies for computer systems, such as the IBM Memory eXtension Technology (MXT) or the IBM iSeries compressed disks, that use LZ'77-like coding on small compression units (1-4 KB) and store the compressed data in memory or on disk tracks. Here, when a change of a portion of the compression unit occurs, (e.g., a L2 cache line, or a 512-byte disk sector) the data is re-compressed and potentially stored in a different location. Knowing the maximum expansion (rather than the average expansion) is an important factor for designing policies for allocation and management of memory (or disk) space.

By: Vittorio Castelli, Luis Alfonso Lastras-Montaño

Published in: RC23254 in 2004

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc23254.pdf

Questions about this service can be mailed to reports@us.ibm.com .