MDL Estimation with Small Sample Sizes Including an Application to the Problem of Segmenting Binary Strings Using Bernoulli Models

Minimum Description Length (MDL) estimation has proven itself of major importance in a large number of applications many of which are in the fields of computer vision and pattern recognition. A problem is encountered in applying the associated formulas, however, especially those associated with model cost. This is because most of these are asymptotic forms appropriate only for large sample sizes. Rissanen [27] has recently derived sharper codelength formulas valid for much smaller sample sizes. Because of the importance of these results, it is our intent here to present a tutorial description of them. In keeping with this goal we have chosen a simple application whose relative tractability allows it to be explored more deeply than most problems: the segmentation of binary strings based on a piecewise Bernoulli assumption. By that we mean that the strings are assumed to be divided into substrings, the bits of which are assumed to have been generated by a single (within a substring) Bernoulli source. After investigating the solution of this problem, we will discuss the extension of these results to more important problems such as piecewise polynomial curve futting and image segmentation, both of which we plan for future work.

By: Byron E. Dom

Published in: RJ9997 in 1995

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

8087.ps.gz

Questions about this service can be mailed to reports@us.ibm.com .