MDL-Based Pruning of 2[supn]-Tree Classifiers

This paper describes a technique for both decreasing the memory requirement and improving the performance of 2[supn]-tree classifiers (described in [1,2]). This technique is an extension of that proposed by Quinlan and Rivest [5] for growing and pruning certain types of decision tree classifiers. The technique based on the Minimum Description Length principle (MDL), proposed by Rissanen [7] as a measure of the goodness of models for describing data. In this technique the classification tree produced by the 2[supn]-tree training algorithms is considered to be a model for the class assignments of the training data conditioned on the associated feature vectors. A scheme is proposed for encoding the data using this model. This scheme consists of a code for the model (the 2[supn]-tree in this case) and a code for the class assignments conditioned on the model and the feature vectors. The latter consists of an encoding of the correct class codes for those training vectors misclassified by the 2[supn] classification tree. Associated with the encoding scheme is a total code length, consisting of two parts corresponding to the two components just mentioned. This codelength may be considered to be a measure of the complexity of the data with respect to the 2[supn]-tree model class and the model that produces the lowest complexity (code length) will be sought. This complexity measure is used in an algorithm where the initial tree is pruned, until the total code length begins to increase. When a branch is pruned from the tree, the code length of the model (tree) decreases and the code length of the class codes conditioned on the model either increases or remains unchanged. The classification trees produced by the pruning technique are shown experimentally to...

By: Byron E. Dom and David Steele

Published in: RJ8673 in 1996


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

Questions about this service can be mailed to .