An Analysis of Naive Bayes Classifier on Low-Entropy Distributions

Naive Bayes classifier assumes that features are independent given class. Despite this unrealistic assumption, it is often highly competitive with more sophisticated learning techniques. Indeed, a classifier's optimality depends on its prediction of the most-likely class, rather than on the correct probability estimates. Although previous studies characterized certain cases of naive Bayes optimality, a better understanding of data characteristics that affect the performance of this classifier is still required. In this paper, we focus on problems including deterministic and almost-deterministic dependencies. Such dependencies are often present in practical problems, e.g., in error-correcting coding and in computer system performance management, to name a few. We derive error bounds for approximation of a joint distribution by the product of marginals for almost-deterministic (low-entropy) distributions, and demonstrate empirically how decreasing entropy of class-conditional feature distributions affects the error of naive Bayes classifier. We also prove that naive Bayes is optimal in certain cases of functionally dependent features. Then we relax those dependencies by adding noise, demonstrating empirically that naive Bayes reaches optimal performance in cases of completely independent and functionally dependent features, while it behaves worst between the two extremes.

By: Irina Rish, Joseph Hellerstein, Jayram Thathachar

Published in: RC21994 in 2001

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc21994.pdf

Questions about this service can be mailed to reports@us.ibm.com .