An Analysis of Data Characteristics that Affect Naive Bayes Performance

Despite its unrealistic independence assumption, the naive Bayes classifier is remarkably successful in practice. This paper identifies some data characteristics for which naive Bayes works well, such as certain deterministic and almost-deterministic dependencies (i.e., low-entropy distributions). First, we address zero-Bayes-risk problems, proving naive Bayes optimality for any two-class concept that assigns class 0 to exactly one example (i.e., $H(P(x_i|0)) = 0$). We demonstrate empirically that the entropy of $P(x_i|0)$ is a better predictor of the naive Bayes error than the class-conditional mutual information between features. Next, we consider a broader class of non-zero-Bayes-risk problems, further pursuing the study of low-entropy distributions. We derive error bounds for approximating the joint distribution by the product of marginals in the case of nearly deterministic class-conditional feature distributions $P(x_i|C)$, and we demonstrate how the performance of naive Bayes improves as the entropy of such distributions decreases. Finally, we consider functional dependencies between features and prove naive Bayes optimality in certain cases. Using Monte Carlo simulations, we show that naive Bayes works best in two cases: completely independent features (as expected from its assumptions) and functionally dependent features (which is surprising). Naive Bayes performs worst between these extremes.
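
The zero-Bayes-risk claim can be checked on a toy case. The sketch below is not the authors' code: it assumes binary features and a uniform distribution over all feature vectors (both illustrative assumptions, not stated in the abstract), and it uses exact probabilities instead of estimates from sampled data. The function name `naive_bayes_errors` and its parameters are hypothetical. It counts how many examples naive Bayes misclassifies when the concept assigns class 0 to exactly one example.

```python
import itertools
import math


def naive_bayes_errors(n_features, target):
    """Count naive Bayes mistakes on the concept 'class 0 iff x == target'.

    Assumes a uniform distribution over all binary vectors (an illustrative
    assumption) and computes every probability exactly by enumeration.
    """
    examples = list(itertools.product([0, 1], repeat=n_features))
    labels = [0 if x == target else 1 for x in examples]
    weight = 1.0 / len(examples)  # uniform P(x)

    # Class priors P(c).
    prior = {c: sum(weight for y in labels if y == c) for c in (0, 1)}

    def marginal(i, value, c):
        # Class-conditional marginal P(x_i = value | c).
        joint = sum(
            weight
            for x, y in zip(examples, labels)
            if y == c and x[i] == value
        )
        return joint / prior[c]

    def predict(x):
        # Naive Bayes score: P(c) times the product of the marginals.
        scores = {
            c: prior[c]
            * math.prod(marginal(i, x[i], c) for i in range(n_features))
            for c in (0, 1)
        }
        return max(scores, key=scores.get)

    return sum(predict(x) != y for x, y in zip(examples, labels))
```

Under these assumptions, a call such as `naive_bayes_errors(4, (1, 0, 1, 1))` counts the misclassified examples among all 16 binary vectors. Because $P(x|0)$ is a point mass, the class-0 score is zero for every example other than the target, which is the mechanism behind the optimality result in this simplified setting.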

By: Irina Rish, Joseph Hellerstein, Jayram Thathachar

Published in: RC21993 in 2001

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
