AN INFORMATION-THEORETIC EXTERNAL CLUSTER-VALIDITY MEASURE

In this paper we propose a measure of similarity/association between two partitions of
a set of objects. Our motivation is the desire to use the measure to characterize the quality
or accuracy of clustering algorithms by somehow comparing the clusters they produce with
"ground truth" consisting of classes assigned to the patterns by manual means or some other
means in whose veracity there is confidence. Such measures are referred to as "external".
Our measure also allows clusterings with different numbers of clusters to be compared in a
quantitative and principled way. Our evaluation scheme quantitatively measures how useful
the cluster labels of the patterns are as predictors of their class labels. When all clusterings
to be compared have the same number of clusters, the measure is equivalent to the mutual
information between the cluster labels and the class labels. In cases where the numbers of
clusters are different, however, it computes the reduction in the number of bits that would
be required to encode (compress) the class labels if both the encoder and decoder have free
access to the cluster labels. To achieve this encoding the estimated conditional probabilities
of the class labels given the cluster labels must also be encoded. These estimated proba-bilities
can be seen as a "model" for the class labels and their associated code length as a
"model cost". In addition to defining the measure we compare it to other commonly used
external measures and demonstrate its superiority as judged by certain criteria.

By: Byron E. Dan

Published in: RJ10219 in 2001

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RJ10219.pdf

Questions about this service can be mailed to reports@us.ibm.com .