An Interactive Approach to Document Classification

        Data mining and statistical techniques such as clustering have been applied with some success to large sets of documents to automatically produce meaningful subsets or classification hierarchies in the document space. Although they provide useful information, these automated techniques do not consistently produce classifications that reflect the categories human experts expect to see in the data.

        This paper presents a case study showing why a more interactive approach to document classification and data exploration is sometimes needed. Our methodology uses human expertise to refine and clarify the meaning of automated clustering outputs. Repeatable methods for human refinement of clustering outputs are difficult to capture rigorously, which poses research challenges that must be solved in order to satisfy important customer requirements.

        Our attempt to facilitate human refinement of clustering outputs has resulted in an interactive text clustering methodology, captured in a software tool called eClassifier. The software uses data visualization, graphical analysis, and statistical data summaries to communicate the substance of a clustering to a human expert. It also provides tools that let the user edit a clustering to incorporate domain knowledge and to suit the needs of the user's application.

By: Jeffrey T. Kreulen, Dharmendra S. Modha, W. Scott Spangler, H. Raymond Strong

Published in: RJ10159 in 1999

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.

Questions about this service can be mailed to reports@us.ibm.com.