Modeling Document Taxonomies

Taxonomies are meaningful hierarchical categorizations of documents into topics, reflecting the natural relationships between the documents and their business objectives. Creating effective taxonomies while reducing the cost of building them is an important area of research. Unsupervised text clustering is one part of the solution, but automated clustering alone is insufficient to consistently create effective taxonomies in a business environment. To address this problem, we have developed tools that allow for a “mixed-initiative” approach to taxonomy development, in which human expertise is used to edit and refine a text clustering to make it more meaningful for a given application. Document taxonomies developed using mixed-initiative methods pose the following challenge: how do we model the taxonomy so that future documents will be classified correctly? We have developed a comprehensive approach to solving this problem and implemented it in a software tool called eClassifier. The crux of our solution is to apply a suite of classifiers, including both statistical and rule-based varieties, at each level of the taxonomic hierarchy, and then to choose for each category the classifier, or set of classifiers, that produces the most accurate results on unseen test documents. We tested various methods of combining these multiple classifiers against several different mixed-initiative taxonomies and against the standard Reuters data set. We show that in nearly all cases one method in particular performed better than the others, and that this method significantly improves upon any single-classifier approach.
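The per-category selection step described above can be illustrated with a minimal sketch. It is not eClassifier's implementation: the two toy classifiers (a keyword rule and a word-frequency scorer), the sample documents, and every function name are assumptions made for illustration. It covers only the case of choosing a single classifier per category in a flat set of categories, not sets of classifiers or multiple hierarchy levels.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A classifier answers: does this document belong to the category?
Classifier = Callable[[str], bool]


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def make_keyword_rule(keywords: List[str]) -> Classifier:
    """Hypothetical rule-based classifier: fires if any keyword occurs."""
    kw = set(keywords)
    return lambda doc: any(tok in kw for tok in tokenize(doc))


def make_frequency_classifier(pos: List[str], neg: List[str]) -> Classifier:
    """Hypothetical statistical classifier: compares normalized word
    frequencies from positive and negative training documents."""
    pos_counts = Counter(t for d in pos for t in tokenize(d))
    neg_counts = Counter(t for d in neg for t in tokenize(d))
    pos_total = sum(pos_counts.values()) or 1
    neg_total = sum(neg_counts.values()) or 1

    def classify(doc: str) -> bool:
        toks = tokenize(doc)
        pos_score = sum(pos_counts[t] / pos_total for t in toks)
        neg_score = sum(neg_counts[t] / neg_total for t in toks)
        return pos_score > neg_score

    return classify


def accuracy(clf: Classifier, held_out: List[Tuple[str, bool]]) -> float:
    """Fraction of held-out (document, label) pairs classified correctly."""
    return sum(clf(doc) == label for doc, label in held_out) / len(held_out)


def select_best(candidates: Dict[str, Classifier],
                held_out: List[Tuple[str, bool]]) -> str:
    """Return the name of the candidate with the highest held-out accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], held_out))


# Toy training documents for two made-up categories.
billing_docs = ["invoice payment overdue", "refund charged twice on card"]
hardware_docs = ["disk drive failure", "laptop screen broken"]

# Each category gets its own suite of candidate classifiers.
suites: Dict[str, Dict[str, Classifier]] = {
    "billing": {
        "keyword_rule": make_keyword_rule(["invoice", "refund"]),
        "frequency": make_frequency_classifier(billing_docs, hardware_docs),
    },
    "hardware": {
        "keyword_rule": make_keyword_rule(["disk", "screen"]),
        "frequency": make_frequency_classifier(hardware_docs, billing_docs),
    },
}

# Unseen test documents, labeled per category (True = belongs to it).
held_out: Dict[str, List[Tuple[str, bool]]] = {
    "billing": [("card payment declined", True), ("payment overdue again", True),
                ("broken disk", False), ("screen flickers", False)],
    "hardware": [("disk on card reader", True), ("screen payment terminal", True),
                 ("invoice overdue", False), ("refund charged", False)],
}

# Choose, independently for each category, the classifier that does best
# on that category's held-out documents.
chosen = {cat: select_best(suites[cat], held_out[cat]) for cat in suites}
print(chosen)
```

On this toy data the frequency-based classifier is selected for one category and the keyword rule for the other, which is the situation that motivates selecting per category rather than fixing one classifier for the whole taxonomy.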

By: Scott Spangler, Jeffrey Kreulen, Justin Lessler, David E. Johnson

Published in: RJ10288 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).


Questions about this service can be mailed to reports@us.ibm.com.