Interactive Methods for Taxonomy Editing and Validation

Today’s enterprise understands that making better use of its collective knowledge assets improves business performance. The proliferation of electronic information, together with pressure to produce more with fewer resources while performing increasingly complex tasks, makes this a continuous challenge. To address it, enterprises are building knowledge repositories and structuring them in ways that are meaningful to their organization, business, and processes. This structuring typically takes the form of one or more taxonomies: meaningful hierarchical categorizations of documents into topics that reflect the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the cost of creating them is therefore an important area of research. Supervised and unsupervised text clustering are automated approaches to creating and maintaining document taxonomies. However, human expertise also has an indispensable role in guiding the taxonomy generation process and validating the results. To this end, we have developed an interactive approach to taxonomy creation and validation. The approach helps the taxonomy editor understand and evaluate each category of a taxonomy and visualize the relationships between categories. Multiple techniques allow the user to make changes at both the category and the document level. Metrics then establish how well the resulting taxonomy can be modeled for future document classification. Our approach also supports developing multiple taxonomies, so that multiple relationships among the documents can be modeled. In this paper, we present our approach to document taxonomy creation and modification and then demonstrate its effectiveness in real-time analysis and reporting of discussion forum topics during IBM’s corporate-wide “ValuesJam” event.

By: Scott Spangler, Jeffrey Kreulen

Published in: IBM Research Report RJ10300, 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

