Fuzziness Reduction and Training Set Generation for Text Classification in IT Services Domain

Extracting domain-specific concepts from large collections of unstructured text documents, in order to enable concept-based search, has become a critical step in meeting the core information management requirements of enterprises. Text analysis for concept extraction is computationally demanding, and only a small fraction of documents typically contain concepts worth extracting (less than 1% in our system). A key challenge is therefore how to select documents intelligently for detailed analysis without compromising recall.

In this paper, we present a methodology for semiautomatically generating training sets for a classifier, together with an algorithm for reducing the ambiguity inherent in our data. Both support the use of a Support Vector Machine (SVM) classifier to identify candidate documents for concept extraction in the IT Services domain. In such proprietary domains, privacy and security concerns mean that very few training datasets are publicly available. Furthermore, documents in the domain are normally noisy and fuzzy (they contain terms that do not discriminate across categories). Experiments show that our method improves SVM performance, achieving an accuracy of 92%, a recall of 99.2% and an F-measure of 92.6%.
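To make the notion of "fuzzy" terms concrete, the following minimal sketch separates terms that discriminate between two categories from terms that do not. It is an illustration only and is not the algorithm presented in the report: the function names, the document-frequency criterion, and the `min_gap` threshold are all assumptions introduced here.

```python
from collections import Counter


def doc_frequency(docs):
    """Return, for each term, the fraction of documents containing it."""
    counts = Counter()
    for doc in docs:
        # Count each term once per document, ignoring case.
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def split_terms(relevant_docs, irrelevant_docs, min_gap=0.5):
    """Partition the vocabulary into discriminating and fuzzy terms.

    Assumed criterion (hypothetical, not from the report): a term is
    discriminating when its document frequency differs between the two
    categories by at least `min_gap`; otherwise it is treated as fuzzy.
    """
    rel = doc_frequency(relevant_docs)
    irr = doc_frequency(irrelevant_docs)
    discriminating, fuzzy = set(), set()
    for term in rel.keys() | irr.keys():
        gap = abs(rel.get(term, 0.0) - irr.get(term, 0.0))
        (discriminating if gap >= min_gap else fuzzy).add(term)
    return discriminating, fuzzy


# Toy documents, invented for illustration.
relevant = ["server migration solution", "storage migration solution"]
irrelevant = ["server outage meeting", "storage outage meeting"]
discriminating, fuzzy = split_terms(relevant, irrelevant)
```

In this toy data, "server" and "storage" occur equally often in both categories and so end up in `fuzzy`, while the remaining terms occur in only one category and end up in `discriminating`. A real system would need a more careful criterion and tokenizer than this sketch uses.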

By: Yu Deng, Ruchi Mahindru, Nithya Rajamani, Soumitra Sarkar, Rafah Hosn, Murthy Devarakonda

Published in: RC25085 in 2010

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
