Topic Words Extraction from Categorized Document Sets

In this paper, we propose a new measure for extracting topic words from categorized document sets. The measure is based on the partitioning chi-squared statistic. When a document collection contains noisy words whose appearance probabilities vary across categories, conventional measures fail to extract the topic words that have a high appearance probability in a particular category. A simulation study shows that the new measure robustly extracts topic words from such a noisy document collection. We apply the measure to real data for problem detection.
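
For orientation, the kind of conventional measure the abstract contrasts against can be sketched as a plain Pearson chi-squared score on a 2x2 word-by-category table. This is not the partitioning-based measure proposed in the paper, whose definition is not given here. The function names `chi_squared_2x2` and `score_words` and the input layout (a list of category and word-set pairs) are assumptions made for illustration only.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table.

    a: documents in the target category that contain the word
    b: documents in the target category that do not contain it
    c: documents in other categories that contain the word
    d: documents in other categories that do not contain it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column total is zero, so the statistic is undefined;
        # treat the word as carrying no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_words(docs, target):
    """Score every word against one category (hypothetical helper).

    docs: list of (category, set_of_words) pairs.
    Returns a dict mapping each word to its chi-squared score.
    """
    in_target = [words for cat, words in docs if cat == target]
    others = [words for cat, words in docs if cat != target]
    vocabulary = set().union(*(words for _, words in docs))
    scores = {}
    for word in vocabulary:
        a = sum(word in words for words in in_target)
        c = sum(word in words for words in others)
        scores[word] = chi_squared_2x2(a, len(in_target) - a, c, len(others) - c)
    return scores


# Tiny hand-made example (illustrative data, not from the paper).
docs = [
    ("a", {"x", "n"}),
    ("a", {"x"}),
    ("b", {"n"}),
    ("b", {"y"}),
]
scores = score_words(docs, "a")
```

Note that this plain statistic is symmetric: a word that is unusually rare in the target category scores as highly as one that is unusually frequent, so it cannot by itself single out words with a high appearance probability in a particular category.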

By: Hironori Takeuchi

Published in: 7th Pacific Rim International Conference on Artificial Intelligence, PRICAI-02 (LNAI), Berlin, Springer, 2002

