In this paper, we propose a new measure for extracting topic words from categorized document data sets. The measure is based on the partitioning chi-squared statistic. When a document collection contains noisy words whose appearance probabilities vary widely across categories, conventional measures fail to extract the topic words that have a high appearance probability in one particular category. A simulation study shows that the new measure robustly extracts topic words from such a noisy document collection. We also apply the measure to real data for problem detection.
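For orientation, the following is a minimal sketch of a conventional chi-squared score for a word against one category, computed from a 2x2 contingency table of document counts. It illustrates the kind of baseline measure the abstract refers to, not the partitioned measure proposed in the paper. The function names `chi_squared_2x2` and `score_words`, the input layout, and the sample counts are all assumptions made for illustration.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table of document counts.

    a: documents in the target category that contain the word
    b: documents in other categories that contain the word
    c: documents in the target category that lack the word
    d: documents in other categories that lack the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column total is zero, so the statistic is undefined;
        # this sketch treats that case as "no association".
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_words(doc_counts, category_sizes, category):
    """Score every word against one category (hypothetical helper).

    doc_counts: {word: {category: number of documents containing the word}}
    category_sizes: {category: total number of documents in the category}
    """
    n_cat = category_sizes[category]
    n_other = sum(category_sizes.values()) - n_cat
    scores = {}
    for word, per_cat in doc_counts.items():
        a = per_cat.get(category, 0)
        b = sum(per_cat.values()) - a
        scores[word] = chi_squared_2x2(a, b, n_cat - a, n_other - b)
    return scores


# Made-up counts: "crash" is concentrated in category A, "the" is not.
doc_counts = {"crash": {"A": 8, "B": 1}, "the": {"A": 9, "B": 9}}
category_sizes = {"A": 10, "B": 10}
scores = score_words(doc_counts, category_sizes, "A")
```

Note that this statistic is symmetric: a word that is strongly under-represented in the target category scores as high as one that is over-represented, so a conventional ranking by this score alone does not separate the two cases.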
By: Hironori Takeuchi
Published in: 7th Pacific Rim International Conference on Artificial Intelligence, PRICAI-02 (LNAI), Berlin, Springer, 2002