Top-N Keyword Calculation on Dynamically Selected Documents

Interactivity is one of the most important requirements of text analysis, since analysts want to examine many different sets of documents by changing the search criteria. In this paper, we propose a fast algorithm for calculating the top-N most frequent keywords in dynamically selected documents. The key features of our method are as follows: (1) it uses an optimal index structure specifically designed to minimize access to infrequent keywords that do not appear in the result, and (2) it uses cost estimation to predict cases where this index does not work efficiently, and switches to another kind of index in those cases. A performance study on textual data shows that our proposed method outperforms a naive algorithm implemented using a relational database by a factor of five to ten.
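
To make the problem concrete, the following minimal Python sketch contrasts a naive count over every selected document with a scan that stops early. It is an illustration under stated assumptions, not the algorithm from the report: the function names (`top_n_naive`, `build_postings`, `top_n_pruned`) are hypothetical, "frequency" is assumed to mean the number of selected documents containing a keyword, and the early-exit rule (visit keywords in descending global document frequency and stop once no remaining keyword can reach the current N-th best count) is a generic pruning technique chosen only to show how access to infrequent keywords might be avoided.

```python
import heapq
from collections import Counter


def top_n_naive(doc_keywords, selected, n):
    """Baseline: count keywords in every selected document, then rank."""
    counts = Counter()
    for doc_id in selected:
        # Assumption: a keyword counts once per document that contains it.
        counts.update(set(doc_keywords[doc_id]))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def build_postings(doc_keywords):
    """Hypothetical index: (keyword, doc-id set) pairs, most frequent first."""
    postings = {}
    for doc_id, keywords in doc_keywords.items():
        for kw in set(keywords):
            postings.setdefault(kw, set()).add(doc_id)
    return sorted(postings.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def top_n_pruned(postings, selected, n):
    """Scan keywords by descending global frequency and stop early.

    A keyword's count within the selection can never exceed its global
    document frequency, so once that frequency drops below the current
    N-th best count, no later keyword can enter the result.
    """
    selected = set(selected)
    candidates = {}   # keyword -> count within the selection
    top_counts = []   # min-heap holding the N best counts seen so far
    for kw, docs in postings:
        if len(top_counts) == n and len(docs) < top_counts[0]:
            break  # every remaining keyword is too infrequent to matter
        count = len(docs & selected)
        if count == 0:
            continue
        candidates[kw] = count
        heapq.heappush(top_counts, count)
        if len(top_counts) > n:
            heapq.heappop(top_counts)
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Both functions return the same ranking; the second simply avoids touching the long tail of rare keywords when the selection is large enough for frequent keywords to dominate. When the selection is small, the early exit rarely triggers, which is the kind of case where switching to a different index, as the abstract describes, could pay off.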

By: Daisuke Takuma, Issei Yoshida

Published in: RT0760 in 2007

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.

Questions about this service can be mailed to reports@us.ibm.com .