Combinatorial Analysis for Efficient Distributed Top-K Keyword Aggregation

The Top-k Keyword Aggregation is a kind of Top-k query processing that finds the k most frequent keywords in a document set, where the documents are dynamically selected through keyword searches.Top-k is often used in exploratory text analysis, where the response time for queries is especially crucial in interactive analyses that are seeking higher-level knowledge within a large document collection.
In this paper we show that, in a distributed environment where the central master node aggregates the keywords sent from the worker nodes, we can reduce the problem of optimization of the number of keywords each worker node has to return to the master node to a problem of pure combinatorics.
To estimate the number of keywords that each worker returns, our combinatorial analysis carefully defines the probability that we obtain the correct answers.
There is a reasonable trade-off between the number of keywords and the probability. With high probabilities, say 90% or 95%, we can significantly reduce the number of keywords processed by all the worker nodes.Even when the answer is not correct, our method assures that the end user can know which of the returned k keywords in the answer are actually correct and which of them may be spurious, and the number of the latter is actually very small compared to that of the former.
We describe the column-based keyword partitioning approach to implement our algorithm in distributed Top-k.
Experimental evaluations on a large archive of medical documents confirmed the results of our theoretical analysis.

By: Issei Yoshida, Yuya Unno, Yuta Tsuboi

Published in: in 2012

RT0938.pdf

Hard Copy:

Published in: Research Report

Input by: Issei Yoshida
Input Date: 2012/02/23

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RT0938.pdf

Questions about this service can be mailed to reports@us.ibm.com .