A Frequency-aware Parallel Algorithm for Counting Stream Items on Multicore Processors

We present a parallel counting algorithm for estimating frequencies of items from data streams where a stream is ingested and queried in parallel by partitioning data over multiple processing cores of a multi-core processor. We demonstrate that current probabilistic counting algorithms are ill-suited to be used as the sequential kernel in the parallel algorithm due to space limitations, inaccuracies in approximating counts of low frequency items, and inability to identify the absent items in a stream. To address these concerns, we have devised a new sequential counting algorithm, called the Frequency-aware Counting Algorithm (FCM). FCM is related to the Cormode-Muthukrishnan Count-Min algorithm and incorporates novel capabilities for improving accuracy in estimating low-frequency items by dynamically capturing frequency changes of items in the stream, and using a variable number of hash functions per item as determined by an item’s current frequency. FCM also uses an auxiliary space-efficient data structure to reduce the errors due to absent items.

We have implemented the parallel counting algorithm on the multi-core Cell processor using the FCM algorithm as the sequential kernel. We experimentally and analytically demonstrate that with similar space consumption, FCM computes better frequency estimates of both the low- and high-frequency items than the Count-Min algorithm by reducing collisions with high-frequency items. FCM also significantly reduces the errors in identifying absent items. In the parallel scenario, as the number of processing cores is increased, using a hash-based data partitioning approach, our parallel algorithm is able to scale the overall performance linearly as well as improve the estimate accuracy.

By: Dina Thomas; Rajesh Bordawekar; Charu Aggarwal; Philip S. Yu

Published in: RC24207 in 2007

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc24207.pdf

Questions about this service can be mailed to reports@us.ibm.com .