Towards “Intelligent Compression” in Streams: A Biased Reservoir Sampling based Bloom Filter Construction

With the explosion of information stored world-wide, data intensive computing
has become a central area of research. Efficient management and processing of this
massively exponential amount of data from diverse sources, such as telecommunication
call data records, telescope imagery, online transaction records, web pages,
stock markets, medical records (monitoring critical health conditions of patients),
climate warning systems, etc., has become a necessity. Removing redundancy
from such huge (multi-billion records) datasets resulting in resource and compute
efficiency for downstream processing constitutes an important area of study. “Intelligent
compression” or deduplication in streaming scenarios, for precise identification
and elimination of duplicates from the unbounded data stream is a greater
challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF)
address this problem to a certain extent. However, SBF suffers from a high false
negative rate (FNR) and slow convergence rate, thereby rendering it inefficient for
applications with low FNR tolerance.
In this paper, we present a novel Reservoir Sampling based Bloom Filter,
(RSBF) data structure, based on the combined concepts of reservoir sampling
and Bloom filters for approximate detection of duplicates in data streams. Using
detailed theoretical analysis we prove analytical bounds on its false positive rate
(FPR), false negative rate (FNR) and convergence rates. We show that our FN
and convergence rate are better than those of SBF. Using empirical analysis on
real-world datasets (3 million records) and synthetic datasets with around 1 billion
records, we demonstrate upto 2 improvement in FNR with better convergence
rates as compared to SBF, while exhibiting comparable FPR. To the best of our
knowledge, this is the first attempt to integrate reservoir sampling method with
Bloom filters for deduplication in streaming scenarios.

By: Sourav Dutta, Souvik Bhattacherjee and Ankur Narang

Published in: RI11015 in 2011

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

tech_report.pdf

Questions about this service can be mailed to reports@us.ibm.com .