FPVI: An Efficient Method for Discovering Privacy Vulnerabilities in Datasets

Analyzing datasets to discover privacy vulnerabilities is an important step in the privacy-preserving data publishing process and an area of increasing interest for commercial data masking products. In this paper we propose FPVI, a fast algorithm for discovering privacy vulnerabilities in datasets, in the form of combinations of attribute values that match only a few records. FPVI operates in a multi-threaded fashion to efficiently index the data and scan different attribute combinations in parallel, while pruning the search space to limit the (exponential) number of attribute combinations that need to be searched for uniques. Our algorithm fully utilizes the execution environment, supporting hardware configurations ranging from commodity machines to multi-CPU, multi-core nodes in cluster environments. In an experimental evaluation on a large number of real-world datasets, FPVI significantly outperforms the state of the art, to the extent that we had to design multi-threaded versions of the state-of-the-art method to form the baseline for our experiments. Scalability measurements indicate that our method can analyze microdata consisting of 11 million records and 20 attributes in less than 9 minutes.
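To make the problem statement concrete, the following is a minimal single-threaded sketch of the general idea of searching attribute combinations for value combinations shared by fewer than k records. It is not the FPVI algorithm: the function name `find_vulnerabilities`, the threshold parameter `k`, the record layout (a list of dictionaries), and the specific pruning rule (skipping supersets of attribute sets already found vulnerable) are all assumptions made for illustration, and the indexing and parallel scanning described in the abstract are omitted.

```python
from collections import Counter
from itertools import combinations


def find_vulnerabilities(records, attributes, k=2, max_size=None):
    """Return minimal attribute sets with a value combination held by < k records.

    Hypothetical helper for illustration only; not the authors' method.
    Each result is a pair (attribute_tuple, {value_tuple: record_count}).
    """
    max_size = max_size or len(attributes)
    minimal = []
    # Level-wise search: all single attributes first, then pairs, and so on.
    for size in range(1, max_size + 1):
        for combo in combinations(attributes, size):
            # Assumed pruning rule: adding attributes can only split groups
            # further, so any superset of a vulnerable set is also vulnerable
            # and does not need to be scanned.
            if any(set(found) <= set(combo) for found, _ in minimal):
                continue
            # Count how many records share each combination of values.
            counts = Counter(tuple(rec[a] for a in combo) for rec in records)
            rare = {values: n for values, n in counts.items() if n < k}
            if rare:
                minimal.append((combo, rare))
    return minimal


# Toy microdata (invented for this example).
records = [
    {"zip": "10001", "age": 30, "sex": "F"},
    {"zip": "10001", "age": 30, "sex": "M"},
    {"zip": "10001", "age": 40, "sex": "F"},
    {"zip": "10001", "age": 40, "sex": "M"},
    {"zip": "10002", "age": 30, "sex": "F"},
    {"zip": "10002", "age": 30, "sex": "F"},
]

result = find_vulnerabilities(records, ["zip", "age", "sex"], k=2)
```

In this toy dataset no single attribute isolates a record, but the pair (age, sex) does, so the three-attribute combination is never scanned. With m attributes there are 2^m - 1 non-empty combinations, which is why some form of pruning matters as the attribute count grows.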

By: Aris Gkoulalas-Divanis, Stefano Braghin

Published in: RC25544 in 2015


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).


Questions about this service can be mailed to reports@us.ibm.com.