Privacy preserving data mining: a signal processing perspective and a simple data perturbation protocol

In recent years, there have been privacy concerns over the proliferation of gathering of personal information by various institutions and merchants over the internet. This has led to the development of data mining algorithms that preserve the privacy of those whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was recently proposed where the individual data in a large data set is perturbed by adding a random value from a known distribution. This perturbation is performed by the user so that the true value of the data is not known to the data mining algorithm. In these applications, the distribution of the original data set is important and estimating it is one of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithm. An algorithm based on the Expectation Maximization (EM) algorithm was subsequently shown to have desirable properties such as the ability to have low privacy loss and high fidelity estimates of
the distribution of the data set. Each iteration of EM requires computation which is proportional to the size of the data set and to the number of points in the estimate. This can require large computation time to estimate the distribution. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem is equivalent to a deconvolution problem and signal processing algorithms can be applied to solve this problem. In particular we consider both a direct deconvolution method which estimates the Fourier coefficients directly and iterative deconvolution methods which are more robust against noise and ill-conditioning. We show that the well-known Richardson-Lucy deblurring algorithm is equivalent to EM after
quantization. The signal processing approach also shows how the choice of perturbation affect information loss and privacy loss.
Second, we propose another scheme for perturbing data which also has the nice properties of allowing arbitrarily small privacy loss and arbitrarily high delity in the estimate (i.e. zero information loss). The main advantage of this proposed scheme is the simplicity of the estimation algorithm. In contrast to iterative algorithms such as EM, the proposed scheme admits an algorithm which estimates the unknown distribution in one step. This is significant in applications where the data set is very large or when the data mining algorithm is run in an online environment.

By: Chai Wah Wu

Published in: RC22815 in 2003


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.


Questions about this service can be mailed to .