A simple method of data perturbation and efficient algorithms for density estimation in privacy preserving data mining

In recent years, there have been privacy concerns over the proliferation of gathering of personal information by various institutions and merchants over the internet. This has led to the development of data mining algorithms which preserve the privacy of the people whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was recently proposed where the individual data in a large data set is perturbed by adding a random value from a known distribution. This perturbation is performed by the user so that the true value of the data is not known to the data mining algorithm. In these applications, the distribution of the original data set is important and estimating it is part of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithm. An algorithm based on the Expectation Maximization (EM) algorithm was subsequently shown to have desirable properties such as the ability to have low privacy loss and high fidelity estimates of the distribution of the data set. Both these algorithms are iterative in nature and each literation requires computation which is proportional to the size of the data set and to the number of points in the estimate. This can require large computation time to estimate the distribution. In this paper we propose two methods to reduce the amount of computation. The first method constructs in one step an initial estimate of the distribution to aid the iterative algorithm in order to reduce the number of iterations. In the second method, we propose another scheme for perturbing data which also has the nice properties of allowing arbitrarily small privacy loss and arbitrarily high fidelity in the estimate (i.e. zero information loss). The main advantage of this proposed scheme is the simplicity of the estimation algorithm. In contrast to fiterative algorithms such as EM, the proposed scheme admits an algorithm which estimates the unknown distribution in one step. This is significant in applications where the data set is very large or
when the data mining algorithm is run in an online environment.

By: Chai Wah Wu

Published in: RC22727 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RC22727.pdf

Questions about this service can be mailed to reports@us.ibm.com .