Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6883805 | Computers & Security | 2018 | 45 Pages |
Abstract
Clustering categorical data is an important data mining task with rich applications. As an extension of k-means applied to categorical data, the k-modes algorithm became a popular clustering tool due to its simplicity and efficiency. A lot of improvements for k-modes such as better initialization techniques or more effective dissimilarity scores have been proposed recently. However, the problem of running the k-modes in private manners is rarely considered. In this paper, we address the privacy-preserving k-modes problem using differential privacy, a formal and rigorous definition of privacy for data publication. Differential privacy guarantees that the existence of any item in the input dataset is indistinguishable by looking at the computation output. We analyze the challenges of differentially private k-modes with regard to the k-means counterpart. Then we propose several schemes in both interactive and non-interactive settings. We prove that our mechanisms satisfy differential privacy and run linearly in the number of data points. Evaluation over fifteen real datasets shows that we can achieve useful privacy-preserving clustering outputs. In terms of clustering cost, the interactive approaches perform better than the non-interactive schemes and the solution adapted from k-means.
Keywords
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Networks and Communications
Authors
Huu Hiep Nguyen,