Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop

Article ID	Journal	Published Year	Pages	File Type
4950389	Future Generation Computer Systems	2017	16 Pages	PDF

Abstract

Data privacy is a stringent requirement when sharing and processing data in a distributed environment or in the Internet of Things. Collaborative privacy-preserving data mining based on secure multiparty computation incurs high communication and computational costs. Data anonymization is a promising technique in the field of privacy-preserving data mining, used to protect data against identity disclosure. Information loss and the common attacks possible on anonymized data are serious challenges for anonymization. Recently, data anonymization using data mining techniques has shown significant improvement in data utility. Still, the existing techniques do not handle attacks effectively. Hence, in this paper, an anonymization algorithm based on clustering and resilient to similarity attacks and probabilistic inference attacks is proposed. The anonymized data is distributed on the Hadoop Distributed File System. The method achieves a better trade-off between privacy and utility. In our work, data utility is measured in terms of accuracy and F-measure with respect to different classifiers. Experiments show that the accuracy, F-measure and execution time of the classification algorithms on the privacy-preserved data sets formed by the proposed clustering algorithms are better than those obtained with the existing algorithms.
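To make the general idea of anonymization by grouping concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes a single numeric quasi-identifier (age), sorts the records, forms groups of at least k records, and replaces each value with its group's range so that every released value is shared by at least k records. The function name `anonymize_ages` and the sample values are hypothetical, and the sketch does not address similarity or probabilistic inference attacks.

```python
from collections import Counter
from typing import List, Tuple


def anonymize_ages(ages: List[int], k: int) -> List[Tuple[int, int]]:
    """Replace each age with the (min, max) range of its group.

    Illustrative grouping only: records are sorted by age and cut into
    consecutive groups of k; a short final group is merged into the
    previous one so that every group holds at least k records.
    """
    if k < 1 or len(ages) < k:
        raise ValueError("need k >= 1 and at least k records")

    # Indices of the records, ordered by age.
    order = sorted(range(len(ages)), key=lambda i: ages[i])

    # Consecutive groups of size k over the sorted order.
    groups = [order[i:i + k] for i in range(0, len(order), k)]

    # Merge an undersized last group into its neighbour.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())

    result: List[Tuple[int, int]] = [(0, 0)] * len(ages)
    for group in groups:
        low = min(ages[i] for i in group)
        high = max(ages[i] for i in group)
        for i in group:
            result[i] = (low, high)  # generalized value for this record
    return result


# Hypothetical sample data, for illustration only.
sample_ages = [25, 47, 31, 52, 29, 60, 33]
generalized = anonymize_ages(sample_ages, k=3)
group_sizes = Counter(generalized)
```

Wider ranges give stronger protection but lose more information, which is the privacy and utility trade-off the abstract refers to.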

Keywords

Data privacy; Clustering distributed data; Data anonymization; Classification; Hadoop