The k-modes Algorithm with Entropy Based Similarity Coefficient

Article ID	Journal	Published Year	Pages	File Type
489838	Procedia Computer Science	2015	6 Pages	PDF

Abstract

Clustering is the process of organizing dataset into isolated groups such that data points in the same are more similar and data points of different groups are more dissimilar. The k-modes algorithm well known for its simplicity is a popular partitioning algorithm for clustering categorical data. In this paper, we discuss the limitations of distance function used in this algorithm with an illustrative example and then we propose a similarity coefficient based on Information Entropy. We analyze the time complexity of the k-modes algorithm with proposed similarity coefficient. The main advantage of this coefficient is that it improves the clustering accuracy while retaining scalability of the k-modes algorithm. We perform the scalability tests on synthetic datasets.