Article ID: 378713
Journal: Data & Knowledge Engineering
Published Year: 2014
Pages: 13
File Type: PDF
Abstract

• Revisits the sum-of-squares based WB-index
• Analyzes three sum-of-squares based indices
• Systematically compares 12 internal indices
• Employs three sum-of-squares based indices for automatic keyword categorization

Determining the number of clusters is an important part of cluster validity and has been widely studied in cluster analysis. Sum-of-squares based indices show promising properties for this task. However, knee point detection is often required because most indices change monotonically as the number of clusters increases. Indices with a clear minimum or maximum value are therefore preferred. The aim of this paper is to revisit a sum-of-squares based index called the WB-index, whose minimum value indicates the number of clusters. We shed light on the relation between the WB-index and two popular indices, the Calinski–Harabasz index and the Xu-index. According to a theoretical comparison, the Calinski–Harabasz index is affected by the data size and the level of data overlap. The Xu-index is theoretically close to the WB-index; however, it does not work well when the dimension of the data is greater than two. We also conduct a more thorough comparison of 12 internal indices and summarize their experimental performance. Furthermore, we introduce the sum-of-squares based indices into automatic keyword categorization, where the indices are specially defined for determining the number of clusters.
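The abstract does not state the index formulas, so the following is only a minimal sketch under assumed definitions. It takes the WB-index to be M · SSW / SSB and the Calinski–Harabasz index to be (SSB / (M − 1)) / (SSW / (N − M)), where M is the number of clusters, N the number of points, SSW the within-cluster sum of squares, and SSB the between-cluster sum of squares. The function names and the toy data are hypothetical and chosen for illustration.

```python
import numpy as np


def sum_of_squares(X, labels):
    """Return (SSW, SSB, M) for data X (N x D) and integer cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares
    clusters = np.unique(labels)
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        ssw += ((members - centroid) ** 2).sum()
        ssb += len(members) * ((centroid - overall_mean) ** 2).sum()
    return ssw, ssb, len(clusters)


def wb_index(X, labels):
    """Assumed definition: M * SSW / SSB (lower is better)."""
    ssw, ssb, m = sum_of_squares(X, labels)
    return m * ssw / ssb


def calinski_harabasz(X, labels):
    """Assumed definition: (SSB / (M - 1)) / (SSW / (N - M)) (higher is better)."""
    ssw, ssb, m = sum_of_squares(X, labels)
    n = len(X)
    return (ssb / (m - 1)) / (ssw / (n - m))


# Toy data: two well-separated pairs of points, labelled as two clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
print(wb_index(X, labels))
print(calinski_harabasz(X, labels))
```

In practice one would evaluate such an index on the partitions produced for a range of candidate cluster counts and pick the count at the minimum (WB-index) or maximum (Calinski–Harabasz); the clustering step itself is left out of this sketch.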
