On the number of groups in clustering

Article ID	Journal	Published Year	Pages	File Type
1152310	Statistics & Probability Letters	2011	11 Pages	PDF

Abstract

Clustering is the problem of partitioning data into a finite number k of homogeneous and separate groups, called clusters. A good choice of k is essential for building meaningful clusters. In this paper, this task is addressed from the point of view of model selection via penalization. We design an appropriate penalty shape and derive an associated oracle-type inequality. The method is illustrated on both simulated and real-life data sets.
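As an illustration only, choosing k by penalization can be sketched as minimizing the empirical k-means distortion plus a penalty that grows with k. The abstract does not state the penalty shape, so the form `c * sqrt(k / n)` and the constant `c` below are assumptions made for this sketch, not the paper's calibrated penalty. The helper names (`farthest_first_centres`, `kmeans_distortion`, `select_k`) are hypothetical, and the deterministic farthest-first initialization is a convenience swapped in here so the example is reproducible.

```python
import math


def farthest_first_centres(points, k):
    """Deterministic initialisation: start from the first point, then
    repeatedly add the point farthest from the centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    return centres


def kmeans_distortion(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the empirical distortion: the mean
    squared distance from each point to its nearest centre."""
    centres = farthest_first_centres(points, k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its previous centre).
        new_centres = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sum(
        min(math.dist(p, c) ** 2 for c in centres) for p in points
    ) / len(points)


def select_k(points, k_max, c=1.0):
    """Return the k in 1..k_max minimising distortion + penalty, together
    with all penalised scores. The penalty c * sqrt(k / n) is an ASSUMED
    shape used for illustration; c is a hypothetical tuning constant."""
    n = len(points)
    scores = {
        k: kmeans_distortion(points, k) + c * math.sqrt(k / n)
        for k in range(1, k_max + 1)
    }
    return min(scores, key=scores.get), scores
```

Here `points` is a list of coordinate tuples. In practice the constant `c` would have to be calibrated from the data, since a value that is too small lets the criterion keep splitting genuine clusters and a value that is too large merges them.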

Keywords

62H30; K-means clustering; Model selection; Number of clusters; Oracle inequality