kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning

Article ID	Journal	Published Year	Pages	File Type
10225731	Big Data Research	2018	32 Pages	PDF

Abstract

The majority of the clinical observation data stored in large-scale Electronic Health Record (EHR) research data networks are unlabeled. Unsupervised clustering can provide invaluable tools for studying patient sub-groups in these data. Many popular unsupervised clustering algorithms depend on the number of clusters being identified beforehand. Multiple statistical methods are available to approximate the number of clusters in a dataset. However, the available methods are computationally inefficient when applied to large amounts of data. Scalable analytical procedures are needed to extract knowledge from large clinical datasets. Using simulated, clinical, and public data, we developed and tested the kluster procedure for approximating the number of clusters in a large clinical dataset. The kluster procedure iteratively applies four statistical cluster number approximation methods to small subsets of data drawn randomly with replacement, and recommends the most frequent and the mean number of clusters resulting from the iterations as the potential optimum number of clusters. Our results showed that kluster's most-frequent product, obtained by iteratively applying a model-based clustering strategy using the Bayesian Information Criterion (BIC) to samples of 200-500 data points over 100 iterations, offers a reliable and scalable solution for approximating the number of clusters in unsupervised clustering. We provide the kluster procedure as an R package.
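
The resampling loop described in the abstract can be illustrated with a minimal Python sketch. This is not the kluster R package or its API: the names `kluster_sketch`, `estimate_k`, and `count_gaps` are hypothetical, and the gap-counting estimator is a toy stand-in for one-dimensional data, used here in place of the BIC-based model selection so that the example stays self-contained. Only the overall shape (draw small samples with replacement, estimate the cluster count on each, report the most frequent and the mean estimate) follows the text.

```python
import random
from collections import Counter
from statistics import mean


def kluster_sketch(data, estimate_k, sample_size=200, iterations=100, seed=None):
    """Resample `data`, estimate the cluster count on each sample, and
    summarise the estimates by their most frequent value and their mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        # Draw a small subset at random, with replacement.
        sample = rng.choices(data, k=sample_size)
        estimates.append(estimate_k(sample))
    most_frequent = Counter(estimates).most_common(1)[0][0]
    return {"most_frequent": most_frequent, "mean": mean(estimates)}


def count_gaps(sample, min_gap=5.0):
    """Toy stand-in estimator (assumption, not BIC): for 1-D data,
    count one cluster plus one more for every gap wider than `min_gap`."""
    ordered = sorted(sample)
    gaps = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > min_gap)
    return gaps + 1


# Simulated 1-D data: three well-separated groups of 1000 points each.
data_rng = random.Random(0)
data = [data_rng.gauss(center, 1.0)
        for center in (0.0, 20.0, 40.0)
        for _ in range(1000)]

result = kluster_sketch(data, count_gaps, sample_size=200, iterations=100, seed=1)
print(result)
```

Because each iteration touches only `sample_size` points, the cost of the loop depends on the sample size and the number of iterations rather than on the size of the full dataset, which is the source of the scalability the abstract claims. A real estimator, such as a model-based method selected by BIC, would be passed in place of `count_gaps`.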

Keywords

Unsupervised clustering; Healthcare data; Big Data