Article ID | Journal | Published Year | Pages | File Type
--- | --- | --- | --- | ---
4969605 | Pattern Recognition | 2017 | 15 |
Abstract
With the exponential growth of text documents on the web, there is a genuine need for techniques that simultaneously organize terms and documents into meaningful clusters, thereby making large datasets easier to handle and interpret. Several block-diagonal clustering algorithms have proven successful in identifying co-clusters of documents and words. However, despite their effectiveness, most of the existing methods do not provide a parameterizable model for tackling the problem of block-diagonal identification. In this paper, we rely on mixture models, which offer strong theoretical foundations and considerable flexibility. More precisely, we propose a parsimonious latent block model based on the mixture of Poisson distributions and tailored for sparse, high-dimensional data such as document-term matrices. In order to efficiently estimate the model parameters, we derive two scalable co-clustering algorithms based on variational inference. Empirical results obtained on several real-world text datasets highlight the advantages of the proposed model and the corresponding co-clustering algorithms.
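To give a rough sense of the kind of model the abstract refers to, the sketch below fits a plain (non-parsimonious) Poisson latent block model to a dense document-term count matrix using a hard, classification-EM-style alternating update rather than the paper's variational algorithms. The function name `poisson_lbm_coclustering`, the initialization, and all parameter choices are illustrative assumptions, not the authors' method.

```python
import numpy as np

def poisson_lbm_coclustering(X, n_row_clusters, n_col_clusters,
                             n_iter=30, seed=0):
    """Minimal sketch: co-cluster a count matrix with a plain Poisson
    latent block model, fitted by hard alternating assignments.
    NOTE: illustrative only; this is not the parsimonious model or the
    variational algorithms proposed in the paper."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(n_row_clusters, size=n)   # row (document) cluster labels
    w = rng.integers(n_col_clusters, size=d)   # column (term) cluster labels

    for _ in range(n_iter):
        # M step: block rate gamma[k, l] = mean count inside block (k, l)
        R = np.eye(n_row_clusters)[z]          # n x g one-hot row memberships
        C = np.eye(n_col_clusters)[w]          # d x m one-hot column memberships
        block_sums = R.T @ X @ C
        block_sizes = np.outer(R.sum(0), C.sum(0)) + 1e-12
        gamma = block_sums / block_sizes + 1e-12

        # Row step: Poisson log-likelihood of each row under each row cluster
        XC = X @ C                             # n x m counts per column cluster
        row_ll = XC @ np.log(gamma).T - C.sum(0) @ gamma.T
        z = row_ll.argmax(axis=1)

        # Column step, symmetrically
        R = np.eye(n_row_clusters)[z]
        XtR = X.T @ R                          # d x g counts per row cluster
        col_ll = XtR @ np.log(gamma) - R.sum(0) @ gamma
        w = col_ll.argmax(axis=1)

    return z, w
```

For instance, calling `poisson_lbm_coclustering(X, n_row_clusters=3, n_col_clusters=4)` on a synthetic matrix `X = np.random.poisson(1.0, size=(200, 300))` returns a row-cluster label per document and a column-cluster label per term; reordering the matrix by these labels exposes the block structure that block-diagonal co-clustering methods aim to recover.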
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Vision and Pattern Recognition
Authors
Melissa Ailem, François Role, Mohamed Nadif