Clustering binary cube dimensions to compute relaxed GROUP BY aggregations

Article ID	Journal	Published Year	Pages	File Type
396796	Information Systems	2015	19 Pages	PDF

Abstract

Computing cube aggregations on large data sets with high dimensionality is a crucial and costly task that normally requires multiple passes over the input table. The task becomes harder as the number of result groups grows with the number of combinations of dimension values. In this research, we focus on reducing the number of aggregations and providing a more succinct result. We derive aggregations on top of groups of similar records by exploiting an efficient binary clustering of the fact table, which can be viewed as a relaxation of traditional OLAP cubes. We present an efficient window-based incremental K-Means algorithm implemented in a DBMS as a user-defined function. A significant speedup is achieved through sufficient statistics, multithreading, efficient distance computation and sparse matrix operations. We experimentally compare the performance of our algorithm against multiple variants of the K-Means algorithm, and show that our incremental K-Means algorithm achieves similar or better results much faster than the traditional K-Means algorithm. Moreover, we show that interesting aggregations can be efficiently obtained using the cluster identifier as a new cube dimension.
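As a rough illustration of the relaxed GROUP BY idea, the Python sketch below clusters binary dimension vectors with a plain batch K-Means and then aggregates a measure by cluster identifier. It is not the authors' window-based incremental algorithm or their DBMS user-defined function. All function names, the toy data, and the choice to seed centroids from the first k rows are assumptions made for illustration only. The per-cluster counts and linear sums stand in for the sufficient statistics mentioned in the abstract.

```python
from collections import defaultdict


def sq_dist(x, c):
    """Squared Euclidean distance between a binary row and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_binary(rows, k, iters=10):
    """Batch K-Means on binary vectors (hypothetical helper, not the paper's code).

    Centroids are recomputed from per-cluster sufficient statistics:
    a row count and a per-dimension linear sum.
    """
    d = len(rows[0])
    # Assumption: seed centroids with the first k rows (deterministic).
    centroids = [[float(v) for v in r] for r in rows[:k]]
    assign = [0] * len(rows)
    for _ in range(iters):
        counts = [0] * k
        sums = [[0.0] * d for _ in range(k)]
        for i, x in enumerate(rows):
            # Assign the row to its nearest centroid.
            j = min(range(k), key=lambda c: sq_dist(x, centroids[c]))
            assign[i] = j
            counts[j] += 1
            for h, v in enumerate(x):
                sums[j][h] += v
        for j in range(k):
            if counts[j]:  # keep the old centroid if a cluster is empty
                centroids[j] = [s / counts[j] for s in sums[j]]
    return assign, centroids


def aggregate_by_cluster(assign, measures):
    """Relaxed GROUP BY: COUNT(*) and SUM(measure) per cluster identifier."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for j, m in zip(assign, measures):
        counts[j] += 1
        totals[j] += m
    return {j: (counts[j], totals[j]) for j in sorted(counts)}


# Toy fact table: four binary dimensions and one numeric measure per row.
rows = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 0), (0, 0, 0, 1)]
measures = [10, 5, 20, 7]

assign, centroids = kmeans_binary(rows, k=2)
result = aggregate_by_cluster(assign, measures)
print(assign)   # cluster identifier per row
print(result)   # {cluster_id: (count, sum)}
```

A traditional GROUP BY on the four binary dimensions would return four groups here, one per distinct row. Grouping by cluster identifier returns two, which is the kind of more succinct result the abstract describes.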

Keywords

OLAP; Clustering