Article ID: 396796
Journal: Information Systems
Published Year: 2015
Pages: 19
File Type: PDF
Abstract

Computing cube aggregations on large, high-dimensional data sets is a crucial but costly task that normally requires multiple passes over the input table. The task becomes harder as the number of result groups grows with the number of combinations of dimension values. In this research, we focus on reducing the number of aggregations and providing a more succinct result by deriving aggregations over groups of similar records, exploiting an efficient binary clustering of the fact table, an approach that can be viewed as a relaxation of traditional OLAP cubes. We present an efficient window-based incremental K-Means algorithm implemented in a DBMS as a user-defined function. A significant speedup is achieved through sufficient statistics, multithreading, efficient distance computation, and sparse matrix operations. We experimentally compare the performance of our algorithm against multiple variants of the K-Means algorithm and show that our incremental K-Means algorithm achieves similar or better results much faster than the traditional K-Means algorithm. Moreover, we show that interesting aggregations can be efficiently obtained by using the cluster identifier as a new cube dimension.
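
The two ideas named in the abstract, sufficient statistics and the cluster identifier as a grouping dimension, can be illustrated with a minimal Python sketch. This is not the authors' window-based, in-DBMS algorithm: it is a simple one-pass online K-Means, and the function names (`incremental_kmeans`, `aggregate_by_cluster`), the toy rows, and the choice to count each seed as one point are all assumptions made for illustration. Each cluster keeps only a count N_j and a per-dimension linear sum L_j, so the centroid L_j / N_j can be updated without revisiting earlier rows.

```python
from collections import defaultdict


def sq_dist(x, c):
    """Squared Euclidean distance between a row and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def incremental_kmeans(rows, seeds):
    """One pass over binary rows, updating per-cluster sufficient statistics.

    Assumption for this sketch: each seed counts as one point in its cluster.
    """
    k = len(seeds)
    n = [1] * k                     # N_j: number of points in cluster j
    lin = [list(s) for s in seeds]  # L_j: per-dimension sum of points in cluster j
    labels = []
    for x in rows:
        # Centroids are derived from the statistics: C_j = L_j / N_j.
        cents = [[v / n[j] for v in lin[j]] for j in range(k)]
        j = min(range(k), key=lambda i: sq_dist(x, cents[i]))
        n[j] += 1
        lin[j] = [a + b for a, b in zip(lin[j], x)]
        labels.append(j)
    centroids = [[v / n[j] for v in lin[j]] for j in range(k)]
    return labels, centroids


def aggregate_by_cluster(labels, measures):
    """Sum a measure per cluster id, as if the id were an extra dimension."""
    totals = defaultdict(float)
    for j, m in zip(labels, measures):
        totals[j] += m
    return dict(totals)


# Hypothetical toy fact table: four binary dimensions and one measure per row.
rows = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 0, 0)]
measures = [10, 20, 30, 40, 50]
seeds = [(1, 1, 0, 0), (0, 0, 1, 1)]

labels, centroids = incremental_kmeans(rows, seeds)
totals = aggregate_by_cluster(labels, measures)
```

Grouping by the cluster label here plays the role of a `GROUP BY` on a new column: five rows collapse into two groups of similar records instead of one group per distinct combination of dimension values.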
