Efficient distributed skyline computation using dependency-based data partitioning

Article ID	Journal	Published Year	Pages	File Type
461063	Journal of Systems and Software	2014	15 Pages	PDF

Abstract

• Quick response time and progressiveness are important for distributed skylining.
• Data partitioning techniques enable batch pruning and parallel query processing.
• Data partitioning using dependencies increases the parallelism of query processing.
• Ordering partitions by dependencies brings in progressiveness and effective pruning.

Skyline queries, together with other advanced query operators, are essential for identifying interesting data points buried within the huge amounts of data readily available today. A skyline query retrieves the set of non-dominated data points in a multi-dimensional dataset. As computing infrastructures become increasingly pervasive and are connected by readily available network services, data storage and management have inevitably become more distributed. In such distributed environments, designing efficient skyline query processing with quick response times and progressive return of answers poses new challenges. To address them, we propose a novel skyline query scheme termed MpSky. MpSky is based on a novel space partitioning scheme that exploits the dependency relationships among data points on different servers. By grouping the points of each server according to these dependencies, we can qualify a skyline point by comparing it only with data on dependent servers, and we can parallelize the skyline computation among non-dependent partitions, whether they come from different servers or from an individual server. By controlling query propagation among partitions, we can generate skyline results progressively and prune partitions and points efficiently. Analytical results and extensive simulations show the effectiveness of the proposed scheme.
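
To make the notion of a non-dominated point concrete, the following minimal Python sketch computes a skyline by pairwise comparison and then merges per-server local skylines. It is a generic baseline for illustration, not the MpSky scheme: it assumes smaller values are preferred in every dimension, and the names `dominates`, `skyline`, `merge_local_skylines`, and the sample `hotels` data (price, distance) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is no worse than q in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def merge_local_skylines(servers: List[List[Point]]) -> List[Point]:
    """Generic distributed baseline (not MpSky): each server computes its
    local skyline, and the coordinator filters the union of the candidates."""
    candidates = [p for data in servers for p in skyline(data)]
    return skyline(candidates)


# Hypothetical sample data: (price, distance) pairs, lower is better in both.
hotels = [(50.0, 8.0), (80.0, 2.0), (60.0, 9.0), (90.0, 1.0), (85.0, 3.0)]

central = skyline(hotels)
distributed = merge_local_skylines([hotels[:2], hotels[2:]])
print(central)
print(sorted(distributed) == sorted(central))
```

In this baseline every local candidate is shipped to a coordinator and compared against all others. The abstract's point is that dependency relationships among partitions can restrict these comparisons to dependent servers and let non-dependent partitions proceed in parallel.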

Keywords

Skyline query; Distributed systems; Data partitioning