Efficiently processing (p,ε)(p,ε)-approximate join aggregation on massive data

Article ID	Journal	Published Year	Pages	File Type
393699	Information Sciences	2014	20 Pages	PDF

Abstract

Join aggregation is an important operation in database systems to return aggregate information on the join of two or several tables. Compared with exact query, it is a better choice in many cases to return approximate result satisfying a user-specified confidence interval in a much faster response time. It is found that none of previous works can efficiently process approximate join aggregation on massive data with arbitrary accuracy. This paper proposes a novel algorithm pεpε-AJA ((p,ε)(p,ε)-Approximate Join Aggregation) to obtain approximate join aggregate result with arbitrary confidence interval efficiently. Two data structures of low space overhead, JRS and JPIPT, are presented in this paper. pεpε-AJA first makes use of JRS to return a quick response. If the approximate result computed by JRS does not satisfy the given confidence interval, JPIPT is exploited to obtain enough random join tuples. This paper presents a novel sampling algorithm to acquire random JPIPT tuples of specified size and devises its correctness proof. A tuple fetching method is proposed to retrieve join tuples by the sampled JPIPT tuples in one-pass sequential scan on joined tables. The construction and maintenance algorithms of JPIPT and JRS are provided also in this paper. The experimental results show that pεpε-AJA obtains 3 times to 2 orders of magnitude speedup over the existing algorithms and runs 1 to 4 orders of magnitude faster than exact query.

Keywords

JRS Massive data