Article ID Journal Published Year Pages File Type
437687 Theoretical Computer Science 2015 12 Pages PDF
Abstract

We study the problem of generating a large sample from a data stream SS of elements (i,v)(i,v), where i is a positive integer key, v is an integer equal to the count of key i  , and the sample consists of pairs (i,Ci)(i,Ci) for Ci=∑(i,v)∈SvCi=∑(i,v)∈Sv. We consider strict turnstile streams and general non-strict turnstile streams, in which CiCi may be negative. Our sample is useful for approximating both forward and inverse distribution statistics, within an additive error ϵ   and provable success probability 1−δ1−δ.Our sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the sampling problem. For example, for a sample of size O(ϵ−2log⁡(1/δ))O(ϵ−2log⁡(1/δ)) in non-strict streams, our solution requires O((log⁡log⁡(1/ϵ))2+(log⁡log⁡(1/δ))2)O((log⁡log⁡(1/ϵ))2+(log⁡log⁡(1/δ))2) operations per stream element, whereas the best previous solution requires O(ϵ−2log2⁡(1/δ))O(ϵ−2log2⁡(1/δ)) evaluations of a fully independent hash function per element.We achieve this improvement by constructing an efficient K-elements recovery structure from which K   elements can be extracted with probability 1−δ1−δ. Our structure enables our sampling algorithm to run on distributed systems and extract statistics on the difference between streams.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , ,