Article ID: 6853915
Journal: Data & Knowledge Engineering
Published Year: 2018
Pages: 15
File Type: PDF
Abstract
Extracting information from large volumes of data is expensive in terms of resources such as CPU and memory, as well as computation time. Analyzing a small data set extracted from the original one is therefore often preferable. From this smaller set, called a sample, approximate results can be obtained, and the resulting errors are acceptable given the reduced cost of processing the data. Using sampling algorithms that introduce only small errors saves execution time and resources. This paper compares sampling algorithms to determine which performs best for set operations such as intersection, union, and difference. The comparison focuses on the errors introduced by each algorithm for different sample sizes and on execution times.
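The abstract does not name the sampling algorithms that were compared. As an illustration only, the sketch below uses coordinated (hash-based) sampling, one standard way to estimate the sizes of set operations from samples: each element is kept or dropped by a deterministic hash, so the same element is sampled consistently in both sets and intersection, union, and difference sizes can be scaled back by the sampling rate. All names, the sample rate, and the synthetic data are assumptions made for this example, not details from the paper.

```python
import hashlib


def hash_keep(x, p, salt="demo"):
    """Keep element x with probability ~p, decided by a hash so that
    the same element is kept or dropped consistently across sets."""
    h = int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16)
    return (h % 10**6) < p * 10**6


def coordinated_sample(data, p):
    """Hash-based (coordinated) sample: roughly a fraction p of the set."""
    return {x for x in data if hash_keep(x, p)}


if __name__ == "__main__":
    # Two synthetic overlapping sets standing in for large relations.
    A = set(range(0, 60_000))
    B = set(range(40_000, 100_000))
    p = 0.05  # 5% sampling rate

    SA, SB = coordinated_sample(A, p), coordinated_sample(B, p)

    for name, exact, sampled in [
        ("union",        len(A | B), len(SA | SB)),
        ("intersection", len(A & B), len(SA & SB)),
        ("difference",   len(A - B), len(SA - SB)),
    ]:
        estimate = sampled / p          # scale the sampled result back up
        error = abs(estimate - exact) / exact * 100
        print(f"{name:13s} exact={exact:7d} estimate={estimate:10.0f} error={error:.2f}%")
```

Running the sketch prints the exact and estimated cardinalities of each set operation together with the relative error, which is the kind of error-versus-sample-size trade-off the paper's comparison examines.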
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence
Authors