ProTraS: A probabilistic traversing sampling algorithm

Article ID	Journal	Published Year	Pages	File Type
6854963	Expert Systems with Applications	2018	42 Pages	PDF

Abstract

In the process of knowledge discovery in big data, sampling is a technological brick that can be included in a more general framework to speed up existing algorithms and contribute to the scalability issue. Two challenging and connected problems arise with complexity: tuning and timing. ProTraS1 is a new algorithm that fulfills both requirements. It is driven by a unique parameter, the sampling cost. The cost is overestimated by the maximum within group distance and the group cardinality. It is an iterative algorithm, at each step a new representative is added, chosen as the farthest-first traversal item from the representative in the group with the highest probability of cost reduction. The novel algorithm is robust to noise and time optimized. A detailed comparison with alternative algorithms, conducted on various synthetic and real world data sets, shows that the proposal yields competitive results in terms of quality of representation for clustering, sampling size and sampling time.

Keywords

Clustering Distance Density