An empirical study on selective partitioning dimensions for partition-based similarity joins

Article ID	Journal	Published Year	Pages	File Type
379400	Data & Knowledge Engineering	2007	12 Pages	PDF

Abstract

Real-world application data are usually distributed sparsely and non-uniformly in a very large high-dimensional space. Hence, selecting effective partitioning dimensions is crucial for partition-based similarity joins. In this paper, we present and evaluate two data partitioning algorithms. PerDimSelect selects some dimension axes from the pool of original perpendicular dimension axes and maps each data point into the reduced-dimension space. DiaDimSelect creates a one-dimensional axis by combining some of the original perpendicular dimensions and maps each data point onto the newly created dimension. In the experiments, several measures are used to compare the performance of the algorithms, including CPU cost, total response time, and the number of buckets created. In conclusion, DiaDimSelect performs better than PerDimSelect because it creates far fewer partition buckets as the number of partitioning dimensions increases, which keeps the IO cost low while considerably decreasing the CPU cost.
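
The contrast between the two partitioning styles can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the bucket width `eps`, and the choice of "combining" dimensions by projecting onto their diagonal are all assumptions made for illustration. The abstract does not say how the partitioning dimensions are selected, so the sketch takes them as an input `dims`.

```python
import math
from collections import defaultdict


def per_dim_buckets(points, dims, eps):
    """Hypothetical perpendicular-axes partitioning.

    Keeps only the coordinates listed in `dims` and places each point in
    a grid cell of width `eps` in that reduced space.
    """
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(p[d] / eps) for d in dims)
        buckets[key].append(p)
    return buckets


def dia_dim_buckets(points, dims, eps):
    """Hypothetical single-axis partitioning.

    Assumption: the combined axis is the diagonal of the selected
    dimensions, so each point is projected onto the unit vector
    (1, ..., 1) / sqrt(len(dims)) and bucketed along that one axis.
    """
    buckets = defaultdict(list)
    scale = math.sqrt(len(dims))
    for p in points:
        projection = sum(p[d] for d in dims) / scale
        buckets[math.floor(projection / eps)].append(p)
    return buckets


if __name__ == "__main__":
    # Toy data, invented for this example only.
    points = [
        (0.05, 0.05, 0.90),
        (0.15, 0.85, 0.10),
        (0.85, 0.15, 0.50),
        (0.95, 0.95, 0.30),
    ]
    grid = per_dim_buckets(points, dims=[0, 1], eps=0.1)
    line = dia_dim_buckets(points, dims=[0, 1], eps=0.1)
    print("perpendicular-axes buckets:", len(grid))
    print("single-axis buckets:", len(line))
```

Under the diagonal-projection assumption, two points within Euclidean distance `eps` of each other project to positions at most `eps` apart, so a join would only need to compare a bucket with itself and its immediate neighbours. A grid over k selected axes can have a number of non-empty cells that grows with k, whereas the single axis always yields a one-dimensional sequence of buckets, which is consistent with the bucket-count behaviour the abstract reports.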