Article ID Journal Published Year Pages File Type
430579 Journal of Discrete Algorithms 2012 15 Pages PDF
Abstract

Distance-based indexing exploits only the triangle inequality to answer similarity queries in metric spaces. Lacking coordinate structure, mathematical tools in RnRn can only be applied indirectly, making it difficult to theoretically study metric-space indexing. Toward solving this problem, a common algorithmic step is to select a small number of special points, called pivots  , and map the data objects to a low-dimensional space, one dimension for each pivot, where each dimension represents the distances of a pivot to the data objects. We formalize a “pivot space model” where all the data objects are used as pivots such that data is mapped from metric space to RnRn, preserving all the pairwise distances under L∞L∞. With this model, it can be shown that the indexing problem in metric space can be equivalently studied in RnRn. Further, we show the necessity of dimension reduction for RnRn and that the only effective form of dimension reduction is to select existing dimensions, i.e. pivot selection. The coordinate structure of RnRn makes the application of many mathematical tools possible. In particular, Principle Component Analysis (PCA) is incorporated into a heuristic method for pivot selection and shown to be effective over a large range of workloads. We also show that PCA can be used to reliably measure the intrinsic dimension of a metric space.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , ,