Article ID Journal Published Year Pages File Type
6874162 Information Processing Letters 2018 10 Pages PDF
Abstract
In practice, however, users are more interested in the recall rate, i.e., the probability that a random query collides with its r-near neighbor over a fixed LSH data structure h_{k,ℓ}. Implicitly or explicitly, P_{k,ℓ}(r) is often misinterpreted as the recall rate and used to predict the performance of LSH. This is problematic because P_{k,ℓ}(r) is actually the expectation of the recall rate over random LSH data structures. Interestingly, numerous empirical studies show that, for most (if not all) real datasets and a fixed sample of a random LSH data structure, the recall rate is very close to P_{k,ℓ}(r). In this paper, we provide a theoretical justification for this phenomenon. We show that (1) for random datasets the recall rate is asymptotically equal to P_{k,ℓ}(r); and (2) for arbitrary datasets the variance of the recall rate is very small as long as the parameters k and ℓ are properly chosen and the dataset is large enough. Our analysis (1) explains why the practical performance of LSH (the recall rate) matches the theoretical expectation (P_{k,ℓ}(r)) so well; and (2) indicates that, in addition to the nice theoretical guarantee, the mechanism by which LSH data structures are constructed and the huge amount of data are also main causes of the success of LSH in practice.
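As a minimal sketch of the quantity discussed above: under the standard AND-OR LSH construction (ℓ tables, each concatenating k hash functions), if a single hash collides on an r-near pair with probability p, the expected collision probability is P_{k,ℓ} = 1 - (1 - p^k)^ℓ. The helper names below and the Monte Carlo setup are illustrative assumptions, not the paper's code; the simulation averages over freshly drawn hash functions, so it estimates the expectation P_{k,ℓ}, not the recall of one fixed structure.

```python
import random

def collision_prob(p, k, ell):
    # Expected collision probability for an r-near pair under AND-OR LSH:
    # P_{k,l} = 1 - (1 - p^k)^l, where p is the per-hash collision probability.
    return 1.0 - (1.0 - p ** k) ** ell

def simulate_recall(p, k, ell, trials=100_000, seed=0):
    # Monte Carlo over random LSH structures: in each trial a pair collides in
    # a table iff all k component hashes agree (each independently with prob. p),
    # and the pair is retrieved if it collides in at least one of the l tables.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(all(rng.random() < p for _ in range(k)) for _ in range(ell)):
            hits += 1
    return hits / trials
```

For example, with p = 0.5, k = 4, ℓ = 8, the closed-form value is about 0.403, and the simulated average agrees closely, illustrating how the empirical recall concentrates around its expectation.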
Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics