Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6874162 | Information Processing Letters | 2018 | 10 Pages |
Abstract
In practice, however, users are more interested in recall rate, i.e., the probability that a random query collides with its r-near neighbor over a fixed LSH data structure hâk. Implicitly or explicitly, Phkâ(r) is often misinterpreted as recall rate and used to predict the performance of LSH. This is problematic because Phkâ(r) is actually the expectation of recall rates. Interestingly, numerous empirical studies show that, for most (if not all) real datasets and a fixed sample of random LSH data structure, the recall rate is very close to Phkâ(r). In this paper, we provide a theoretical justification for this phenomenon. We show that (1) for random datasets the recall rate is asymptotically equal to Phkâ(r); (2) for arbitrary datasets the variance of the recall rate is very small as long as the parameter k and â are properly chosen and the size of datasets is large enough. Our analysis (1) explains why the practical performance of LSH (the recall rate) matches so well with the theoretical expectation (Phkâ(r)); and (2) indicates that, in addition to the nice theoretical guarantee, the mechanism by which LSH data structures are constructed and the huge amount of data are also the main causes for the success of LSH in practice.
Related Topics
Physical Sciences and Engineering
Computer Science
Computational Theory and Mathematics
Authors
Kejing Lu, Hongya Wang, Yingyuan Xiao, Hui Song,