Article ID Journal Published Year Pages File Type
6874162 Information Processing Letters 2018 10 Pages PDF
Abstract
In practice, however, users are more interested in the recall rate, i.e., the probability that a random query collides with its r-near neighbor over a fixed LSH data structure h_{k,ℓ}. Implicitly or explicitly, P_{k,ℓ}(r) is often misinterpreted as the recall rate and used to predict the performance of LSH. This is problematic because P_{k,ℓ}(r) is actually the expectation of the recall rate over random LSH data structures. Interestingly, numerous empirical studies show that, for most (if not all) real datasets and a fixed sample of a random LSH data structure, the recall rate is very close to P_{k,ℓ}(r). In this paper, we provide a theoretical justification for this phenomenon. We show that (1) for random datasets the recall rate is asymptotically equal to P_{k,ℓ}(r); and (2) for arbitrary datasets the variance of the recall rate is very small as long as the parameters k and ℓ are properly chosen and the dataset is large enough. Our analysis (1) explains why the practical performance of LSH (the recall rate) matches the theoretical expectation (P_{k,ℓ}(r)) so well; and (2) indicates that, in addition to the nice theoretical guarantee, the mechanism by which LSH data structures are constructed and the huge amount of data are also main causes of the success of LSH in practice.
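As a minimal sketch of the quantity discussed above: under the standard AND-OR LSH construction (ℓ tables, each concatenating k hash functions), if a single hash collides on an r-near pair with probability p, the expected collision probability is P_{k,ℓ} = 1 - (1 - p^k)^ℓ. The helper names below and the Monte Carlo setup are illustrative assumptions, not the paper's code; the simulation averages over freshly drawn hash functions, so it estimates the expectation P_{k,ℓ}, not the recall of one fixed structure.

```python
import random

def collision_prob(p, k, ell):
    # Expected collision probability for an r-near pair under AND-OR LSH:
    # P_{k,l} = 1 - (1 - p^k)^l, where p is the per-hash collision probability.
    return 1.0 - (1.0 - p ** k) ** ell

def simulate_recall(p, k, ell, trials=100_000, seed=0):
    # Monte Carlo over random LSH structures: in each trial a pair collides in
    # a table iff all k component hashes agree (each independently with prob. p),
    # and the pair is retrieved if it collides in at least one of the l tables.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(all(rng.random() < p for _ in range(k)) for _ in range(ell)):
            hits += 1
    return hits / trials
```

For example, with p = 0.5, k = 4, ℓ = 8, the closed-form value is about 0.403, and the simulated average agrees closely, illustrating how the empirical recall concentrates around its expectation.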
Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics