Article ID Journal Published Year Pages File Type
430895 Journal of Discrete Algorithms 2013 12 Pages PDF
Abstract

Signature file is a well-studied method in information retrieval for indexing large text databases. Because of the small index size in this method, it is a good candidate for environments where memory is scarce. This small index size, however, comes at the cost of high false positive error rate. In this paper we address the problem of high false positive error rate of signature files by introducing COCA filters, a new variation of Bloom filters which exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that by using this technique in real document collections we can reduce the false positive error by up to 21 times, for the same index size. It is also shown that in some extreme cases this technique is even able to completely eliminate the false positive error. COCA filters can be considered as a good replacement for Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , , ,