Article ID Journal Published Year Pages File Type
430890 Journal of Discrete Algorithms 2013 11 Pages PDF
Abstract

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n  , current approaches require at least |CSA|+O(nlgDlglgD) or 2|CSA|+o(n)2|CSA|+o(n) bits of space, where CSACSA is a full-text index. Using monotone minimal perfect hash functions (mmphfs), we give new algorithms for document listing with frequencies and top-k   document retrieval using just |CSA|+O(nlglglgD) bits. We also improve current solutions that use 2|CSA|+o(n)2|CSA|+o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies. We give proof-of-concept experimental results that show that using mmphfs may provide relevant practical tradeoffs for document listing with frequencies.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , ,