Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
431723 | Journal of Discrete Algorithms | 2007 | 11 Pages |
We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idftf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idftf*idf scores there has been no efficient data structures for arbitrary patterns.Our new data structures support these queries using small space. The space is only 2/ϵ2/ϵ times the size of compressed documents plus 10n bits for a document collection of length n , for any 0<ϵ⩽10<ϵ⩽1. This is much smaller than the previous O(nlogn)O(nlogn) bit data structures. Query time is O(m+qlogϵn)O(m+qlogϵn) for listing and computing tf*idftf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.