Article ID Journal Published Year Pages File Type
431723 Journal of Discrete Algorithms 2007 11 Pages PDF
Abstract

We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idftf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idftf*idf scores there has been no efficient data structures for arbitrary patterns.Our new data structures support these queries using small space. The space is only 2/ϵ2/ϵ times the size of compressed documents plus 10n bits for a document collection of length n  , for any 0<ϵ⩽10<ϵ⩽1. This is much smaller than the previous O(nlogn)O(nlogn) bit data structures. Query time is O(m+qlogϵn)O(m+qlogϵn) for listing and computing tf*idftf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
,