کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
431723 | 688618 | 2007 | 11 صفحه PDF | دانلود رایگان |

We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idftf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idftf*idf scores there has been no efficient data structures for arbitrary patterns.Our new data structures support these queries using small space. The space is only 2/ϵ2/ϵ times the size of compressed documents plus 10n bits for a document collection of length n , for any 0<ϵ⩽10<ϵ⩽1. This is much smaller than the previous O(nlogn)O(nlogn) bit data structures. Query time is O(m+qlogϵn)O(m+qlogϵn) for listing and computing tf*idftf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.
Journal: Journal of Discrete Algorithms - Volume 5, Issue 1, March 2007, Pages 12–22