Canonical Huffman code based full-text index

Article ID	Journal	Published Year	Pages	File Type
1548891	Progress in Natural Science: Materials International	2008	6 Pages	PDF

Abstract

Full-text indices are data structures that can be used to find any substring of a given string. Many full-text indices require space larger than the original string. In this paper, we introduce the canonical Huffman code to the wavelet tree of a string T[1…n]. Compared with Huffman code based wavelet tree, the memory space used to represent the shape of wavelet tree is not needed. In case of large alphabet, this part of memory is not negligible. The operations of wavelet tree are also simpler and more efficient due to the canonical Huffman code. Based on the resulting structure, the multi-key rank and select functions can be performed using at most nH0 + ∣Σ∣(lg lg n + lg n − lg ∣Σ∣)+O(nH0) bits and in O(H0) time for average cases, where H0 is the zeroth order empirical entropy of T. In the end, we present an efficient construction algorithm for this index, which is on-line and linear.

Keywords

Suffix automaton