Article code | Journal code | Publication year | English article | Full-text version
515911 | 867139 | 2011 | 15-page PDF | Free download
English title of the ISI article
Improving semistatic compression via phrase-based modeling
Related topics
Engineering and Basic Sciences > Computer Engineering > Computer Science Software
English abstract

In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30–35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet.

In this work we go one step beyond by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28–29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer–Moore algorithms.
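
To make the coding side of the abstract concrete, the following is a minimal sketch of a word-based semistatic compressor using End-Tagged Dense Code as it is commonly described: symbols are ranked by frequency in a first pass, and each rank is written as a byte sequence whose last byte has its high bit set. It covers plain words only; the pair and phrase selection performed by the PETDC and PhETDC modelers is not described in the abstract and is not reproduced here. All function names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def etdc_encode_rank(rank):
    """Return the codeword (bytes) for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte carries the end tag (value >= 128)
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)    # continuation bytes stay below 128
        rank = rank // 128 - 1
    return bytes(reversed(out))

def etdc_decode_rank(codeword):
    """Inverse of etdc_encode_rank for a single complete codeword."""
    rank = 0
    for b in codeword[:-1]:
        rank = (rank + b + 1) * 128
    return rank + codeword[-1] - 128

def compress(words):
    """Two-pass (semistatic) compression of a list of word tokens."""
    # Pass 1: build the vocabulary, most frequent symbol first.
    vocab = [w for w, _ in Counter(words).most_common()]
    rank_of = {w: r for r, w in enumerate(vocab)}
    # Pass 2: replace every token by the codeword of its rank.
    data = b"".join(etdc_encode_rank(rank_of[w]) for w in words)
    return vocab, data

def decompress(vocab, data):
    """Split the byte stream at end-tagged bytes and map ranks back to words."""
    words, start = [], 0
    for i, b in enumerate(data):
        if b >= 128:              # end tag: a codeword finishes here
            words.append(vocab[etdc_decode_rank(data[start:i + 1])])
            start = i + 1
    return words
```

Because the vocabulary is sorted by frequency, the 128 most frequent symbols get one-byte codewords and the next 128 × 128 get two-byte codewords. A phrase-based modeler would, under this sketch, simply add multi-word entries to `vocab` before the second pass.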

Research highlights
► We present new semistatic pair-based and phrase-based modelers.
► Our new modelers are coupled with Dense Coding to obtain two new compressors.
► The compressors are called: Pair-based and Phrase-based End-Tagged Dense Code.
► Main features: good compression ratio (23–28%) and fast decompression.
► Additional features: direct searches and random decompression are possible.
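
The direct-search feature listed above can be illustrated with a short sketch. It assumes an end-tagged byte code in which only the last byte of each codeword is 128 or larger, and it uses Python's built-in `bytes.find` as a stand-in for the Boyer–Moore-type scan mentioned in the abstract. A candidate match is accepted only when the byte before it is itself an end-tagged byte (or the match starts the stream), which rules out hits that are merely the tail of a longer codeword. The function names are hypothetical.

```python
def etdc_codeword(rank):
    """End-tagged dense codeword (bytes) for a 0-based frequency rank."""
    out = [128 + rank % 128]      # final byte has the high bit set
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)    # all earlier bytes are below 128
        rank = rank // 128 - 1
    return bytes(reversed(out))

def find_symbol(compressed, codeword):
    """Return the byte offsets where codeword occurs as a whole symbol."""
    hits = []
    pos = compressed.find(codeword)
    while pos != -1:
        # Valid only if the previous byte closes another codeword.
        if pos == 0 or compressed[pos - 1] >= 128:
            hits.append(pos)
        pos = compressed.find(codeword, pos + 1)
    return hits

# Ranks 128, 0 and 5 encoded back to back: bytes 0,128 | 128 | 133.
stream = etdc_codeword(128) + etdc_codeword(0) + etdc_codeword(5)
```

In `stream`, the one-byte codeword for rank 0 also appears as the second byte of the codeword for rank 128; the check on the preceding byte discards that occurrence, so no decompression is needed to confirm a match.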

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 47, Issue 4, July 2011, Pages 545–559
Authors
, , , ,