کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
396891 670620 2013 18 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
XML tree structure compression using RePair
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
پیش نمایش صفحه اول مقاله
XML tree structure compression using RePair
چکیده انگلیسی


• A generalization to trees of Larsson and Moffat's RePair algorithm is presented.
• TreeRePair outperforms all previous grammar-based tree compressors.
• TreeRePair produces smaller grammars, runs faster, and is easier to understand.
• TreeRePair produces the smallest queryable memory representation of ordered trees.

XML tree structures can conveniently be represented using ordered unranked trees. Due to the repetitiveness of XML markup these trees can be compressed effectively using dictionary-based methods, such as minimal directed acyclic graphs (DAGs) or straight-line context-free (SLCF) tree grammars. While minimal SLCF tree grammars are in general smaller than minimal DAGs, they cannot be computed in polynomial time unless P=NPP=NP. Here, we present a new linear time algorithm for computing small SLCF tree grammars, called TreeRePair, and show that it greatly outperforms the best known previous algorithm BPLEX. TreeRePair is a generalization to trees of Larsson and Moffat's RePair string compression algorithm. SLCF tree grammars can be used as efficient memory representations of trees. Using TreeRePair, we are able to produce the smallest queryable memory representation of ordered trees that we are aware of. Our investigations over a large corpus of commonly used XML documents show that tree traversals over TreeRePair grammars are 14 times slower than over pointer structures and 5 times slower than over succinct trees, while memory consumption is only 1/43 and 1/6, respectively. With respect to file compression we are able to show that a Huffman-based coding of TreeRePair grammars gives compression ratios comparable to the best known XML file compressors.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Information Systems - Volume 38, Issue 8, November 2013, Pages 1150–1167
نویسندگان
, , ,