Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6856323 | Information Sciences | 2018 | 20 Pages |
Abstract
In this paper, we provide an empirical analysis of the compression of open data provided in a relational format, such as comma-separated value (CSV) files. We consider several compression tools and parameter settings. Furthermore, we propose a novel column-wise compression strategy, in which items with similar properties are compressed together. We perform a comprehensive analysis on 24 datasets from different domains, such as life sciences, governmental data, finance, and public transportation, which cover a wide range of file sizes (from a few MB to several GB). Our results show that the traversal strategy is of paramount importance for achieving high compression ratios, with improvements of up to one order of magnitude. This study further highlights a set of issues for future work on compressing open data.
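To illustrate what a traversal strategy means here, the following minimal sketch serializes the same table once row by row and once column by column before handing it to a general-purpose compressor. It is not the authors' implementation: `zlib` is a stand-in for the unnamed compression tools, the function names are hypothetical, the sample table is synthetic, and the code assumes a rectangular table whose cells contain no line breaks.

```python
import csv
import io
import zlib


def _serialize(records):
    """Write a list of records as CSV text with '\n' line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(records)
    return buf.getvalue().encode("utf-8")


def compress_row_wise(rows):
    """Compress the table in its usual row-by-row order."""
    return zlib.compress(_serialize(rows), 9)


def compress_column_wise(rows):
    """Transpose the table so each column's values are adjacent, then compress.

    Assumption: every row has the same number of fields (zip() would
    silently truncate to the shortest row otherwise).
    """
    columns = list(zip(*rows))
    return zlib.compress(_serialize(columns), 9)


def decompress_column_wise(blob):
    """Invert compress_column_wise and return the table as rows of strings."""
    text = zlib.decompress(blob).decode("utf-8")
    columns = list(csv.reader(io.StringIO(text)))
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Synthetic example table (hypothetical data, for illustration only).
    table = [[str(i), "station_%d" % (i % 3), "2018-01-01"] for i in range(1000)]
    print("row-wise bytes:   ", len(compress_row_wise(table)))
    print("column-wise bytes:", len(compress_column_wise(table)))
    assert decompress_column_wise(compress_column_wise(table)) == table
```

Which ordering yields the smaller output depends on the data and the compressor, so the sketch only prints both sizes for comparison and checks that the column-wise variant round-trips.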
Related Topics
Physical Sciences and Engineering
Computer Science
Artificial Intelligence
Authors
Sebastian Wandelt, Xiaoqian Sun, Ulf Leser