Visualizing CCITT Group 3 and Group 4 TIFF Documents and Transforming to Run-Length Compressed Format Enabling Direct Processing in Compressed Domain

Article ID	Journal	Published Year	Pages	File Type
488463	Procedia Computer Science	2016	9 Pages	PDF

Abstract

Data compression can go a long way toward overcoming the Big Data problem, particularly its storage and transmission costs. For this reason, documents, images, audio and video are preferably archived and communicated in compressed form. However, any subsequent operation on the compressed data requires decompression, which consumes additional computing resources. Developing techniques that operate on and analyze the contents of compressed data directly, without a decompression stage, is therefore a promising research problem. In the Document Image Analysis (DIA) literature, some recent works have reported direct processing of run-length compressed document data, specifically targeting CCITT Group 3 1-D documents. Since run-length data is the backbone of the more advanced CCITT compression schemes, namely CCITT Group 3 2-D (T.4) and CCITT Group 4 2-D (T.6), which are widely supported by the TIFF and PDF formats, this paper proposes to intelligently generate the run-length data from T.4 and T.6 compressed data, and thus extend the idea of direct document processing in the Run-Length Compressed Domain (RLCD). The run-length data generated by the proposed algorithm is experimentally validated, and 100% correlation is reported on a data set of compressed documents. Finally, text segmentation and word spotting applications in RLCD are demonstrated.
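
To make the notion of run-length data concrete, the following minimal Python sketch represents one binary scanline as alternating white and black run lengths and computes a simple statistic directly on the runs. It is an illustration only, not the algorithm proposed in the paper: it starts from uncompressed pixels rather than from T.4 or T.6 code words, and the function names `row_to_runs`, `runs_to_row` and `black_pixel_count` are hypothetical. It assumes pixels are given as 0 (white) and 1 (black), and follows the CCITT convention that every line begins with a white run, which may have length zero.

```python
from typing import List


def row_to_runs(row: List[int]) -> List[int]:
    """Encode a binary scanline (0 = white, 1 = black) as alternating run lengths.

    The first run is always white; a row that starts with a black pixel
    therefore begins with a white run of length 0.
    """
    runs: List[int] = []
    current = 0  # colour of the run being counted; every row starts with white
    length = 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)  # close the current run
            current = pixel
            length = 1
    runs.append(length)  # close the final run
    return runs


def runs_to_row(runs: List[int]) -> List[int]:
    """Expand alternating white/black run lengths back into pixels."""
    row: List[int] = []
    colour = 0  # runs alternate white, black, white, ...
    for length in runs:
        row.extend([colour] * length)
        colour = 1 - colour
    return row


def black_pixel_count(runs: List[int]) -> int:
    """Count black pixels from the runs alone, without expanding them."""
    return sum(runs[1::2])  # odd positions hold the black runs


row = [0, 0, 1, 1, 1, 0, 1]
runs = row_to_runs(row)
print(runs)
print(black_pixel_count(runs))
```

The last function hints at why the representation is attractive: a quantity such as the number of black pixels in a line is available from the runs themselves, so the pixel row never has to be reconstructed.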