Article ID | Journal | Published Year | Pages |
---|---|---|---|
10341030 | Computers & Electrical Engineering | 2014 | 13 |
Abstract
Widely used Portable Document Format (PDF) documents are layout-oriented and therefore poorly suited to mobile applications. In this paper, a Conditional Random Fields (CRF) based model is proposed to learn the latent semantics of PDF page content. Local and contextual observations constructed from PDF attributes are incorporated to help determine semantic roles. The observations are designed to remain applicable across documents of different styles. A local classifier is first used to generate posterior probabilities; these local estimates are then fed to the CRF model for joint classification. The experimental results confirm the positive effect of contextual information on logical labeling. Our work demonstrates the potential of existing born-digital, fixed-layout documents for use in mobile applications.
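The two-stage idea in the abstract (local posteriors first, joint classification second) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a linear-chain structure over page blocks in reading order, the three-label set in `LABELS`, the hand-set log-scores in `TRANSITION`, and the numbers in `POSTERIORS`. The paper's actual CRF structure, features, and training procedure are not described in the abstract.

```python
import math

# Hypothetical label set; the paper's actual semantic roles are not given here.
LABELS = ["title", "paragraph", "caption"]

# Assumed output of a local classifier: one posterior distribution per block.
POSTERIORS = [
    [0.80, 0.15, 0.05],
    [0.50, 0.45, 0.05],  # locally ambiguous, slightly favours "title"
    [0.10, 0.80, 0.10],
]

# Assumed pairwise log-scores: TRANSITION[i][j] scores label i followed by j.
# Two consecutive titles are penalised; in a real CRF these would be learned.
TRANSITION = [
    [-3.0, 0.0, -2.0],
    [-1.0, 0.0, -1.0],
    [-1.0, 0.0, -3.0],
]


def local_decode(posteriors):
    """Label each block independently by its highest local posterior."""
    return [LABELS[max(range(len(p)), key=lambda k: p[k])] for p in posteriors]


def viterbi_decode(posteriors, transition):
    """Jointly label all blocks by maximising unary plus pairwise log-scores."""
    n = len(LABELS)
    eps = 1e-12  # guards against log(0)
    score = [math.log(p + eps) for p in posteriors[0]]
    backpointers = []
    for probs in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label for current label j.
            best_i = max(range(n), key=lambda i: score[i] + transition[i][j])
            new_score.append(
                score[best_i] + transition[best_i][j] + math.log(probs[j] + eps)
            )
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the best final label.
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return [LABELS[k] for k in path]


print(local_decode(POSTERIORS))                # ['title', 'title', 'paragraph']
print(viterbi_decode(POSTERIORS, TRANSITION))  # ['title', 'paragraph', 'paragraph']
```

In this toy case the second block is labeled "title" when judged in isolation, while the joint decoding relabels it "paragraph" because the assumed transition scores make two consecutive titles unlikely. That is the kind of contextual correction the abstract attributes to the CRF stage.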
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Networks and Communications
Authors
X. Tao, Z. Tang, C. Xu,