Article ID: 535018
Journal: Pattern Recognition Letters
Published Year: 2016
Pages: 7
File Type: PDF
Abstract

• A new hybrid over-split and merge algorithm that simultaneously reduces split and merge errors in document layout analysis.
• An adaptive thresholding method for grouping text lines of variable font size in diverse and complicated document structures.
• A new context analysis approach that overcomes the common failure to separate close text regions of similar font size.
• Decomposition of text regions of any shape into paragraphs.
• The highest scores on the UW-III and ICDAR2009 datasets under different measures.

Page segmentation is a key step in building a document recognition system. Variation in character font size, narrow spacing between text blocks, and complicated layout structure are the main causes of the most common over-segmentation and under-segmentation errors. We propose an adaptive over-split and merge algorithm that reduces both types of error simultaneously. The document image is first over-split into text blocks, in some cases down to individual text lines. These text blocks are then considered for merging into text regions using a new adaptive thresholding method. Local context analysis then uses a set of text line separators to split homogeneous text regions of similar font size, and text blocks that lie close together, into paragraphs. Experiments on the ICDAR2009 and UW-III benchmark datasets show that the proposed algorithm reduces both under-segmentation and over-segmentation errors and significantly improves performance compared with popular page segmentation algorithms.
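
The merge step can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the thresholding rule, so the rule below (merge two vertically adjacent lines when the gap between them is small relative to the local line height and their heights are similar) is an assumption chosen only to show why a threshold that scales with font size behaves differently from a fixed pixel threshold. The names `Line`, `merge_lines`, `gap_factor`, and `size_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A text line reduced to its vertical extent, in pixels (hypothetical type)."""
    top: int
    bottom: int

    @property
    def height(self) -> int:
        return self.bottom - self.top


def merge_lines(lines, gap_factor=0.8, size_ratio=1.3):
    """Greedily group text lines into regions, scanning top to bottom.

    Assumed rule (not taken from the paper): a line joins the current
    region when the gap to the previous line is at most gap_factor times
    the smaller of the two line heights, and the taller line is at most
    size_ratio times the shorter one.
    """
    regions = []
    for line in sorted(lines, key=lambda l: l.top):
        if regions:
            prev = regions[-1][-1]
            gap = line.top - prev.bottom
            local_h = min(line.height, prev.height)
            taller = max(line.height, prev.height)
            if gap <= gap_factor * local_h and taller <= size_ratio * local_h:
                regions[-1].append(line)
                continue
        # Gap too large or font sizes too different: start a new region.
        regions.append([line])
    return regions


# A 40 px title followed by two paragraphs of 10 px body text.
page = [
    Line(0, 40),
    Line(60, 70), Line(74, 84), Line(88, 98),
    Line(130, 140), Line(144, 154),
]
region_sizes = [len(r) for r in merge_lines(page)]

# Two 40 px lines separated by the same 20 px gap that split title from body.
large_font = merge_lines([Line(0, 40), Line(60, 100)])
```

In this sketch the same 20-pixel gap separates the title from the body text but keeps two large-font lines together, because the threshold follows the local line height. A fixed pixel threshold would have to get one of those two cases wrong.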

Related Topics
Physical Sciences and Engineering > Computer Science > Computer Vision and Pattern Recognition