An adaptive over-split and merge algorithm for page segmentation

Article ID	Journal	Published Year	Pages	File Type
535018	Pattern Recognition Letters	2016	7 Pages	PDF

Abstract

• A new hybrid over-split and merge algorithm that simultaneously reduces split and merge errors in document layout analysis.
• An adaptive thresholding method for grouping text lines of variable font size in diverse and complicated document structures.
• A new context-analysis approach to overcome the common failure to separate close text regions of similar font size.
• Decomposition of text regions of any shape into paragraphs.
• The highest scores on the UW-III and ICDAR2009 datasets under different measures.

Page segmentation is a key step in building a document recognition system. Variation in character font size, narrow spacing between text blocks, and complicated page structure are the main causes of the most common over-segmentation and under-segmentation errors. We propose an adaptive over-split and merge algorithm that reduces both types of error simultaneously. The document image is first over-split into text blocks, or even individual text lines. These text blocks are then considered for merging into text regions using a new adaptive thresholding method. Local context analysis then uses a set of text-line separators to split homogeneous text regions, made up of closely spaced text blocks of similar font size, into paragraphs. Experiments on the ICDAR2009 and UW-III benchmark datasets show that the proposed algorithm reduces both under-segmentation and over-segmentation errors and significantly improves performance compared with popular page segmentation algorithms.
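The over-split and merge idea can be illustrated with a minimal sketch. This is not the paper's adaptive thresholding method, which the abstract does not specify. It assumes text lines are already available as bounding boxes and merges two vertically adjacent lines when their heights are similar and the gap between them is small relative to the smaller line height. The names `Box`, `should_merge`, `merge_lines`, `gap_factor`, and `size_ratio`, along with their default values, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one text line (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def horizontal_overlap(a: Box, b: Box) -> float:
    """Length of the shared horizontal extent of two boxes (0 if disjoint)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def should_merge(a: Box, b: Box,
                 gap_factor: float = 0.8,
                 size_ratio: float = 1.3) -> bool:
    """Decide whether two text lines belong to the same region.

    The gap threshold scales with the smaller line height, so it adapts
    to the local font size instead of being a fixed pixel value.
    Both parameters are illustrative assumptions, not values from the paper.
    """
    upper, lower = (a, b) if a.y0 <= b.y0 else (b, a)
    gap = lower.y0 - upper.y1
    small, large = sorted((a.height, b.height))
    if small <= 0:
        return False
    similar_size = large / small <= size_ratio
    threshold = gap_factor * small
    return similar_size and horizontal_overlap(a, b) > 0 and gap <= threshold


def merge_lines(lines: List[Box]) -> List[List[Box]]:
    """Greedily group over-split text lines into regions, top to bottom."""
    regions: List[List[Box]] = []
    for line in sorted(lines, key=lambda b: (b.y0, b.x0)):
        for region in regions:
            # Compare against the lowest line already in the region.
            if should_merge(region[-1], line):
                region.append(line)
                break
        else:
            regions.append([line])
    return regions
```

With a large heading followed by two groups of small body lines separated by a wide gap, this sketch keeps the heading apart because of the font-size difference and splits the body into two regions because the gap exceeds the size-relative threshold. A fuller implementation would also need the separator-based context analysis that the abstract describes for closely spaced regions.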

Keywords

OCR; Document analysis and recognition; Page segmentation