OVER-SPLITTED AND MERGED FOR GEOMETRY DOCUMENT LAYOUT ANALYSIS

Ha Dai Ton, Nguyen Duc Dung, Le Duc Hieu



DOI: 10.15625/vap.2015.000191

Abstract


Automatic transformation of paper documents into electronic form requires geometry document layout analysis as the first stage. However, variations in character font sizes, text-line spacing, and layout structures make it difficult to design a general-purpose method. The use of some parameters has therefore been unavoidable in geometry document layout analysis algorithms, which leads to over-segmentation and under-segmentation errors in previous algorithms. This paper presents a new approach to geometry document layout analysis. Our algorithm uses a set of whitespaces covering the document background to obtain candidate zones, some of which may be over-segmented. A bottom-up procedure then groups the over-segmented zones with one another based on adaptive parameters. Finally, we propose context analysis at the text-line level to segment document images into paragraphs. Experimental results on the ICDAR2009 competition data set show that the proposed algorithm greatly reduces both over-segmentation and under-segmentation errors, and thus significantly improves performance compared with state-of-the-art algorithms.
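The bottom-up grouping step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the merge criterion or how the parameters adapt, so the sketch assumes that zones are axis-aligned boxes roughly one text line tall, that two zones belong together when they overlap horizontally and their vertical gap is small, and that the gap threshold is derived from the median zone height on the page. The names `Box`, `merge_zones`, and `gap_factor` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Box:
    """Axis-aligned zone; (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def union(a: Box, b: Box) -> Box:
    """Smallest box containing both a and b."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes (0 if they overlap vertically)."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def horizontal_overlap(a: Box, b: Box) -> float:
    """Length of the shared horizontal extent (0 if none)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def merge_zones(zones: list[Box], gap_factor: float = 1.0) -> list[Box]:
    """Greedily merge zones that are stacked closely in the same column.

    Assumption: the threshold adapts to the page by scaling the median
    height of the input zones, instead of using a fixed pixel value.
    """
    if not zones:
        return []
    threshold = gap_factor * median(z.height for z in zones)
    zones = list(zones)
    merged = True
    while merged:  # repeat until no pair can be merged
        merged = False
        for i in range(len(zones)):
            for j in range(i + 1, len(zones)):
                a, b = zones[i], zones[j]
                if horizontal_overlap(a, b) > 0 and vertical_gap(a, b) <= threshold:
                    zones[i] = union(a, b)
                    del zones[j]
                    merged = True
                    break
            if merged:
                break
    return zones


# Three closely stacked lines, one zone in another column, one distant zone.
page = [
    Box(0, 0, 100, 10),
    Box(0, 14, 100, 24),
    Box(0, 28, 100, 38),
    Box(200, 0, 300, 10),
    Box(0, 100, 100, 110),
]
result = merge_zones(page)
```

In this example the three stacked lines merge into one zone, while the zone in the other column and the distant zone stay separate. Deriving the threshold from the page itself is one simple way to avoid a fixed pixel value; the paper's actual adaptive parameters may differ.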

Keywords


Geometry document layout analysis, whitespaces covering document background, over-segmented text regions, adaptive parameters, performance evaluation



Copyright (c) 2016 PROCEEDING of Publishing House for Science and Technology


