OVER-SPLITTED AND MERGED FOR GEOMETRY DOCUMENT LAYOUT ANALYSIS
DOI: 10.15625/vap.2015.000191
Abstract
Automatic conversion of paper documents into electronic form requires geometry document layout analysis as its first stage. However, variations in character font size, text-line spacing, and layout structure make it difficult to design a general-purpose method, so the use of some parameters has been unavoidable in geometry document layout analysis algorithms. This leads to over-segmentation and under-segmentation errors in previous algorithms. This paper presents a new approach to geometry document layout analysis. Our algorithm uses a set of whitespaces covering the document background to reduce the set of candidate zones, some of which may be over-segmented. A bottom-up procedure then groups the over-segmented zones with one another based on adaptive parameters. Finally, we propose context analysis at the text-line level to segment document images into paragraphs. Experimental results on the ICDAR2009 competition data set show that the proposed algorithm greatly reduces both over-segmentation and under-segmentation errors, and thus improves performance significantly compared with state-of-the-art algorithms.
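The bottom-up grouping step can be illustrated with a minimal sketch. The abstract does not state the actual merging criterion, so everything here is an assumption made for illustration: zones are axis-aligned boxes `(x0, y0, x1, y1)`, the adaptive parameter is taken to be the median text-line height scaled by a hypothetical `gap_factor`, and two zones are merged when they overlap horizontally and their vertical gap does not exceed that threshold.

```python
from statistics import median

# A zone is an axis-aligned box (x0, y0, x1, y1), with y growing downward.

def horizontal_overlap(a, b):
    # Width shared by the two boxes along x; negative when they are disjoint.
    return min(a[2], b[2]) - max(a[0], b[0])

def vertical_gap(a, b):
    # Distance between the boxes along y; zero or negative when they overlap.
    return max(a[1], b[1]) - min(a[3], b[3])

def merge_zones(zones, line_heights, gap_factor=1.0):
    """Greedily merge over-segmented zones, bottom-up.

    Assumption (not from the paper): the merge threshold adapts to the page
    by scaling the median text-line height with gap_factor.
    """
    max_gap = gap_factor * median(line_heights)
    zones = list(zones)
    merged = True
    while merged:
        merged = False
        for i in range(len(zones)):
            for j in range(i + 1, len(zones)):
                a, b = zones[i], zones[j]
                if horizontal_overlap(a, b) > 0 and vertical_gap(a, b) <= max_gap:
                    # Replace the pair with its bounding box and rescan.
                    zones[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del zones[j]
                    merged = True
                    break
            if merged:
                break
    return zones
```

With a median line height of 10, two stacked zones separated by a gap of 5 would be merged, while a zone 75 units further down or a zone in a separate column would be left alone. Deriving the threshold from measured line heights, rather than fixing it, is what makes the parameter adaptive in this sketch.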
Keywords
Geometry document layout analysis, whitespaces covering document background, over-segmented text regions, adaptive parameters, performance evaluation
Copyright (c) 2016 PROCEEDING of Publishing House for Science and Technology