A hybrid paragraph-level page segmentation

Journal of Computer Science and Cybernetics, , (2016), 153–167
DOI no.

A HYBRID PARAGRAPH-LEVEL PAGE SEGMENTATION

HA DAI TON1, NGUYEN DUC DUNG2
1 Ha Long Gifted High School, Quang Ninh Province, Viet Nam
2 Institute of Information Technology, Vietnam Academy of Science and Technology
1 hadaiton83@; 2 nddung@

Abstract. Automatic transformation of paper documents into electronic form requires geometric document layout analysis as the first stage. However, variations in character font size, text-line spacing, and layout structure make it difficult to design a general-purpose method. Page segmentation algorithms usually segment text blocks using global separation objects, or local relations among connected components such as distance and orientation, but typically do not consider information beyond the size of local components. As a result, they cannot separate blocks that are very close to each other, including text of different font sizes and paragraphs in the same column. To overcome this limitation, we propose using both separation objects at the whole-page level and context analysis at the text-line level to segment document images into paragraphs. The resulting hybrid paragraph-level page segmentation (HP2S) algorithm can handle difficult cases that purely top-down or purely bottom-up approaches cannot separate.
Experimental results on the ICDAR2009 competition test set and the UW-III dataset show that our algorithm significantly improves performance compared to state-of-the-art algorithms.

Keywords. Page segmentation, text-lines, homogeneous regions, separation objects, paragraphs, evaluation result.

1. INTRODUCTION

Document layout analysis is one of the main components of any OCR (optical character recognition) system. The task of structural analysis includes automatically detecting image zones on a document image (physical structure analysis) and classifying them into .
