Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model

Masaaki NAGATA
NTT Information and Communication Systems Laboratories
1-1 Hikari-no-oka, Yokosuka-shi, Kanagawa 239-0847, Japan
nagata@

Abstract

We present a novel OCR error correction method for languages without word delimiters that have a large character set, such as Japanese and Chinese. It consists of a statistical OCR model, an approximate word matching method using character shape similarity, and a word segmentation algorithm using a statistical language model. By using a statistical OCR model and character shape similarity, the proposed error corrector outperforms the previously published method. When the baseline character recognition accuracy is 90%, it achieves … character recognition accuracy.

1 Introduction

As our society is becoming more computerized, people are getting enthusiastic about entering everything into computers, so the need for OCR in areas such as office automation and information retrieval is growing, contrary to our expectation. In Japanese, although the accuracy of printed-character OCR is about 98%, sources such as old books, poor-quality photocopies, and faxes are still difficult to process and cause many errors. The accuracy of handwritten OCR is still about 90% (Hildebrandt and Liu, 1993), and it worsens dramatically when the input quality is poor. If NLP techniques could be used to boost the accuracy on handwriting and poor-quality documents, we could enjoy a very large market for OCR-related applications.

OCR error correction can be thought of as a spelling correction problem. Although spelling correction has been studied for several decades (Kukich, 1992), the traditional techniques are implicitly based on English and cannot be used for Asian languages such as Japanese and Chinese. The traditional strategy for English spelling correction is called isolated word error correction: word boundaries are indicated by white spaces. If the …
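The abstract describes the corrector as a combination of a statistical OCR model, approximate word matching based on character shape similarity, and word segmentation with a statistical language model. The minimal Python sketch below is only an illustration of how such components can fit together in a noisy-channel style search over an unsegmented OCR output; the vocabulary, probabilities, and shape-confusion pairs are toy values invented for this example and are not taken from the paper.

```python
# Toy illustration: segment and correct an unsegmented OCR string by
# maximizing (word language model) x (character confusion model).
# All numbers and entries below are invented for illustration only.
import math

# Toy word unigram language model, log P(w).
LM = {"東京": math.log(0.4), "京都": math.log(0.3), "都市": math.log(0.3)}

# Toy shape-similarity confusion model, P(observed char | intended char).
CONFUSION = {("束", "東"): 0.2, ("部", "都"): 0.1}

def p_obs(observed: str, intended: str) -> float:
    """Log P(observed | intended): identity is likely; shape-similar
    substitutions get a small probability, everything else a floor."""
    if observed == intended:
        return math.log(0.9)
    return math.log(CONFUSION.get((observed, intended), 1e-6))

def candidates(span: str):
    """Vocabulary words whose length matches the observed span."""
    return [w for w in LM if len(w) == len(span)]

def correct(obs: str):
    """Dynamic-programming search for the best segmentation + correction."""
    n = len(obs)
    best = [(-math.inf, [])] * (n + 1)   # best[i] = (score, words) covering obs[:i]
    best[0] = (0.0, [])
    for i in range(n):
        score_i, words_i = best[i]
        if score_i == -math.inf:
            continue
        for j in range(i + 1, n + 1):
            span = obs[i:j]
            for w in candidates(span):
                s = score_i + LM[w] + sum(p_obs(o, c) for o, c in zip(span, w))
                if s > best[j][0]:
                    best[j] = (s, words_i + [w])
    return best[n][1]

print(correct("束京都市"))  # -> ['東京', '都市'] under these toy numbers
```

Under these toy numbers the misrecognized 束 is corrected to 東 because the word-level language model and the shape-similarity confusion probability jointly favour 東京 over any literal segmentation of the observed characters.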