Scientific report: "Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition"

Nizar Habash and Ryan M. Roth
Center for Computational Learning Systems, Columbia University
habash ryanr @

Abstract

Arabic handwriting recognition (HR) is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma (i.e., citation form) models, help improve HR-error detection precisely where we expect them to: semantically incoherent error words.

1 Introduction

After years of development, optical character recognition (OCR) for Latin-character languages, such as English, has been refined greatly. Arabic, however, possesses a complex orthography and morphology that makes OCR more difficult (Margner and Abed, 2009; Halima and Alimi, 2009; Magdy and Darwish, 2006). Because of this, only a few systems for Arabic OCR of printed text have been developed, and these have not been thoroughly evaluated (Margner and Abed, 2009).
OCR of Arabic handwritten text (handwriting recognition, or HR), whether online or offline, is even more challenging than printed Arabic OCR, where the uniformity of letter shapes and other factors allow for easier recognition (Biadsy et al., 2006; Natarajan et al., 2008; Saleem et al., 2009). OCR and HR systems are often improved by performing post-processing; these are attempts to evaluate whether each word, phrase or sentence in the OCR/HR output is legal and/or probable. When an illegal word or phrase is discovered (error detection), these systems usually attempt to generate a legal ...
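
To make the post-processing idea above concrete, here is a minimal sketch of the simplest kind of error detection: flagging every output word that is absent from a lexicon, then scoring the flags with a balanced F-score. This is an illustration only and not the method of the paper, which draws on linguistic and non-linguistic features. The function names, the toy lexicon and the toy recognizer output are all hypothetical.

```python
from typing import Iterable, List, Set


def flag_illegal_words(words: Iterable[str], lexicon: Set[str]) -> List[bool]:
    """Return True for each word that is absent from the lexicon (a suspected error)."""
    return [w not in lexicon for w in words]


def f_score(predicted: List[bool], gold: List[bool]) -> float:
    """Balanced F-score (F1) of predicted error flags against gold error flags."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data (assumption, not from the paper).
lexicon = {"the", "cat", "sat", "on", "mat"}
hr_output = ["the", "cqt", "sat", "on", "mat"]
# Gold labels: "cqt" is an illegal word; "mat" is legal but assumed wrong in context.
gold_errors = [False, True, False, False, True]

predicted = flag_illegal_words(hr_output, lexicon)
print(predicted, f_score(predicted, gold_errors))
```

In this toy case the lookup catches the illegal word but misses the word that is legal yet wrong in context, which is one reason a check for "probable" words is useful alongside a check for "legal" ones.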
