Scientific report: "Maximum Entropy Based Restoration of Arabic Diacritics"

Imed Zitouni, Jeffrey S. Sorensen, Ruhi Sarikaya
IBM T.J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY 10598
{izitouni, sorenj, sarikaya}@

Abstract

Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Scripts without diacritics have considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. We propose in this paper a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of …, a segment error rate of …, and a word error rate of …. In a case-ending-less setting, we obtain a diacritic error rate of …, a segment error rate of …, and a word error rate of ….

1 Introduction

Modern Arabic written texts are composed of scripts without short vowels and other diacritic marks.
This often leads to considerable ambiguity, since several words that have different diacritic patterns may appear identical in a diacritic-less setting. Educated modern Arabic speakers are able to accurately restore diacritics in a document, based on the context and their knowledge of the grammar and the lexicon of Arabic. However, a text without diacritics becomes a source of confusion for beginning readers and people with learning disabilities. A text without diacritics is also problematic for applications such as text-to-speech or speech-to-text, where the lack of diacritics adds another layer of ambiguity when processing the data. As an example .
